SQLite character comparisons

classic Classic list List threaded Threaded
26 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: SQLite character comparisons

John Elrick-2
Fowler, Jeff wrote:
> Hello All,
>  
> Not trying to be antagonistic, but I'm curious to know how many of you agree with Darren's sentiments on this issue. To restate briefly, ANSI SQL-92 specifies that when comparing two character fields, trailing spaces should be ignored.

My $0.02.

If we are ignoring trailing spaces, then by definition:

' ' = ''

and for that matter:

'          ' = '  '

whereas

'xxxxxxxxxx' <> 'xxx'

Therefore, from the standpoint of arbitrary text,

A may or may not actually be identical to B when A = B

I don't consider that type of result to be predicable and therefore
would respectfully state that the standard is a potential source of bugs.

What could be useful is a function:

SELECT * FROM FOO WHERE TRIM(BAR) = TRIM(:bar)

or even a shortcut function along the lines of:

SELECT * FROM FOO WHERE TEXT_EQUIVALENT(BAR, :bar)

However, in my opinion, the operator '=' is better reserved as defining
strict equality.


John

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: SQLite character comparisons

Dennis Cote
In reply to this post by Fowler, Jeff
Fowler, Jeff wrote:
> Hello All,
>  
> Not trying to be antagonistic, but I'm curious to know how many of you agree with Darren's sentiments on this issue. To restate briefly, ANSI SQL-92 specifies that when comparing two character fields, trailing spaces should be ignored. Correct me if I'm wrong Darren, but you feel this is a bad decision, and in fact SQLite's implementation of character comparison (respecting trailing spaces) is superior to ANSI's specs. Keep in mind this is not some obscure issue that can be subject to different interpretations by different vendors; it's very clearly stated: "The ANSI standard requires padding for the character strings used in comparisons so that their lengths match before comparing them."
>  
Jeff,

I think you are mistaken about what the ANSI spec says.

There are two string types in ANSI SQL, character strings (which come is
several subtypes), and binary strings. The following excerpts are taken
from the SQL:1999 spec.

Section 4.2.1 Character Strings and Collations describes the operations
on character strings. It describes comparisons as
> Given a collating sequence, two character strings are identical if and
> only if they are equal in
> accordance with the comparison rules specified in Subclause 8.2,
> ‘‘<comparison predicate>’’. The
> collating sequence used for a particular comparison is determined as
> in Subclause 4.2.3, ‘‘Rules
> determining collating sequence usage’’.
Binary strings are defined in Section 4.3 as;
> A binary string is a sequence of octets that does not have either a
> character set or collation associated
> with it.
And their comparison is detailed in 4.3.1 as;
> All binary strings are mutually comparable. A binary string is
> identical to another binary string
> if and only if it is equal to that binary string in accordance with
> the comparison rules specified in
> Subclause 8.2, ‘‘<comparison predicate>’’.

General Rules 3 and 4 of section 8.2 <comparison predicate> describe the
comparison of these strings. I have copied these sections below.

> 3) The comparison of two character strings is determined as follows:
>
> a) Let CS be the collating sequence indicated in Subclause 4.2.3,
> ‘‘Rules determining collating
> sequence usage’’, based on the declared types of the two character
> strings.
>
> b) If the length in characters of X is not equal to the length in
> characters of Y, then the shorter
> string is effectively replaced, for the purposes of comparison, with a
> copy of itself that has
> been extended to the length of the longer string by concatenation on
> the right of one or more
> pad characters, where the pad character is chosen based on CS. If CS
> has the NO PAD
> characteristic, then the pad character is an implementation-dependent
> character different
> from any character in the character set of X and Y that collates less
> than any string under
> CS. Otherwise, the pad character is a <space>.
>
> c) The result of the comparison of X and Y is given by the collating
> sequence CS.
>
> d) Depending on the collating sequence, two strings may compare as
> equal even if they are
> of different lengths or contain different sequences of characters.
> When any of the operations
> MAX, MIN, and DISTINCT reference a grouping column, and the UNION, EXCEPT,
> and INTERSECT operators refer to character strings, the specific value
> selected by these
> operations from a set of such equal values is implementation-dependent.
>
> NOTE 129 – If the coercibility characteristic of the comparison is
> Coercible, then the collating sequence
> used is the default defined for the character repertoire. See also
> other Syntax Rules in this Subclause,
> Subclause 10.6, ‘‘<character set specification>’’, and Subclause
> 11.30, ‘‘<character set definition>’’.
>
> 4) The comparison of two binary string values, X and Y, is determined
> by comparison of their
> octets with the same ordinal position. If Xi and Yi are the values of
> the i-th octets of X and Y,
> respectively, and if Lx is the length in octets of X AND Ly is the
> length in octets of Y, then X is
> equal to Y if and only if Ly = Ly and if Xi = Yi for all i.
I note that there is a typo in rule 4 for binary strings; Ly = Ly should
be Lx = Ly, since binary strings can only be compared for equality.

Rule 3.b details how strings of unequal length are to be compared. It
allows exactly the operation performed by SQLite, since it allows
collating sequences to have a NO PAD characteristic which results in the
shorter string comparing less than the longer string.

This distinction also appears in section 4.12 which discusses type
conversions and mixing of data types. It says;

> Values corresponding to the data types CHARACTER, CHARACTER VARYING,
> and CHARACTER
> LARGE OBJECT are mutually assignable if and only if they are taken
> from the same character
> repertoire. If they are from different character repertoires, then the
> value of the source of the
> assignment must be translated to the character repertoire of the
> target before an assignment is
> possible. Such translation may be implementation-defined and
> implicitly performed, in which case
> the two character data types are also mutually assignable. If a store
> assignment would result
> in the loss of non-<space> characters due to truncation, then an
> exception condition is raised. If
> a retrieval assignment would result in the loss of characters due to
> truncation, then a warning
> condition is raised. The values are mutually comparable only if they
> are mutually assignable
> and can be coerced to have the same collation. The comparison of two
> character strings depends
> on the collating sequence used for the comparison (see Table 3,
> ‘‘Collating sequence usage for
> comparisons’’). When values of unequal length are compared, if the
> collating sequence for the
> comparison has the NO PAD characteristic and the shorter value is
> equal to a prefix of the longer
> value, then the shorter value is considered less than the longer
> value. If the collating sequence for
> the comparison has the PAD SPACE characteristic, for the purposes of
> the comparison, the shorter
> value is effectively extended to the length of the longer by
> concatenation of <space>s on the right.
>
> Values corresponding to the binary data type are mutually assignable.
> If a store assignment would
> result in the loss of non-zero octets due to truncation, then an
> exception condition is raised. If a
> retrieval assignment would result in the loss of octets due to
> truncation, then a warning condition is
> raised. When binary string values are compared, they must have exactly
> the same length (in octets)
> to be considered equal. Binary string values can only be compared for
> equality.
>
Which again explains that a collating sequence can have a NO PAD
property which prevents padding the shorter string for comparison, and
that binary strings can only be compared for equality.

The only place in the standard that I can find any explicit mention of
removing spaces is the description of casting a string to a numeric
value. In this case the leading and trailing spaces are to be removed
from the string before it is converted.

So, while the standard does allow the operation you describe (actually
it does the opposite, it pads the shorter string with spaces, instead of
removing trailing spaces from the longer string), it also allows the
operation SQLite performs. It is simply the case that all of SQLite's
collations have the NO PAD characteristic.

HTH
Dennis Cote



-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: SQLite character comparisons

D. Richard Hipp
In reply to this post by John Elrick-2
John Elrick <[hidden email]> wrote:
>
> If we are ignoring trailing spaces, then by definition:
>
> ' ' = ''
>
> and for that matter:
>
> '          ' = '  '
>

Good point.  I added these as test cases for the new RTRIM
collation.  http://www.sqlite.org/cvstrac/chngview?cn=4735

--
D. Richard Hipp <[hidden email]>


-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

RE: SQLite character comparisons

Fowler, Jeff
In reply to this post by Dennis Cote
Hi Dennis,

I may very well be mistaken - wouldn't be the first time!! I tried to
access the actual specification from ANSI (www.ansi.org), but you either
have to purchase them or access one of the "drafts" (which a couple
folks in this thread have done), but the draft may or may not match the
actual standard. While Dr. Hipp's new RTRIM collation makes this a moot
point, the source for my statement came from this article:
http://support.microsoft.com/kb/316626


As I'd mentioned, we have both SQL Server 2005 and Oracle 10g and they
default to ignore trailing spaces on character compares, although it may
be possible to turn this off for those who don't like it. I believe (but
am not sure) that MYSql and PostGreSQL ignore them by default also. I
understand many of the arguments folks have against this and there's not
much point to continuing the discussion; I won't change their mind and
they won't change mine! But I guess that's why we have Republicans and
Democrats :-)

But all is well - the RTRIM collation option is a neat and simple
solution. My thanks to Dr. Hipp for that, and also to everyone who has
shown an interest in this issue.

- Jeff


-----Original Message-----
From: Dennis Cote [mailto:[hidden email]]
Sent: Monday, January 21, 2008 11:46 AM
To: [hidden email]
Subject: Re: [sqlite] SQLite character comparisons

Fowler, Jeff wrote:
> Hello All,
>  
> Not trying to be antagonistic, but I'm curious to know how many of you
agree with Darren's sentiments on this issue. To restate briefly, ANSI
SQL-92 specifies that when comparing two character fields, trailing
spaces should be ignored. Correct me if I'm wrong Darren, but you feel
this is a bad decision, and in fact SQLite's implementation of character
comparison (respecting trailing spaces) is superior to ANSI's specs.
Keep in mind this is not some obscure issue that can be subject to
different interpretations by different vendors; it's very clearly
stated: "The ANSI standard requires padding for the character strings
used in comparisons so that their lengths match before comparing them."
>  
Jeff,

I think you are mistaken about what the ANSI spec says.

There are two string types in ANSI SQL, character strings (which come is
several subtypes), and binary strings. The following excerpts are taken
from the SQL:1999 spec.

Section 4.2.1 Character Strings and Collations describes the operations
on character strings. It describes comparisons as
> Given a collating sequence, two character strings are identical if and

> only if they are equal in accordance with the comparison rules
> specified in Subclause 8.2, ''<comparison predicate>''. The collating
> sequence used for a particular comparison is determined as in
> Subclause 4.2.3, ''Rules determining collating sequence usage''.
Binary strings are defined in Section 4.3 as;
> A binary string is a sequence of octets that does not have either a
> character set or collation associated with it.
And their comparison is detailed in 4.3.1 as;
> All binary strings are mutually comparable. A binary string is
> identical to another binary string if and only if it is equal to that
> binary string in accordance with the comparison rules specified in
> Subclause 8.2, ''<comparison predicate>''.

General Rules 3 and 4 of section 8.2 <comparison predicate> describe the
comparison of these strings. I have copied these sections below.

> 3) The comparison of two character strings is determined as follows:
>
> a) Let CS be the collating sequence indicated in Subclause 4.2.3,
> ''Rules determining collating sequence usage'', based on the declared
> types of the two character strings.
>
> b) If the length in characters of X is not equal to the length in
> characters of Y, then the shorter string is effectively replaced, for
> the purposes of comparison, with a copy of itself that has been
> extended to the length of the longer string by concatenation on the
> right of one or more pad characters, where the pad character is chosen

> based on CS. If CS has the NO PAD characteristic, then the pad
> character is an implementation-dependent character different from any
> character in the character set of X and Y that collates less than any
> string under CS. Otherwise, the pad character is a <space>.
>
> c) The result of the comparison of X and Y is given by the collating
> sequence CS.
>
> d) Depending on the collating sequence, two strings may compare as
> equal even if they are of different lengths or contain different
> sequences of characters.
> When any of the operations
> MAX, MIN, and DISTINCT reference a grouping column, and the UNION,
> EXCEPT, and INTERSECT operators refer to character strings, the
> specific value selected by these operations from a set of such equal
> values is implementation-dependent.
>
> NOTE 129 - If the coercibility characteristic of the comparison is
> Coercible, then the collating sequence used is the default defined for

> the character repertoire. See also other Syntax Rules in this
> Subclause, Subclause 10.6, ''<character set specification>'', and
> Subclause 11.30, ''<character set definition>''.
>
> 4) The comparison of two binary string values, X and Y, is determined
> by comparison of their octets with the same ordinal position. If Xi
> and Yi are the values of the i-th octets of X and Y, respectively, and

> if Lx is the length in octets of X AND Ly is the length in octets of
> Y, then X is equal to Y if and only if Ly = Ly and if Xi = Yi for all
> i.
I note that there is a typo in rule 4 for binary strings; Ly = Ly should
be Lx = Ly, since binary strings can only be compared for equality.

Rule 3.b details how strings of unequal length are to be compared. It
allows exactly the operation performed by SQLite, since it allows
collating sequences to have a NO PAD characteristic which results in the
shorter string comparing less than the longer string.

This distinction also appears in section 4.12 which discusses type
conversions and mixing of data types. It says;
> Values corresponding to the data types CHARACTER, CHARACTER VARYING,
> and CHARACTER LARGE OBJECT are mutually assignable if and only if they

> are taken from the same character repertoire. If they are from
> different character repertoires, then the value of the source of the
> assignment must be translated to the character repertoire of the
> target before an assignment is possible. Such translation may be
> implementation-defined and implicitly performed, in which case the two

> character data types are also mutually assignable. If a store
> assignment would result in the loss of non-<space> characters due to
> truncation, then an exception condition is raised. If a retrieval
> assignment would result in the loss of characters due to truncation,
> then a warning condition is raised. The values are mutually comparable

> only if they are mutually assignable and can be coerced to have the
> same collation. The comparison of two character strings depends on the

> collating sequence used for the comparison (see Table 3, ''Collating
> sequence usage for comparisons''). When values of unequal length are
> compared, if the collating sequence for the comparison has the NO PAD
> characteristic and the shorter value is equal to a prefix of the
> longer value, then the shorter value is considered less than the
> longer value. If the collating sequence for the comparison has the PAD

> SPACE characteristic, for the purposes of the comparison, the shorter
> value is effectively extended to the length of the longer by
> concatenation of <space>s on the right.
>
> Values corresponding to the binary data type are mutually assignable.
> If a store assignment would
> result in the loss of non-zero octets due to truncation, then an
> exception condition is raised. If a retrieval assignment would result
> in the loss of octets due to truncation, then a warning condition is
> raised. When binary string values are compared, they must have exactly

> the same length (in octets) to be considered equal. Binary string
> values can only be compared for equality.
>
Which again explains that a collating sequence can have a NO PAD
property which prevents padding the shorter string for comparison, and
that binary strings can only be compared for equality.

The only place in the standard that I can find any explicit mention of
removing spaces is the description of casting a string to a numeric
value. In this case the leading and trailing spaces are to be removed
from the string before it is converted.

So, while the standard does allow the operation you describe (actually
it does the opposite, it pads the shorter string with spaces, instead of
removing trailing spaces from the longer string), it also allows the
operation SQLite performs. It is simply the case that all of SQLite's
collations have the NO PAD characteristic.

HTH
Dennis Cote



------------------------------------------------------------------------
-----
To unsubscribe, send email to [hidden email]
------------------------------------------------------------------------
-----


-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: SQLite character comparisons

Dennis Cote
Fowler, Jeff wrote:
> I may very well be mistaken - wouldn't be the first time!! I tried to
> access the actual specification from ANSI (www.ansi.org), but you either
> have to purchase them or access one of the "drafts" (which a couple
> folks in this thread have done), but the draft may or may not match the
> actual standard. While Dr. Hipp's new RTRIM collation makes this a moot
> point, the source for my statement came from this article:
> http://support.microsoft.com/kb/316626
>
>  
After reading the MS article, I dug into the standard some more to see
if I had missed something.

It seems there is a requirement that all "standard character set names"
define a default collation that has the PAD SPACE property (see section
10.6 general rule 2).

> 2) A <standard character set name> specifies the name of a character
> set that is defined by a
> national or international standard. The character repertoire of CS,
> implied by the <standard
> character set name>, are defined by the standard defining the
> character set identified by that
> <standard character set name>. The default collating sequence of the
> character set is defined
> by the order of the characters in the standard and has the PAD SPACE
> characteristic.
There is a similar definition for an "implementation defined character
set name" in general rule 3.
> 3) An <implementation-defined character set name> specifies the name
> of a character set that
> is implementation-defined. The character repertoire of CS, implied by
> the <implementationdefined
> character set name>, are implementation-defined. The default collating
> sequence of the
> character set and whether the collating sequence has the NO PAD
> characteristic or the PAD
> SPACE characteristic is implementation-defined.
But syntax rule 1 in that section seems to say that what is supported is
implementation defined.
> 1) The <standard character set name>s and <implementation-defined
> character set name>s that
> are supported are implementation-defined.
These are character sets, and predefined collations, can be used to
create additional collations as described in section 11.32
> <collation definition> ::=
>     CREATE COLLATION <collation name> FOR <character set specification>
>     FROM <existing collation name>
>     [ <pad characteristic> ]
> <existing collation name> ::= <collation name>
> <pad characteristic> ::=
>     NO PAD
>     | PAD SPACE
If the pad characteristic is not given explicitly, it is inherited
implicitly from the existing collation name.

So if an implementation provides a standard character set, such as
SQL_TEXT, it must also have a default collation with the PAD SPACE
property that is the default collation for strings in that character
set. The user can create a new collation that explicitly overrides the
pad characteristic and use that for NO PAD operation.

But ultimately there is no requirement to provide any standard character
sets. So all bets are off when switching implementations (as Darren has
suggested).

Dennis Cote


-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: SQLite character comparisons

Stephen Oberholtzer
I feel compelled to throw in my $0.02 here.

To everyone who thinks that SQLite should allow  'foo '  == 'foo':

SQL was originally conceived as a query *language* -- a way for a
human being to request a set of data from a database.  It was
specifically designed for ad-hoc queries.

This little 'magic space trimming' feature exists to match the
'char(N)' data type.  A char(10) field is always exactly 10 characters
long; longer strings are truncated and shorter strings are
space-padded.
Most database engines are more efficient at these, because when all
rows are the same width, the task of finding a particular row reduces
to a simple array lookup; therefore, if performance is
a critical issue (and when SQL was first formed, CPUs weren't quite as
powerful as they are now.)

But this presents a problem: the 'usual' definition of equality would
mean that any comparisons to a char(N) field would need to be N
characters long, or they would always fail.  Since it's stupid to make
people count spaces, somebody came up with the solution 'if they enter
something shorter, pad it with spaces and then compare.'

(If anyone wishes to quote the spec regarding space-extension and
varchar(N) fields, first recall that the SQL specifications have been
created by committees.)

Since SQLite does not have any concept of a fixed-width field of
character data, the whole concept of ignoring/appending trailing
spaces doesn't even apply.

--
-- Stevie-O
Real programmers use COPY CON PROGRAM.EXE

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

12