Possible SQLite bug using an uncollated index with a collated query

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Possible SQLite bug using an uncollated index with a collated query

Jens Alfke-2
I believe I may have found a bug in SQLite 3.18. I've got a query that returns a correct result set when there are no indexes in the database, but returns an incorrect result set if recompiled after an index has been added. The incorrect result set has no collation applied, so the problem seems to be that when SQLite uses the index to optimize the query, it forgets to apply collation when grouping and ordering the results, so the results are in uncollated order as they appeared in the index.

Here's a slightly simplified* version of the query. This just returns a sorted list of all the unique artist names (the database is generated from an iTunes library, with each row containing a JSON object representing a track.) The "LCUnicode_CD_" custom collation compares UTF-8-encoded Unicode strings ignoring case and diacritics; it's been tested enough that I trust it.

        SELECT json_extract(body, '$.Artist')
        FROM kv_default
        WHERE json_extract body, '$.Artist') IS NOT NULL AND json_extract body, '$.Compilation') IS NULL
        GROUP BY json_extract(body, '$.Artist') COLLATE LCUnicode_CD_
        ORDER BY json_extract(body, '$.Artist') COLLATE LCUnicode_CD_;

With no index in the database, this query returns the correct results. Artists are sorted ignoring case and diacritical marks, and artist names that have inconsistent case or diacritics on different tracks (like "The B-52s" vs "The b-52s", or "Björk" vs "Bjork") only appear once.

Now I add an index:

        CREATE INDEX byArtist
        ON kv_default (json_extract(body, '$.Compilation'), json_extract(body, '$.Artist'))

After creating the index, recompiling and running the query now produces incorrect results: the collation is ignored, so the sort order is case-sensitive, accented letters appear after Z, and duplicate artist names with different capitalization/accents appear.

Before the index is added, EXPLAIN QUERY PLAN gives:
        0|0|0| SCAN TABLE kv_default
        0|0|0| USE TEMP B-TREE FOR GROUP BY

After the index is added, it gives:
        0|0|0| SEARCH TABLE kv_default USING INDEX byArtist (<expr>=?)

It's forgotten that it needs to sort/group the rows coming from the index, because the index isn't collated. :(

I haven't yet tried to create a minimal reproduction, but I'm guessing I could reproduce this without using JSON functions or custom collations.

—Jens

* I've omitted some irrelevant result columns, and substituted json_extract for the actual custom function I use which has similar behavior (extracting a value out of an encoded blob.)
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Possible SQLite bug using an uncollated index with a collated query

Dan Kennedy-4
On 08/18/2017 05:39 AM, Jens Alfke wrote:
> I believe I may have found a bug in SQLite 3.18. I've got a query that returns a correct result set when there are no indexes in the database, but returns an incorrect result set if recompiled after an index has been added. The incorrect result set has no collation applied, so the problem seems to be that when SQLite uses the index to optimize the query, it forgets to apply collation when grouping and ordering the results, so the results are in uncollated order as they appeared in the index.

Thanks for reporting this. Bug ticket and fix now here:

   http://www.sqlite.org/src/info/e20dd54ab0e43838
   http://www.sqlite.org/src/info/37e1900880b70be6

Dan.



>
> Here's a slightly simplified* version of the query. This just returns a sorted list of all the unique artist names (the database is generated from an iTunes library, with each row containing a JSON object representing a track.) The "LCUnicode_CD_" custom collation compares UTF-8-encoded Unicode strings ignoring case and diacritics; it's been tested enough that I trust it.
>
> SELECT json_extract(body, '$.Artist')
> FROM kv_default
> WHERE json_extract body, '$.Artist') IS NOT NULL AND json_extract body, '$.Compilation') IS NULL
> GROUP BY json_extract(body, '$.Artist') COLLATE LCUnicode_CD_
> ORDER BY json_extract(body, '$.Artist') COLLATE LCUnicode_CD_;
>
> With no index in the database, this query returns the correct results. Artists are sorted ignoring case and diacritical marks, and artist names that have inconsistent case or diacritics on different tracks (like "The B-52s" vs "The b-52s", or "Björk" vs "Bjork") only appear once.
>
> Now I add an index:
>
> CREATE INDEX byArtist
> ON kv_default (json_extract(body, '$.Compilation'), json_extract(body, '$.Artist'))
>
> After creating the index, recompiling and running the query now produces incorrect results: the collation is ignored, so the sort order is case-sensitive, accented letters appear after Z, and duplicate artist names with different capitalization/accents appear.
>
> Before the index is added, EXPLAIN QUERY PLAN gives:
> 0|0|0| SCAN TABLE kv_default
> 0|0|0| USE TEMP B-TREE FOR GROUP BY
>
> After the index is added, it gives:
> 0|0|0| SEARCH TABLE kv_default USING INDEX byArtist (<expr>=?)
>
> It's forgotten that it needs to sort/group the rows coming from the index, because the index isn't collated. :(
>
> I haven't yet tried to create a minimal reproduction, but I'm guessing I could reproduce this without using JSON functions or custom collations.
>
> —Jens
>
> * I've omitted some irrelevant result columns, and substituted json_extract for the actual custom function I use which has similar behavior (extracting a value out of an encoded blob.)
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Possible SQLite bug using an uncollated index with a collated query

Jens Alfke-2

> On Aug 18, 2017, at 11:29 PM, Dan Kennedy <[hidden email]> wrote:
>
> On 08/18/2017 05:39 AM, Jens Alfke wrote:
>> I believe I may have found a bug in SQLite 3.18. I've got a query that returns a correct result set when there are no indexes in the database, but returns an incorrect result set if recompiled after an index has been added. The incorrect result set has no collation applied, so the problem seems to be that when SQLite uses the index to optimize the query, it forgets to apply collation when grouping and ordering the results, so the results are in uncollated order as they appeared in the index.
>
> Thanks for reporting this. Bug ticket and fix now here:
>
>  http://www.sqlite.org/src/info/e20dd54ab0e43838 <http://www.sqlite.org/src/info/e20dd54ab0e43838>
>  http://www.sqlite.org/src/info/37e1900880b70be6 <http://www.sqlite.org/src/info/37e1900880b70be6>
W00t! I feel proud to have discovered an actual bug in SQLite. Are there nerd merit badges available for this? ;-)
(And thanks for the quick fix!)

—Jens

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users