GROUP BY and ICU collation

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

GROUP BY and ICU collation

Michael Herger
>> Hi there,
>>
>> I'm trying to create a list with an index list. Eg. I have artists:
>>
>> Sting
>> Šuma Čovjek
>> Suzanne Vega
>>
>> That's the sort order I'd get using an ICU collation. "Šuma Čovjek"
>> would be sorted as "Suma..." as expected.
>>
>> Now I'd like to create an index bar by providing groups of the first
>> character:
>>
>> SELECT SUBSTR(name,1,1), count(distinct id) FROM artists GROUP BY
>> SUBSTR(name,1,1) ORDER BY name COLLATE de_DE
>
> Aren't you missing a COLLATE clause after the GROUP BY term?
>
>      ... GROUP BY SUBSTR(name,1,1) COLLATE de_DE ...

TBH: I didn't even know about this. I thought the COLLATE at the end of
the statement would do it for all.

Alas, tried again to no avail. No matter whether I add it after the
GROUP BY or not, the result is the same.

I should probably have added some version information: I'm using the
Perl DBD::SQLite 1.58 (sqlite 3.22.0). I didn't see any mention of
related changes in the changelog for SQLite. What would be the easiest
(and most reliable) way to try to reproduce this without Perl? Is there
a HowTo use collations with the CLI sqlite?

--

Michael
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: GROUP BY and ICU collation

Simon Slavin-3
On 7 Feb 2019, at 12:12pm, [hidden email] wrote:

> TBH: I didn't even know about this. I thought the COLLATE at the end of the statement would do it for all.

If you don't want to have to keep specifying the COLLATE, put it in the table definition:

CREATE TABLE ... ( ... artist TEXT COLLATE de_DE ...)

Then it will be used for every operation involving that column, unless it collides with another specified COLLATE.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: GROUP BY and ICU collation

Michael Herger
>> TBH: I didn't even know about this. I thought the COLLATE at the end of the statement would do it for all.
>
> If you don't want to have to keep specifying the COLLATE, put it in the table definition:
>
> CREATE TABLE ... ( ... artist TEXT COLLATE de_DE ...)

Thanks for the hint. But the application is localized to the user's
language. Therefore the collation can have different values.

--

Michael
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: GROUP BY and ICU collation

Simon Slavin-3
On 7 Feb 2019, at 5:02pm, [hidden email] wrote:

>> If you don't want to have to keep specifying the COLLATE, put it in the table definition:
>> CREATE TABLE ... ( ... artist TEXT COLLATE de_DE ...)
>
> Thanks for the hint. But the application is localized to the user's language. Therefore the collation can have different values.

This is currently a problem with SQLite.  You can't specify a collation as a string or a variable.  In other words, neither of the following work:

  *** CREATE TABLE ... ( ... artist TEXT COLLATE 'de_DE' ...) ***

  *** SELECT a FROM t ORDER BY a COLLATE 'de_DE' ***

Also, if you use ALTER TABLE RENAME COLUMN other references to that column in the schema get changed accordingly.  This makes it impossible to change a column's COLLATE easily.

So you end up making your CREATE TABLE command in software once you know the preferred collation for that particular installation.  The advantage of doing this is that it only needs to be done once, at installation or first run, and nothing else in your code needs to worry about which collation to use.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: GROUP BY and ICU collation

Keith Medcalf
In reply to this post by Michael Herger


COLLATE affects SORTING, it does not transmorgify the "value" of the thing to which it is applied.  That is, name COLLATE xxxx means that the item name is SORTED using the collating sequence xxxx, not that the result of "name COLLATE xxxx" is transmorgified into tha value that is used for sorting.  

That is

select name collate nocase, count(distinct id) from x group by name collate nocase order by name collate nocase

whill produce cased output not the value that was used for the sorting.


select lower(name collate nocase), count(distinct id) from x group by name collate nocase order by name collate nocase;

to transmorgificate name into a "caseless" representation.  So you would need to do something like this:

select de_DE(substr(name collate de_DE,1,1)), count(distinct id) from artists
group by substr(name collate de_DE,1,1)
order by by substr(name collate de_DE,1,1)

and the function de_DE would have to transmorgificate its value to the result you want to see.

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>[hidden email]] On Behalf Of [hidden email]
>Sent: Thursday, 7 February, 2019 05:12
>To: [hidden email]
>Subject: [sqlite] GROUP BY and ICU collation
>
>>> Hi there,
>>>
>>> I'm trying to create a list with an index list. Eg. I have
>artists:
>>>
>>> Sting
>>> Šuma Čovjek
>>> Suzanne Vega
>>>
>>> That's the sort order I'd get using an ICU collation. "Šuma
>Čovjek"
>>> would be sorted as "Suma..." as expected.
>>>
>>> Now I'd like to create an index bar by providing groups of the
>first
>>> character:
>>>
>>> SELECT SUBSTR(name,1,1), count(distinct id) FROM artists GROUP BY
>>> SUBSTR(name,1,1) ORDER BY name COLLATE de_DE>>
>> Aren't you missing a COLLATE clause after the GROUP BY term?
>>
>>      ... GROUP BY SUBSTR(name,1,1) COLLATE de_DE ...
>
>TBH: I didn't even know about this. I thought the COLLATE at the end
>of
>the statement would do it for all.
>
>Alas, tried again to no avail. No matter whether I add it after the
>GROUP BY or not, the result is the same.
>
>I should probably have added some version information: I'm using the
>Perl DBD::SQLite 1.58 (sqlite 3.22.0). I didn't see any mention of
>related changes in the changelog for SQLite. What would be the
>easiest
>(and most reliable) way to try to reproduce this without Perl? Is
>there
>a HowTo use collations with the CLI sqlite?
>
>--
>
>Michael
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: GROUP BY and ICU collation

Michael Herger
> COLLATE affects SORTING, it does not transmorgify the "value" of the thing to which it is applied.  That is, name COLLATE xxxx means that the item name is SORTED using the collating sequence xxxx, not that the result of "name COLLATE xxxx" is transmorgified into tha value that is used for sorting.

Understood. But wouldn't a GROUP BY sort the data internally in order to
be able to group records? Or would you not at least expect it to follow
the same rules grouping as when sorting? In sorting it seems to consider
Š "the same" as S, but it doesn't in grouping. I'm not too concerned
about the representation.

Michael


>
> That is
>
> select name collate nocase, count(distinct id) from x group by name collate nocase order by name collate nocase
>
> whill produce cased output not the value that was used for the sorting.
>
>
> select lower(name collate nocase), count(distinct id) from x group by name collate nocase order by name collate nocase;
>
> to transmorgificate name into a "caseless" representation.  So you would need to do something like this:
>
> select de_DE(substr(name collate de_DE,1,1)), count(distinct id) from artists
> group by substr(name collate de_DE,1,1)
> order by by substr(name collate de_DE,1,1)
>
> and the function de_DE would have to transmorgificate its value to the result you want to see.
>
> ---
> The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.
>
>> -----Original Message-----
>> From: sqlite-users [mailto:sqlite-users-
>> [hidden email]] On Behalf Of [hidden email]
>> Sent: Thursday, 7 February, 2019 05:12
>> To: [hidden email]
>> Subject: [sqlite] GROUP BY and ICU collation
>>
>>>> Hi there,
>>>>
>>>> I'm trying to create a list with an index list. Eg. I have
>> artists:
>>>>
>>>> Sting
>>>> Šuma Čovjek
>>>> Suzanne Vega
>>>>
>>>> That's the sort order I'd get using an ICU collation. "Šuma
>> Čovjek"
>>>> would be sorted as "Suma..." as expected.
>>>>
>>>> Now I'd like to create an index bar by providing groups of the
>> first
>>>> character:
>>>>
>>>> SELECT SUBSTR(name,1,1), count(distinct id) FROM artists GROUP BY
>>>> SUBSTR(name,1,1) ORDER BY name COLLATE de_DE>>
>>> Aren't you missing a COLLATE clause after the GROUP BY term?
>>>
>>>       ... GROUP BY SUBSTR(name,1,1) COLLATE de_DE ...
>>
>> TBH: I didn't even know about this. I thought the COLLATE at the end
>> of
>> the statement would do it for all.
>>
>> Alas, tried again to no avail. No matter whether I add it after the
>> GROUP BY or not, the result is the same.
>>
>> I should probably have added some version information: I'm using the
>> Perl DBD::SQLite 1.58 (sqlite 3.22.0). I didn't see any mention of
>> related changes in the changelog for SQLite. What would be the
>> easiest
>> (and most reliable) way to try to reproduce this without Perl? Is
>> there
>> a HowTo use collations with the CLI sqlite?
>>
>> --
>>
>> Michael
>> _______________________________________________
>> sqlite-users mailing list
>> [hidden email]
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
>
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: GROUP BY and ICU collation

Simon Slavin-3
On 8 Feb 2019, at 5:47am, [hidden email] wrote:

> In sorting it seems to consider Š "the same" as S, but it doesn't in grouping.

That may be a bug.  Can you construct a small test database, where your data features, say, R, S, T, and Š, and show how ORDER BY behaves differently to GROUP BY ?

Please be clear which version of SQLite you're using for your demo.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: GROUP BY and ICU collation

Keith Medcalf
In reply to this post by Michael Herger

On Thursday, 7 February, 2019 22:47, [hidden email] wrote:

>> COLLATE affects SORTING, it does not transmorgify the "value" of
>the thing to which it is applied.  That is, name COLLATE xxxx means
>that the item name is SORTED using the collating sequence xxxx, not
>that the result of "name COLLATE xxxx" is transmorgified into tha
>value that is used for sorting.

>Understood. But wouldn't a GROUP BY sort the data internally in order
>to be able to group records? Or would you not at least expect it to
>follow the same rules grouping as when sorting? In sorting it seems to
>consider Š "the same" as S, but it doesn't in grouping. I'm not too concerned
>about the representation.

But you are not using the same "expression" for selecting, sorting, and grouping.  That is, you need to specify:

  SELECT expression, count(distinct id)
    FROM artists
GROUP BY expression
ORDER BY expression;

where expression is the expression that you want to use

  SELECT substr(name collate de_DE, 1, 1), count(distinct id)
    FROM artists
GROUP BY substr(name collate de_DE, 1, 1)
ORDER BY substr(name collate de_DE, 1, 1);

If you do not do so then you cannot expect the "GROUP BY" to be using the same ordering as the "ORDER BY" nor the select to be returning the same value that was used to do the grouping and sorting.

(This substitutes the collation NOCASE for the de_DE since I no have a de_DE collation:

sqlite> select substr(name collate nocase, 1, 1), count(distinct id)
   ...>   FROM artists
   ...> group by substr(name collate nocase, 1, 1)
   ...> order by substr(name collate nocase, 1, 1);
QUERY PLAN
|--SCAN TABLE artists (~1048576 rows)
`--USE TEMP B-TREE FOR GROUP BY
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     57    0                    00  Start at 57
1     Noop           1     4     0                    00
2     SorterOpen     3     3     0     k(1,NOCASE)    00
3     Integer        0     5     0                    00  r[5]=0; clear abort flag
4     Null           0     8     8                    00  r[8..8]=NULL
5     Gosub          7     52    0                    00
6     OpenRead       0     3     0     2              00  root=3 iDb=0; artists
7     ColumnsUsed    0     0     0     3              00
8     Explain        8     0     0     SCAN TABLE artists (~1048576 rows)  00
9     Noop           0     0     0                    00  Begin WHERE-loop0: artists
10    Rewind         0     20    0                    00
11      Noop           0     0     0                    00  Begin WHERE-core
12      Column         0     1     13                   00  r[13]=artists.name
13      Function0      6     13    10    substr(3)      03  r[10]=func(r[13..15])
14      Column         0     1     11                   00  r[11]=artists.name
15      Column         0     0     12                   00  r[12]=artists.id
16      MakeRecord     10    3     16                   00  r[16]=mkrec(r[10..12])
17      SorterInsert   3     16    0                    00  key=r[16]
18      Noop           0     0     0                    00  End WHERE-core
19    Next           0     11    0                    01
20    Noop           0     0     0                    00  End WHERE-loop0: artists
21    OpenPseudo     4     16    3                    00  3 columns in r[16]
22    SorterSort     3     56    0                    00  GROUP BY sort
23      SorterData     3     16    4                    00  r[16]=data
24      Column         4     0     9                    00  r[9]=
25      Compare        8     9     1     k(1,NOCASE)    00  r[8] <-> r[9]
26      Jump           27    31    27                   00
27      Move           9     8     1                    00  r[8]=r[9]
28      Gosub          6     44    0                    00  output one row
29      IfPos          5     56    0                    00  if r[5]>0 then r[5]-=0, goto 56; check abort flag
30      Gosub          7     52    0                    00  reset accumulator
31      Column         4     2     17                   00  r[17]=
32      Found          2     36    17    1              00  key=r[17]
33      MakeRecord     17    1     18                   00  r[18]=mkrec(r[17])
34      IdxInsert      2     18    17    1              10  key=r[18]
35      AggStep        0     17    2     count(1)       01  accum=r[2] step(r[17])
36      If             4     38    0                    00
37      Column         4     1     1                    00  r[1]=
38      Integer        1     4     0                    00  r[4]=1; indicate data in accumulator
39    SorterNext     3     23    0                    00
40    Gosub          6     44    0                    00  output final row
41    Goto           0     56    0                    00
42    Integer        1     5     0                    00  r[5]=1; set abort flag
43    Return         6     0     0                    00
44    IfPos          4     46    0                    00  if r[4]>0 then r[4]-=0, goto 46; Groupby result generator entry point
45    Return         6     0     0                    00
46    AggFinal       2     1     0     count(1)       00  accum=r[2] N=1
47    Copy           1     21    0                    00  r[21]=r[1]
48    Function0      6     21    19    substr(3)      03  r[19]=func(r[21..23])
49    Copy           2     20    0                    00  r[20]=r[2]
50    ResultRow      19    2     0                    00  output=r[19..20]
51    Return         6     0     0                    00  end groupby result generator
52    Null           0     1     3                    00  r[1..3]=NULL
53    OpenEphemeral  2     0     0     k(1,B)         00  nColumn=0
54    Integer        0     4     0                    00  r[4]=0; indicate accumulator empty
55    Return         7     0     0                    00
56    Halt           0     0     0                    00
57    Transaction    0     0     2     0              01  usesStmtJournal=0
58    Integer        1     14    0                    00  r[14]=1
59    Integer        1     15    0                    00  r[15]=1
60    Integer        1     22    0                    00  r[22]=1
61    Integer        1     23    0                    00  r[23]=1
62    Goto           0     1     0                    00
sqlite>

>
>Michael
>
>
>>
>> That is
>>
>> select name collate nocase, count(distinct id) from x group by name
>collate nocase order by name collate nocase
>>
>> whill produce cased output not the value that was used for the
>sorting.
>>
>>
>> select lower(name collate nocase), count(distinct id) from x group
>by name collate nocase order by name collate nocase;
>>
>> to transmorgificate name into a "caseless" representation.  So you
>would need to do something like this:
>>
>> select de_DE(substr(name collate de_DE,1,1)), count(distinct id)
>from artists
>> group by substr(name collate de_DE,1,1)
>> order by by substr(name collate de_DE,1,1)
>>
>> and the function de_DE would have to transmorgificate its value to
>the result you want to see.
>>
>> ---
>> The fact that there's a Highway to Hell but only a Stairway to
>Heaven says a lot about anticipated traffic volume.
>>
>>> -----Original Message-----
>>> From: sqlite-users [mailto:sqlite-users-
>>> [hidden email]] On Behalf Of [hidden email]
>>> Sent: Thursday, 7 February, 2019 05:12
>>> To: [hidden email]
>>> Subject: [sqlite] GROUP BY and ICU collation
>>>
>>>>> Hi there,
>>>>>
>>>>> I'm trying to create a list with an index list. Eg. I have
>>> artists:
>>>>>
>>>>> Sting
>>>>> Šuma Čovjek
>>>>> Suzanne Vega
>>>>>
>>>>> That's the sort order I'd get using an ICU collation. "Šuma
>>> Čovjek"
>>>>> would be sorted as "Suma..." as expected.
>>>>>
>>>>> Now I'd like to create an index bar by providing groups of the
>>> first
>>>>> character:
>>>>>
>>>>> SELECT SUBSTR(name,1,1), count(distinct id) FROM artists GROUP
>BY
>>>>> SUBSTR(name,1,1) ORDER BY name COLLATE de_DE>>
>>>> Aren't you missing a COLLATE clause after the GROUP BY term?
>>>>
>>>>       ... GROUP BY SUBSTR(name,1,1) COLLATE de_DE ...
>>>
>>> TBH: I didn't even know about this. I thought the COLLATE at the
>end
>>> of
>>> the statement would do it for all.
>>>
>>> Alas, tried again to no avail. No matter whether I add it after
>the
>>> GROUP BY or not, the result is the same.
>>>
>>> I should probably have added some version information: I'm using
>the
>>> Perl DBD::SQLite 1.58 (sqlite 3.22.0). I didn't see any mention of
>>> related changes in the changelog for SQLite. What would be the
>>> easiest
>>> (and most reliable) way to try to reproduce this without Perl? Is
>>> there
>>> a HowTo use collations with the CLI sqlite?
>>>
>>> --
>>>
>>> Michael

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.




_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: GROUP BY and ICU collation

Michael Herger
Hi Keith,

thanks for your response (which partly goes beyond my understanding, but
I'm learning :-)).

> But you are not using the same "expression" for selecting, sorting, and grouping.  That is, you need to specify:
>
>    SELECT expression, count(distinct id)
>      FROM artists
> GROUP BY expression
> ORDER BY expression;
>
> where expression is the expression that you want to use
>
>    SELECT substr(name collate de_DE, 1, 1), count(distinct id)
>      FROM artists
> GROUP BY substr(name collate de_DE, 1, 1)
> ORDER BY substr(name collate de_DE, 1, 1);
>
> If you do not do so then you cannot expect the "GROUP BY" to be using the same ordering as the "ORDER BY" nor the select to be returning the same value that was used to do the grouping and sorting.

Ok, tried that:

SELECT SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)
FROM contributors
GROUP BY SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)
ORDER BY SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)

Resulted in:

"A"
"C"
"D"
"P"
"R"
"S"
"Š"
"T"
"W"

Whereas this:

SELECT contributors.name, contributors.namesort COLLATE de_DE
FROM contributors
ORDER BY contributors.namesort COLLATE de_DE

...resulted in this:

...
   ["Roddy Woomble & Band", "RODDY WOOMBLE BAND"],
   ["Samuel Yirga", "SAMUEL YIRGA"],
   ["Stephin Merritt", "STEPHIN MERRITT"],
   ["Šuma Čovjek", "ŠUMA ČOVJEK"],
   ["Syriana", "SYRIANA"],
   ["Tom Griesgraber", "TOM GRIESGRABER"],
...

So despite my using the same expression for the GROUP as for the ORDER,
the grouping seems to ignore the collation, whereas the sorting alone
would not.

I'm using the ICU extension (otherwise the sorting wouldn't work either,
right?). Could that extension be responsible for this odd behaviour?

I've put a small sample .db file in my dropbox:
https://www.dropbox.com/s/w4h6n3cyamt6yc0/library.db?dl=0


>
> (This substitutes the collation NOCASE for the de_DE since I no have a de_DE collation:
>
> sqlite> select substr(name collate nocase, 1, 1), count(distinct id)
>     ...>   FROM artists
>     ...> group by substr(name collate nocase, 1, 1)
>     ...> order by substr(name collate nocase, 1, 1);
> QUERY PLAN
> |--SCAN TABLE artists (~1048576 rows)
> `--USE TEMP B-TREE FOR GROUP BY
> addr  opcode         p1    p2    p3    p4             p5  comment
> ----  -------------  ----  ----  ----  -------------  --  -------------
> 0     Init           0     57    0                    00  Start at 57
> 1     Noop           1     4     0                    00
> 2     SorterOpen     3     3     0     k(1,NOCASE)    00
> 3     Integer        0     5     0                    00  r[5]=0; clear abort flag
> 4     Null           0     8     8                    00  r[8..8]=NULL
> 5     Gosub          7     52    0                    00
> 6     OpenRead       0     3     0     2              00  root=3 iDb=0; artists
> 7     ColumnsUsed    0     0     0     3              00
> 8     Explain        8     0     0     SCAN TABLE artists (~1048576 rows)  00
> 9     Noop           0     0     0                    00  Begin WHERE-loop0: artists
> 10    Rewind         0     20    0                    00
> 11      Noop           0     0     0                    00  Begin WHERE-core
> 12      Column         0     1     13                   00  r[13]=artists.name
> 13      Function0      6     13    10    substr(3)      03  r[10]=func(r[13..15])
> 14      Column         0     1     11                   00  r[11]=artists.name
> 15      Column         0     0     12                   00  r[12]=artists.id
> 16      MakeRecord     10    3     16                   00  r[16]=mkrec(r[10..12])
> 17      SorterInsert   3     16    0                    00  key=r[16]
> 18      Noop           0     0     0                    00  End WHERE-core
> 19    Next           0     11    0                    01
> 20    Noop           0     0     0                    00  End WHERE-loop0: artists
> 21    OpenPseudo     4     16    3                    00  3 columns in r[16]
> 22    SorterSort     3     56    0                    00  GROUP BY sort
> 23      SorterData     3     16    4                    00  r[16]=data
> 24      Column         4     0     9                    00  r[9]=
> 25      Compare        8     9     1     k(1,NOCASE)    00  r[8] <-> r[9]
> 26      Jump           27    31    27                   00
> 27      Move           9     8     1                    00  r[8]=r[9]
> 28      Gosub          6     44    0                    00  output one row
> 29      IfPos          5     56    0                    00  if r[5]>0 then r[5]-=0, goto 56; check abort flag
> 30      Gosub          7     52    0                    00  reset accumulator
> 31      Column         4     2     17                   00  r[17]=
> 32      Found          2     36    17    1              00  key=r[17]
> 33      MakeRecord     17    1     18                   00  r[18]=mkrec(r[17])
> 34      IdxInsert      2     18    17    1              10  key=r[18]
> 35      AggStep        0     17    2     count(1)       01  accum=r[2] step(r[17])
> 36      If             4     38    0                    00
> 37      Column         4     1     1                    00  r[1]=
> 38      Integer        1     4     0                    00  r[4]=1; indicate data in accumulator
> 39    SorterNext     3     23    0                    00
> 40    Gosub          6     44    0                    00  output final row
> 41    Goto           0     56    0                    00
> 42    Integer        1     5     0                    00  r[5]=1; set abort flag
> 43    Return         6     0     0                    00
> 44    IfPos          4     46    0                    00  if r[4]>0 then r[4]-=0, goto 46; Groupby result generator entry point
> 45    Return         6     0     0                    00
> 46    AggFinal       2     1     0     count(1)       00  accum=r[2] N=1
> 47    Copy           1     21    0                    00  r[21]=r[1]
> 48    Function0      6     21    19    substr(3)      03  r[19]=func(r[21..23])
> 49    Copy           2     20    0                    00  r[20]=r[2]
> 50    ResultRow      19    2     0                    00  output=r[19..20]
> 51    Return         6     0     0                    00  end groupby result generator
> 52    Null           0     1     3                    00  r[1..3]=NULL
> 53    OpenEphemeral  2     0     0     k(1,B)         00  nColumn=0
> 54    Integer        0     4     0                    00  r[4]=0; indicate accumulator empty
> 55    Return         7     0     0                    00
> 56    Halt           0     0     0                    00
> 57    Transaction    0     0     2     0              01  usesStmtJournal=0
> 58    Integer        1     14    0                    00  r[14]=1
> 59    Integer        1     15    0                    00  r[15]=1
> 60    Integer        1     22    0                    00  r[22]=1
> 61    Integer        1     23    0                    00  r[23]=1
> 62    Goto           0     1     0                    00
> sqlite>
>
>>
>> Michael
>>
>>
>>>
>>> That is
>>>
>>> select name collate nocase, count(distinct id) from x group by name
>> collate nocase order by name collate nocase
>>>
>>> whill produce cased output not the value that was used for the
>> sorting.
>>>
>>>
>>> select lower(name collate nocase), count(distinct id) from x group
>> by name collate nocase order by name collate nocase;
>>>
>>> to transmorgificate name into a "caseless" representation.  So you
>> would need to do something like this:
>>>
>>> select de_DE(substr(name collate de_DE,1,1)), count(distinct id)
>>from artists
>>> group by substr(name collate de_DE,1,1)
>>> order by by substr(name collate de_DE,1,1)
>>>
>>> and the function de_DE would have to transmorgificate its value to
>> the result you want to see.
>>>
>>> ---
>>> The fact that there's a Highway to Hell but only a Stairway to
>> Heaven says a lot about anticipated traffic volume.
>>>
>>>> -----Original Message-----
>>>> From: sqlite-users [mailto:sqlite-users-
>>>> [hidden email]] On Behalf Of [hidden email]
>>>> Sent: Thursday, 7 February, 2019 05:12
>>>> To: [hidden email]
>>>> Subject: [sqlite] GROUP BY and ICU collation
>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I'm trying to create a list with an index list. Eg. I have
>>>> artists:
>>>>>>
>>>>>> Sting
>>>>>> Šuma Čovjek
>>>>>> Suzanne Vega
>>>>>>
>>>>>> That's the sort order I'd get using an ICU collation. "Šuma
>>>> Čovjek"
>>>>>> would be sorted as "Suma..." as expected.
>>>>>>
>>>>>> Now I'd like to create an index bar by providing groups of the
>>>> first
>>>>>> character:
>>>>>>
>>>>>> SELECT SUBSTR(name,1,1), count(distinct id) FROM artists GROUP
>> BY
>>>>>> SUBSTR(name,1,1) ORDER BY name COLLATE de_DE>>
>>>>> Aren't you missing a COLLATE clause after the GROUP BY term?
>>>>>
>>>>>        ... GROUP BY SUBSTR(name,1,1) COLLATE de_DE ...
>>>>
>>>> TBH: I didn't even know about this. I thought the COLLATE at the
>> end
>>>> of
>>>> the statement would do it for all.
>>>>
>>>> Alas, tried again to no avail. No matter whether I add it after
>> the
>>>> GROUP BY or not, the result is the same.
>>>>
>>>> I should probably have added some version information: I'm using
>> the
>>>> Perl DBD::SQLite 1.58 (sqlite 3.22.0). I didn't see any mention of
>>>> related changes in the changelog for SQLite. What would be the
>>>> easiest
>>>> (and most reliable) way to try to reproduce this without Perl? Is
>>>> there
>>>> a HowTo use collations with the CLI sqlite?
>>>>
>>>> --
>>>>
>>>> Michael
>
> ---
> The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.
>
>
>
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: GROUP BY and ICU collation

David Raymond
Remember that fancy collations don't just look at 1 character at a time, they look at the whole thing, and can do surprising stuff based on that. In this case the order of preference for the collation looks like "when it's part of a larger word, then treating 'S' and 'Š' the same is more important than separating them. But when it's just 1 character then they're different.

So 'S' might be before 'Š', but in words it might go

'Sam'
'Šam'
'Skunk'
'Škunk'
'Sudden'
'Šudden'

rather than a simple character-at-a-time order of

'Sam'
'Skunk'
'Sudden'
'Šam'
'Škunk'
'Šudden'


-----Original Message-----
From: sqlite-users [mailto:[hidden email]] On Behalf Of Michael Herger
Sent: Friday, February 08, 2019 4:13 AM
To: [hidden email]
Subject: Re: [sqlite] GROUP BY and ICU collation

Hi Keith,

thanks for your response (which partly goes beyond my understanding, but
I'm learning :-)).

> But you are not using the same "expression" for selecting, sorting, and grouping.  That is, you need to specify:
>
>    SELECT expression, count(distinct id)
>      FROM artists
> GROUP BY expression
> ORDER BY expression;
>
> where expression is the expression that you want to use
>
>    SELECT substr(name collate de_DE, 1, 1), count(distinct id)
>      FROM artists
> GROUP BY substr(name collate de_DE, 1, 1)
> ORDER BY substr(name collate de_DE, 1, 1);
>
> If you do not do so then you cannot expect the "GROUP BY" to be using the same ordering as the "ORDER BY" nor the select to be returning the same value that was used to do the grouping and sorting.

Ok, tried that:

SELECT SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)
FROM contributors
GROUP BY SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)
ORDER BY SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)

Resulted in:

"A"
"C"
"D"
"P"
"R"
"S"
"Š"
"T"
"W"

Whereas this:

SELECT contributors.name, contributors.namesort COLLATE de_DE
FROM contributors
ORDER BY contributors.namesort COLLATE de_DE

...resulted in this:

...
   ["Roddy Woomble & Band", "RODDY WOOMBLE BAND"],
   ["Samuel Yirga", "SAMUEL YIRGA"],
   ["Stephin Merritt", "STEPHIN MERRITT"],
   ["Šuma Čovjek", "ŠUMA ČOVJEK"],
   ["Syriana", "SYRIANA"],
   ["Tom Griesgraber", "TOM GRIESGRABER"],
...

So despite my using the same expression for the GROUP as for the ORDER,
the grouping seems to ignore the collation, whereas the sorting alone
would not.

I'm using the ICU extension (otherwise the sorting wouldn't work either,
right?). Could that extension be responsible for this odd behaviour?

I've put a small sample .db file in my dropbox:
https://www.dropbox.com/s/w4h6n3cyamt6yc0/library.db?dl=0


>
> (This substitutes the collation NOCASE for the de_DE since I no have a de_DE collation:
>
> sqlite> select substr(name collate nocase, 1, 1), count(distinct id)
>     ...>   FROM artists
>     ...> group by substr(name collate nocase, 1, 1)
>     ...> order by substr(name collate nocase, 1, 1);
> QUERY PLAN
> |--SCAN TABLE artists (~1048576 rows)
> `--USE TEMP B-TREE FOR GROUP BY
> addr  opcode         p1    p2    p3    p4             p5  comment
> ----  -------------  ----  ----  ----  -------------  --  -------------
> 0     Init           0     57    0                    00  Start at 57
> 1     Noop           1     4     0                    00
> 2     SorterOpen     3     3     0     k(1,NOCASE)    00
> 3     Integer        0     5     0                    00  r[5]=0; clear abort flag
> 4     Null           0     8     8                    00  r[8..8]=NULL
> 5     Gosub          7     52    0                    00
> 6     OpenRead       0     3     0     2              00  root=3 iDb=0; artists
> 7     ColumnsUsed    0     0     0     3              00
> 8     Explain        8     0     0     SCAN TABLE artists (~1048576 rows)  00
> 9     Noop           0     0     0                    00  Begin WHERE-loop0: artists
> 10    Rewind         0     20    0                    00
> 11      Noop           0     0     0                    00  Begin WHERE-core
> 12      Column         0     1     13                   00  r[13]=artists.name
> 13      Function0      6     13    10    substr(3)      03  r[10]=func(r[13..15])
> 14      Column         0     1     11                   00  r[11]=artists.name
> 15      Column         0     0     12                   00  r[12]=artists.id
> 16      MakeRecord     10    3     16                   00  r[16]=mkrec(r[10..12])
> 17      SorterInsert   3     16    0                    00  key=r[16]
> 18      Noop           0     0     0                    00  End WHERE-core
> 19    Next           0     11    0                    01
> 20    Noop           0     0     0                    00  End WHERE-loop0: artists
> 21    OpenPseudo     4     16    3                    00  3 columns in r[16]
> 22    SorterSort     3     56    0                    00  GROUP BY sort
> 23      SorterData     3     16    4                    00  r[16]=data
> 24      Column         4     0     9                    00  r[9]=
> 25      Compare        8     9     1     k(1,NOCASE)    00  r[8] <-> r[9]
> 26      Jump           27    31    27                   00
> 27      Move           9     8     1                    00  r[8]=r[9]
> 28      Gosub          6     44    0                    00  output one row
> 29      IfPos          5     56    0                    00  if r[5]>0 then r[5]-=0, goto 56; check abort flag
> 30      Gosub          7     52    0                    00  reset accumulator
> 31      Column         4     2     17                   00  r[17]=
> 32      Found          2     36    17    1              00  key=r[17]
> 33      MakeRecord     17    1     18                   00  r[18]=mkrec(r[17])
> 34      IdxInsert      2     18    17    1              10  key=r[18]
> 35      AggStep        0     17    2     count(1)       01  accum=r[2] step(r[17])
> 36      If             4     38    0                    00
> 37      Column         4     1     1                    00  r[1]=
> 38      Integer        1     4     0                    00  r[4]=1; indicate data in accumulator
> 39    SorterNext     3     23    0                    00
> 40    Gosub          6     44    0                    00  output final row
> 41    Goto           0     56    0                    00
> 42    Integer        1     5     0                    00  r[5]=1; set abort flag
> 43    Return         6     0     0                    00
> 44    IfPos          4     46    0                    00  if r[4]>0 then r[4]-=0, goto 46; Groupby result generator entry point
> 45    Return         6     0     0                    00
> 46    AggFinal       2     1     0     count(1)       00  accum=r[2] N=1
> 47    Copy           1     21    0                    00  r[21]=r[1]
> 48    Function0      6     21    19    substr(3)      03  r[19]=func(r[21..23])
> 49    Copy           2     20    0                    00  r[20]=r[2]
> 50    ResultRow      19    2     0                    00  output=r[19..20]
> 51    Return         6     0     0                    00  end groupby result generator
> 52    Null           0     1     3                    00  r[1..3]=NULL
> 53    OpenEphemeral  2     0     0     k(1,B)         00  nColumn=0
> 54    Integer        0     4     0                    00  r[4]=0; indicate accumulator empty
> 55    Return         7     0     0                    00
> 56    Halt           0     0     0                    00
> 57    Transaction    0     0     2     0              01  usesStmtJournal=0
> 58    Integer        1     14    0                    00  r[14]=1
> 59    Integer        1     15    0                    00  r[15]=1
> 60    Integer        1     22    0                    00  r[22]=1
> 61    Integer        1     23    0                    00  r[23]=1
> 62    Goto           0     1     0                    00
> sqlite>
>
>>
>> Michael
>>
>>
>>>
>>> That is
>>>
>>> select name collate nocase, count(distinct id) from x group by name
>> collate nocase order by name collate nocase
>>>
>>> whill produce cased output not the value that was used for the
>> sorting.
>>>
>>>
>>> select lower(name collate nocase), count(distinct id) from x group
>> by name collate nocase order by name collate nocase;
>>>
>>> to transmorgificate name into a "caseless" representation.  So you
>> would need to do something like this:
>>>
>>> select de_DE(substr(name collate de_DE,1,1)), count(distinct id)
>>from artists
>>> group by substr(name collate de_DE,1,1)
>>> order by by substr(name collate de_DE,1,1)
>>>
>>> and the function de_DE would have to transmorgificate its value to
>> the result you want to see.
>>>
>>> ---
>>> The fact that there's a Highway to Hell but only a Stairway to
>> Heaven says a lot about anticipated traffic volume.
>>>
>>>> -----Original Message-----
>>>> From: sqlite-users [mailto:sqlite-users-
>>>> [hidden email]] On Behalf Of [hidden email]
>>>> Sent: Thursday, 7 February, 2019 05:12
>>>> To: [hidden email]
>>>> Subject: [sqlite] GROUP BY and ICU collation
>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I'm trying to create a list with an index list. Eg. I have
>>>> artists:
>>>>>>
>>>>>> Sting
>>>>>> Šuma Čovjek
>>>>>> Suzanne Vega
>>>>>>
>>>>>> That's the sort order I'd get using an ICU collation. "Šuma
>>>> Čovjek"
>>>>>> would be sorted as "Suma..." as expected.
>>>>>>
>>>>>> Now I'd like to create an index bar by providing groups of the
>>>> first
>>>>>> character:
>>>>>>
>>>>>> SELECT SUBSTR(name,1,1), count(distinct id) FROM artists GROUP
>> BY
>>>>>> SUBSTR(name,1,1) ORDER BY name COLLATE de_DE>>
>>>>> Aren't you missing a COLLATE clause after the GROUP BY term?
>>>>>
>>>>>        ... GROUP BY SUBSTR(name,1,1) COLLATE de_DE ...
>>>>
>>>> TBH: I didn't even know about this. I thought the COLLATE at the
>> end
>>>> of
>>>> the statement would do it for all.
>>>>
>>>> Alas, tried again to no avail. No matter whether I add it after
>> the
>>>> GROUP BY or not, the result is the same.
>>>>
>>>> I should probably have added some version information: I'm using
>> the
>>>> Perl DBD::SQLite 1.58 (sqlite 3.22.0). I didn't see any mention of
>>>> related changes in the changelog for SQLite. What would be the
>>>> easiest
>>>> (and most reliable) way to try to reproduce this without Perl? Is
>>>> there
>>>> a HowTo use collations with the CLI sqlite?
>>>>
>>>> --
>>>>
>>>> Michael
>
> ---
> The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.
>
>
>
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: GROUP BY and ICU collation

Richard Damon
On 2/8/19 9:11 AM, David Raymond wrote:

> Remember that fancy collations don't just look at 1 character at a time, they look at the whole thing, and can do surprising stuff based on that. In this case the order of preference for the collation looks like "when it's part of a larger word, then treating 'S' and 'Š' the same is more important than separating them. But when it's just 1 character then they're different.
>
> So 'S' might be before 'Š', but in words it might go
>
> 'Sam'
> 'Šam'
> 'Skunk'
> 'Škunk'
> 'Sudden'
> 'Šudden'
>
> rather than a simple character-at-a-time order of
>
> 'Sam'
> 'Skunk'
> 'Sudden'
> 'Šam'
> 'Škunk'
> 'Šudden'
>
Actually, the way those collations work is that on first pass, S and Š
compare equal, but if two words on first pass compare equal, then
effectively a second pass take place, and on the second pass, S and Š
compare with an order (I thought I remembers there even being some cases
that needed a third pass).

--
Richard Damon

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: GROUP BY and ICU collation

Keith Medcalf
In reply to this post by Michael Herger

Must be whatever the ICU collating sequence does.  It apparently sorts into an order you like, but does not sort the characters as being "the same".  Unless they are "the same" they will not be in the same group.

The unifuzz "unaccented" collation does sort the two characters as "the same" (note that it is just happenstance what character you got for "S"):

sqlite> select substr(name collate unaccented,1,1) from contributors group by substr(name collate unaccented,1,1);
QUERY PLAN
|--SCAN TABLE contributors USING COVERING INDEX contributorsNameIndex (~48 rows)
`--USE TEMP B-TREE FOR GROUP BY
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     47    0                    00  Start at 47
1     SorterOpen     1     2     0     k(1,UNACCENTED)  00
2     Integer        0     3     0                    00  r[3]=0; clear abort flag
3     Null           0     6     6                    00  r[6..6]=NULL
4     Gosub          5     43    0                    00
5     OpenRead       2     5     0     k(2,,)         00  root=5 iDb=0; contributorsNameIndex
6     ColumnsUsed    2     0     0     1              00
7     Explain        7     0     0     SCAN TABLE contributors USING COVERING INDEX contributorsNameIndex (~48 rows)  00            
8     Noop           0     0     0                    00  Begin WHERE-loop0: contributors
9     Rewind         2     18    8     0              00
10      Noop           0     0     0                    00  Begin WHERE-core
11      Column         2     0     10                   00  r[10]=contributors.name
12      Function0      6     10    8     substr(3)      03  r[8]=func(r[10..12])
13      Column         2     0     9                    00  r[9]=contributors.name
14      MakeRecord     8     2     13                   00  r[13]=mkrec(r[8..9])
15      SorterInsert   1     13    0                    00  key=r[13]
16      Noop           0     0     0                    00  End WHERE-core
17    Next           2     10    0                    01
18    Noop           0     0     0                    00  End WHERE-loop0: contributors
19    OpenPseudo     3     13    2                    00  2 columns in r[13]
20    SorterSort     1     46    0                    00  GROUP BY sort
21      SorterData     1     13    3                    00  r[13]=data
22      Column         3     0     7                    00  r[7]=
23      Compare        6     7     1     k(1,UNACCENTED)  00  r[6] <-> r[7]
24      Jump           25    29    25                   00
25      Move           7     6     1                    00  r[6]=r[7]
26      Gosub          4     37    0                    00  output one row
27      IfPos          3     46    0                    00  if r[3]>0 then r[3]-=0, goto 46; check abort flag
28      Gosub          5     43    0                    00  reset accumulator
29      If             2     31    0                    00
30      Column         3     1     1                    00  r[1]=
31      Integer        1     2     0                    00  r[2]=1; indicate data in accumulator
32    SorterNext     1     21    0                    00
33    Gosub          4     37    0                    00  output final row
34    Goto           0     46    0                    00
35    Integer        1     3     0                    00  r[3]=1; set abort flag
36    Return         4     0     0                    00
37    IfPos          2     39    0                    00  if r[2]>0 then r[2]-=0, goto 39; Groupby result generator entry point
38    Return         4     0     0                    00
39    Copy           1     15    0                    00  r[15]=r[1]
40    Function0      6     15    14    substr(3)      03  r[14]=func(r[15..17])
41    ResultRow      14    1     0                    00  output=r[14]
42    Return         4     0     0                    00  end groupby result generator
43    Null           0     1     1                    00  r[1..1]=NULL
44    Integer        0     2     0                    00  r[2]=0; indicate accumulator empty
45    Return         5     0     0                    00
46    Halt           0     0     0                    00
47    Transaction    0     0     518   0              01  usesStmtJournal=0
48    Integer        1     11    0                    00  r[11]=1
49    Integer        1     12    0                    00  r[12]=1
50    Integer        1     16    0                    00  r[16]=1
51    Integer        1     17    0                    00  r[17]=1
52    Goto           0     1     0                    00
A
B
C
D
E
F
G
I
J
K
L
M
P
R
S
T
W
sqlite>

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>[hidden email]] On Behalf Of Michael Herger
>Sent: Friday, 8 February, 2019 02:13
>To: [hidden email]
>Subject: Re: [sqlite] GROUP BY and ICU collation
>
>Hi Keith,
>
>thanks for your response (which partly goes beyond my understanding,
>but
>I'm learning :-)).
>
>> But you are not using the same "expression" for selecting, sorting,
>and grouping.  That is, you need to specify:
>>
>>    SELECT expression, count(distinct id)
>>      FROM artists
>> GROUP BY expression
>> ORDER BY expression;
>>
>> where expression is the expression that you want to use
>>
>>    SELECT substr(name collate de_DE, 1, 1), count(distinct id)
>>      FROM artists
>> GROUP BY substr(name collate de_DE, 1, 1)
>> ORDER BY substr(name collate de_DE, 1, 1);
>>
>> If you do not do so then you cannot expect the "GROUP BY" to be
>using the same ordering as the "ORDER BY" nor the select to be
>returning the same value that was used to do the grouping and
>sorting.
>
>Ok, tried that:
>
>SELECT SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)
>FROM contributors
>GROUP BY SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)
>ORDER BY SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)
>
>Resulted in:
>
>"A"
>"C"
>"D"
>"P"
>"R"
>"S"
>"Š"
>"T"
>"W"
>
>Whereas this:
>
>SELECT contributors.name, contributors.namesort COLLATE de_DE
>FROM contributors
>ORDER BY contributors.namesort COLLATE de_DE
>
>...resulted in this:
>
>...
>   ["Roddy Woomble & Band", "RODDY WOOMBLE BAND"],
>   ["Samuel Yirga", "SAMUEL YIRGA"],
>   ["Stephin Merritt", "STEPHIN MERRITT"],
>   ["Šuma Čovjek", "ŠUMA ČOVJEK"],
>   ["Syriana", "SYRIANA"],
>   ["Tom Griesgraber", "TOM GRIESGRABER"],
>...
>
>So despite my using the same expression for the GROUP as for the
>ORDER,
>the grouping seems to ignore the collation, whereas the sorting alone
>would not.
>
>I'm using the ICU extension (otherwise the sorting wouldn't work
>either,
>right?). Could that extension be responsible for this odd behaviour?
>
>I've put a small sample .db file in my dropbox:
>https://www.dropbox.com/s/w4h6n3cyamt6yc0/library.db?dl=0
>
>
>>
>> (This substitutes the collation NOCASE for the de_DE since I no
>have a de_DE collation:
>>
>> sqlite> select substr(name collate nocase, 1, 1), count(distinct
>id)
>>     ...>   FROM artists
>>     ...> group by substr(name collate nocase, 1, 1)
>>     ...> order by substr(name collate nocase, 1, 1);
>> QUERY PLAN
>> |--SCAN TABLE artists (~1048576 rows)
>> `--USE TEMP B-TREE FOR GROUP BY
>> addr  opcode         p1    p2    p3    p4             p5  comment
>> ----  -------------  ----  ----  ----  -------------  --  ---------
>----
>> 0     Init           0     57    0                    00  Start at
>57
>> 1     Noop           1     4     0                    00
>> 2     SorterOpen     3     3     0     k(1,NOCASE)    00
>> 3     Integer        0     5     0                    00  r[5]=0;
>clear abort flag
>> 4     Null           0     8     8                    00
>r[8..8]=NULL
>> 5     Gosub          7     52    0                    00
>> 6     OpenRead       0     3     0     2              00  root=3
>iDb=0; artists
>> 7     ColumnsUsed    0     0     0     3              00
>> 8     Explain        8     0     0     SCAN TABLE artists (~1048576
>rows)  00
>> 9     Noop           0     0     0                    00  Begin
>WHERE-loop0: artists
>> 10    Rewind         0     20    0                    00
>> 11      Noop           0     0     0                    00  Begin
>WHERE-core
>> 12      Column         0     1     13                   00
>r[13]=artists.name
>> 13      Function0      6     13    10    substr(3)      03
>r[10]=func(r[13..15])
>> 14      Column         0     1     11                   00
>r[11]=artists.name
>> 15      Column         0     0     12                   00
>r[12]=artists.id
>> 16      MakeRecord     10    3     16                   00
>r[16]=mkrec(r[10..12])
>> 17      SorterInsert   3     16    0                    00
>key=r[16]
>> 18      Noop           0     0     0                    00  End
>WHERE-core
>> 19    Next           0     11    0                    01
>> 20    Noop           0     0     0                    00  End
>WHERE-loop0: artists
>> 21    OpenPseudo     4     16    3                    00  3 columns
>in r[16]
>> 22    SorterSort     3     56    0                    00  GROUP BY
>sort
>> 23      SorterData     3     16    4                    00
>r[16]=data
>> 24      Column         4     0     9                    00  r[9]=
>> 25      Compare        8     9     1     k(1,NOCASE)    00  r[8] <-
>> r[9]
>> 26      Jump           27    31    27                   00
>> 27      Move           9     8     1                    00
>r[8]=r[9]
>> 28      Gosub          6     44    0                    00  output
>one row
>> 29      IfPos          5     56    0                    00  if
>r[5]>0 then r[5]-=0, goto 56; check abort flag
>> 30      Gosub          7     52    0                    00  reset
>accumulator
>> 31      Column         4     2     17                   00  r[17]=
>> 32      Found          2     36    17    1              00
>key=r[17]
>> 33      MakeRecord     17    1     18                   00
>r[18]=mkrec(r[17])
>> 34      IdxInsert      2     18    17    1              10
>key=r[18]
>> 35      AggStep        0     17    2     count(1)       01
>accum=r[2] step(r[17])
>> 36      If             4     38    0                    00
>> 37      Column         4     1     1                    00  r[1]=
>> 38      Integer        1     4     0                    00  r[4]=1;
>indicate data in accumulator
>> 39    SorterNext     3     23    0                    00
>> 40    Gosub          6     44    0                    00  output
>final row
>> 41    Goto           0     56    0                    00
>> 42    Integer        1     5     0                    00  r[5]=1;
>set abort flag
>> 43    Return         6     0     0                    00
>> 44    IfPos          4     46    0                    00  if r[4]>0
>then r[4]-=0, goto 46; Groupby result generator entry point
>> 45    Return         6     0     0                    00
>> 46    AggFinal       2     1     0     count(1)       00
>accum=r[2] N=1
>> 47    Copy           1     21    0                    00
>r[21]=r[1]
>> 48    Function0      6     21    19    substr(3)      03
>r[19]=func(r[21..23])
>> 49    Copy           2     20    0                    00
>r[20]=r[2]
>> 50    ResultRow      19    2     0                    00
>output=r[19..20]
>> 51    Return         6     0     0                    00  end
>groupby result generator
>> 52    Null           0     1     3                    00
>r[1..3]=NULL
>> 53    OpenEphemeral  2     0     0     k(1,B)         00  nColumn=0
>> 54    Integer        0     4     0                    00  r[4]=0;
>indicate accumulator empty
>> 55    Return         7     0     0                    00
>> 56    Halt           0     0     0                    00
>> 57    Transaction    0     0     2     0              01
>usesStmtJournal=0
>> 58    Integer        1     14    0                    00  r[14]=1
>> 59    Integer        1     15    0                    00  r[15]=1
>> 60    Integer        1     22    0                    00  r[22]=1
>> 61    Integer        1     23    0                    00  r[23]=1
>> 62    Goto           0     1     0                    00
>> sqlite>
>>
>>>
>>> Michael
>>>
>>>
>>>>
>>>> That is
>>>>
>>>> select name collate nocase, count(distinct id) from x group by
>name
>>> collate nocase order by name collate nocase
>>>>
>>>> whill produce cased output not the value that was used for the
>>> sorting.
>>>>
>>>>
>>>> select lower(name collate nocase), count(distinct id) from x
>group
>>> by name collate nocase order by name collate nocase;
>>>>
>>>> to transmorgificate name into a "caseless" representation.  So
>you
>>> would need to do something like this:
>>>>
>>>> select de_DE(substr(name collate de_DE,1,1)), count(distinct id)
>>>from artists
>>>> group by substr(name collate de_DE,1,1)
>>>> order by by substr(name collate de_DE,1,1)
>>>>
>>>> and the function de_DE would have to transmorgificate its value
>to
>>> the result you want to see.
>>>>
>>>> ---
>>>> The fact that there's a Highway to Hell but only a Stairway to
>>> Heaven says a lot about anticipated traffic volume.
>>>>
>>>>> -----Original Message-----
>>>>> From: sqlite-users [mailto:sqlite-users-
>>>>> [hidden email]] On Behalf Of [hidden email]
>>>>> Sent: Thursday, 7 February, 2019 05:12
>>>>> To: [hidden email]
>>>>> Subject: [sqlite] GROUP BY and ICU collation
>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I'm trying to create a list with an index list. Eg. I have
>>>>> artists:
>>>>>>>
>>>>>>> Sting
>>>>>>> Šuma Čovjek
>>>>>>> Suzanne Vega
>>>>>>>
>>>>>>> That's the sort order I'd get using an ICU collation. "Šuma
>>>>> Čovjek"
>>>>>>> would be sorted as "Suma..." as expected.
>>>>>>>
>>>>>>> Now I'd like to create an index bar by providing groups of the
>>>>> first
>>>>>>> character:
>>>>>>>
>>>>>>> SELECT SUBSTR(name,1,1), count(distinct id) FROM artists GROUP
>>> BY
>>>>>>> SUBSTR(name,1,1) ORDER BY name COLLATE de_DE>>
>>>>>> Aren't you missing a COLLATE clause after the GROUP BY term?
>>>>>>
>>>>>>        ... GROUP BY SUBSTR(name,1,1) COLLATE de_DE ...
>>>>>
>>>>> TBH: I didn't even know about this. I thought the COLLATE at the
>>> end
>>>>> of
>>>>> the statement would do it for all.
>>>>>
>>>>> Alas, tried again to no avail. No matter whether I add it after
>>> the
>>>>> GROUP BY or not, the result is the same.
>>>>>
>>>>> I should probably have added some version information: I'm using
>>> the
>>>>> Perl DBD::SQLite 1.58 (sqlite 3.22.0). I didn't see any mention
>of
>>>>> related changes in the changelog for SQLite. What would be the
>>>>> easiest
>>>>> (and most reliable) way to try to reproduce this without Perl?
>Is
>>>>> there
>>>>> a HowTo use collations with the CLI sqlite?
>>>>>
>>>>> --
>>>>>
>>>>> Michael
>>
>> ---
>> The fact that there's a Highway to Hell but only a Stairway to
>Heaven says a lot about anticipated traffic volume.
>>
>>
>>
>>
>> _______________________________________________
>> sqlite-users mailing list
>> [hidden email]
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-
>users
>>
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: GROUP BY and ICU collation

Michael Herger
In reply to this post by David Raymond
> Remember that fancy collations don't just look at 1 character at a time, they look at the whole thing, and can do surprising stuff based on that. In this case the order of preference for the collation looks like "when it's part of a larger word, then treating 'S' and 'Š' the same is more important than separating them. But when it's just 1 character then they're different.

Uh... oh... that might explain things... Then I'll have to re-think
things considerably.

Thank you all for your input!

Michael


>
> So 'S' might be before 'Š', but in words it might go
>
> 'Sam'
> 'Šam'
> 'Skunk'
> 'Škunk'
> 'Sudden'
> 'Šudden'
>
> rather than a simple character-at-a-time order of
>
> 'Sam'
> 'Skunk'
> 'Sudden'
> 'Šam'
> 'Škunk'
> 'Šudden'
>
>
> -----Original Message-----
> From: sqlite-users [mailto:[hidden email]] On Behalf Of Michael Herger
> Sent: Friday, February 08, 2019 4:13 AM
> To: [hidden email]
> Subject: Re: [sqlite] GROUP BY and ICU collation
>
> Hi Keith,
>
> thanks for your response (which partly goes beyond my understanding, but
> I'm learning :-)).
>
>> But you are not using the same "expression" for selecting, sorting, and grouping.  That is, you need to specify:
>>
>>     SELECT expression, count(distinct id)
>>       FROM artists
>> GROUP BY expression
>> ORDER BY expression;
>>
>> where expression is the expression that you want to use
>>
>>     SELECT substr(name collate de_DE, 1, 1), count(distinct id)
>>       FROM artists
>> GROUP BY substr(name collate de_DE, 1, 1)
>> ORDER BY substr(name collate de_DE, 1, 1);
>>
>> If you do not do so then you cannot expect the "GROUP BY" to be using the same ordering as the "ORDER BY" nor the select to be returning the same value that was used to do the grouping and sorting.
>
> Ok, tried that:
>
> SELECT SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)
> FROM contributors
> GROUP BY SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)
> ORDER BY SUBSTR(contributors.namesort COLLATE de_DE , 1, 1)
>
> Resulted in:
>
> "A"
> "C"
> "D"
> "P"
> "R"
> "S"
> "Š"
> "T"
> "W"
>
> Whereas this:
>
> SELECT contributors.name, contributors.namesort COLLATE de_DE
> FROM contributors
> ORDER BY contributors.namesort COLLATE de_DE
>
> ...resulted in this:
>
> ...
>     ["Roddy Woomble & Band", "RODDY WOOMBLE BAND"],
>     ["Samuel Yirga", "SAMUEL YIRGA"],
>     ["Stephin Merritt", "STEPHIN MERRITT"],
>     ["Šuma Čovjek", "ŠUMA ČOVJEK"],
>     ["Syriana", "SYRIANA"],
>     ["Tom Griesgraber", "TOM GRIESGRABER"],
> ...
>
> So despite my using the same expression for the GROUP as for the ORDER,
> the grouping seems to ignore the collation, whereas the sorting alone
> would not.
>
> I'm using the ICU extension (otherwise the sorting wouldn't work either,
> right?). Could that extension be responsible for this odd behaviour?
>
> I've put a small sample .db file in my dropbox:
> https://www.dropbox.com/s/w4h6n3cyamt6yc0/library.db?dl=0
>
>
>>
>> (This substitutes the collation NOCASE for the de_DE since I no have a de_DE collation:
>>
>> sqlite> select substr(name collate nocase, 1, 1), count(distinct id)
>>      ...>   FROM artists
>>      ...> group by substr(name collate nocase, 1, 1)
>>      ...> order by substr(name collate nocase, 1, 1);
>> QUERY PLAN
>> |--SCAN TABLE artists (~1048576 rows)
>> `--USE TEMP B-TREE FOR GROUP BY
>> addr  opcode         p1    p2    p3    p4             p5  comment
>> ----  -------------  ----  ----  ----  -------------  --  -------------
>> 0     Init           0     57    0                    00  Start at 57
>> 1     Noop           1     4     0                    00
>> 2     SorterOpen     3     3     0     k(1,NOCASE)    00
>> 3     Integer        0     5     0                    00  r[5]=0; clear abort flag
>> 4     Null           0     8     8                    00  r[8..8]=NULL
>> 5     Gosub          7     52    0                    00
>> 6     OpenRead       0     3     0     2              00  root=3 iDb=0; artists
>> 7     ColumnsUsed    0     0     0     3              00
>> 8     Explain        8     0     0     SCAN TABLE artists (~1048576 rows)  00
>> 9     Noop           0     0     0                    00  Begin WHERE-loop0: artists
>> 10    Rewind         0     20    0                    00
>> 11      Noop           0     0     0                    00  Begin WHERE-core
>> 12      Column         0     1     13                   00  r[13]=artists.name
>> 13      Function0      6     13    10    substr(3)      03  r[10]=func(r[13..15])
>> 14      Column         0     1     11                   00  r[11]=artists.name
>> 15      Column         0     0     12                   00  r[12]=artists.id
>> 16      MakeRecord     10    3     16                   00  r[16]=mkrec(r[10..12])
>> 17      SorterInsert   3     16    0                    00  key=r[16]
>> 18      Noop           0     0     0                    00  End WHERE-core
>> 19    Next           0     11    0                    01
>> 20    Noop           0     0     0                    00  End WHERE-loop0: artists
>> 21    OpenPseudo     4     16    3                    00  3 columns in r[16]
>> 22    SorterSort     3     56    0                    00  GROUP BY sort
>> 23      SorterData     3     16    4                    00  r[16]=data
>> 24      Column         4     0     9                    00  r[9]=
>> 25      Compare        8     9     1     k(1,NOCASE)    00  r[8] <-> r[9]
>> 26      Jump           27    31    27                   00
>> 27      Move           9     8     1                    00  r[8]=r[9]
>> 28      Gosub          6     44    0                    00  output one row
>> 29      IfPos          5     56    0                    00  if r[5]>0 then r[5]-=0, goto 56; check abort flag
>> 30      Gosub          7     52    0                    00  reset accumulator
>> 31      Column         4     2     17                   00  r[17]=
>> 32      Found          2     36    17    1              00  key=r[17]
>> 33      MakeRecord     17    1     18                   00  r[18]=mkrec(r[17])
>> 34      IdxInsert      2     18    17    1              10  key=r[18]
>> 35      AggStep        0     17    2     count(1)       01  accum=r[2] step(r[17])
>> 36      If             4     38    0                    00
>> 37      Column         4     1     1                    00  r[1]=
>> 38      Integer        1     4     0                    00  r[4]=1; indicate data in accumulator
>> 39    SorterNext     3     23    0                    00
>> 40    Gosub          6     44    0                    00  output final row
>> 41    Goto           0     56    0                    00
>> 42    Integer        1     5     0                    00  r[5]=1; set abort flag
>> 43    Return         6     0     0                    00
>> 44    IfPos          4     46    0                    00  if r[4]>0 then r[4]-=0, goto 46; Groupby result generator entry point
>> 45    Return         6     0     0                    00
>> 46    AggFinal       2     1     0     count(1)       00  accum=r[2] N=1
>> 47    Copy           1     21    0                    00  r[21]=r[1]
>> 48    Function0      6     21    19    substr(3)      03  r[19]=func(r[21..23])
>> 49    Copy           2     20    0                    00  r[20]=r[2]
>> 50    ResultRow      19    2     0                    00  output=r[19..20]
>> 51    Return         6     0     0                    00  end groupby result generator
>> 52    Null           0     1     3                    00  r[1..3]=NULL
>> 53    OpenEphemeral  2     0     0     k(1,B)         00  nColumn=0
>> 54    Integer        0     4     0                    00  r[4]=0; indicate accumulator empty
>> 55    Return         7     0     0                    00
>> 56    Halt           0     0     0                    00
>> 57    Transaction    0     0     2     0              01  usesStmtJournal=0
>> 58    Integer        1     14    0                    00  r[14]=1
>> 59    Integer        1     15    0                    00  r[15]=1
>> 60    Integer        1     22    0                    00  r[22]=1
>> 61    Integer        1     23    0                    00  r[23]=1
>> 62    Goto           0     1     0                    00
>> sqlite>
>>
>>>
>>> Michael
>>>
>>>
>>>>
>>>> That is
>>>>
>>>> select name collate nocase, count(distinct id) from x group by name
>>> collate nocase order by name collate nocase
>>>>
>>>> whill produce cased output not the value that was used for the
>>> sorting.
>>>>
>>>>
>>>> select lower(name collate nocase), count(distinct id) from x group
>>> by name collate nocase order by name collate nocase;
>>>>
>>>> to transmorgificate name into a "caseless" representation.  So you
>>> would need to do something like this:
>>>>
>>>> select de_DE(substr(name collate de_DE,1,1)), count(distinct id)
>> >from artists
>>>> group by substr(name collate de_DE,1,1)
>>>> order by by substr(name collate de_DE,1,1)
>>>>
>>>> and the function de_DE would have to transmorgificate its value to
>>> the result you want to see.
>>>>
>>>> ---
>>>> The fact that there's a Highway to Hell but only a Stairway to
>>> Heaven says a lot about anticipated traffic volume.
>>>>
>>>>> -----Original Message-----
>>>>> From: sqlite-users [mailto:sqlite-users-
>>>>> [hidden email]] On Behalf Of [hidden email]
>>>>> Sent: Thursday, 7 February, 2019 05:12
>>>>> To: [hidden email]
>>>>> Subject: [sqlite] GROUP BY and ICU collation
>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I'm trying to create a list with an index list. Eg. I have
>>>>> artists:
>>>>>>>
>>>>>>> Sting
>>>>>>> Šuma Čovjek
>>>>>>> Suzanne Vega
>>>>>>>
>>>>>>> That's the sort order I'd get using an ICU collation. "Šuma
>>>>> Čovjek"
>>>>>>> would be sorted as "Suma..." as expected.
>>>>>>>
>>>>>>> Now I'd like to create an index bar by providing groups of the
>>>>> first
>>>>>>> character:
>>>>>>>
>>>>>>> SELECT SUBSTR(name,1,1), count(distinct id) FROM artists GROUP
>>> BY
>>>>>>> SUBSTR(name,1,1) ORDER BY name COLLATE de_DE>>
>>>>>> Aren't you missing a COLLATE clause after the GROUP BY term?
>>>>>>
>>>>>>         ... GROUP BY SUBSTR(name,1,1) COLLATE de_DE ...
>>>>>
>>>>> TBH: I didn't even know about this. I thought the COLLATE at the
>>> end
>>>>> of
>>>>> the statement would do it for all.
>>>>>
>>>>> Alas, tried again to no avail. No matter whether I add it after
>>> the
>>>>> GROUP BY or not, the result is the same.
>>>>>
>>>>> I should probably have added some version information: I'm using
>>> the
>>>>> Perl DBD::SQLite 1.58 (sqlite 3.22.0). I didn't see any mention of
>>>>> related changes in the changelog for SQLite. What would be the
>>>>> easiest
>>>>> (and most reliable) way to try to reproduce this without Perl? Is
>>>>> there
>>>>> a HowTo use collations with the CLI sqlite?
>>>>>
>>>>> --
>>>>>
>>>>> Michael
>>
>> ---
>> The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.
>>
>>
>>
>>
>> _______________________________________________
>> sqlite-users mailing list
>> [hidden email]
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>

--

Michael
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users