"Distinct" and "Order by" in "group_concat"

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

"Distinct" and "Order by" in "group_concat"

Jean-Luc Hainaut
Hi,

I have two questions/observations/suggestions related to SQLite function "group_concat".
This function is extremely useful since it replaces in an elegant and concise way quite tricky procedures. However, it misses two features of which I would like to discuss the work-around.

Let's consider table P(PiD,Age,City), giving, for a set of persons, their age and city. PiD is the PK.

create table P(PiD integer not null primary key,Age integer,City text);
insert into P values (1,35,'B'),(2,32,'B'),(3,35,'A'),(4,35,'B'),(5,32,'A');

+-----+-----+------+
| PiD | Age | City |
+-----+-----+------+
| 1   | 35  | B    |
| 2   | 32  | B    |
| 3   | 35  | A    |
| 4   | 35  | B    |
| 5   | 32  | A    |
+-----+-----+------+


1. The case of the DISTINCT clause in "group_concat" function

I want to extract a table that assigns to each age the list of cities in which some persons of this age live:

select Age,group_concat(City) as Cities
from P
group by Age;

+-----+--------+
| Age | Cities |
+-----+--------+
| 32  | B,A    |
| 35  | B,A,B  |
+-----+--------+

Duplicates can be removed through quantifier "distinct":

select Age,group_concat(distinct City) as Cities
from P
group by Age;

+-----+--------+
| Age | Cities |
+-----+--------+
| 32  | B,A    |
| 35  | B,A    |
+-----+--------+

Now, I would like to change the default separator (",") for ";". No problem "with duplicates":

select Age,group_concat(City,';') as Cities
from P
group by Age;

+-----+--------+
| Age | Cities |
+-----+--------+
| 32  | B;A    |
| 35  | B;A;B  |
+-----+--------+

But this does not work when I ask to reduce duplicates:

select Age,group_concat(distinct City,';') as Cities
from P
group by Age;

We get the error message: "DISTINCT aggregates must have exactly one argument".
Clearly, the analyzer doesn't consider the second argument as the separator but as an additional element to concatenate, which is invalid (the documentation says "In any aggregate function that takes a single argument, that argument can be preceded by the keyword DISTINCT").

This can be solved, but at the cost of increased complexity (and perhaps execution time), by removing duplicates in a subquery "from" clause ...:

select Age,group_concat(City,';') as Cities
from (select distinct Age,City
      from P)
group by Age;

+-----+--------+
| Age | Cities |
+-----+--------+
| 32  | B;A    |
| 35  | B;A    |
+-----+--------+

... or in a subquery in the select-list:

select Age,(select group_concat(City,';')
            from P
            where Age = AP.Age
            group by Age) as Cities
from P AP
group by Age;

.. or even by replacing default ',' with ';' in the resulting concatenated list, provided no comma appears in the concatenated elements:

select Age,replace(group_concat(distinct City),',',';') as Cities
from P group by Age;

A more drastic solution would be to code a new, user-defined, aggregate function that simulates a "group_concat(distinct ...,separator)" function.


My question/suggestion
----------------------------------
Is there a better way to reduce duplicates with the current version of SQLite?
Could we imagine, in a future version of SQLite, an extension of "group_concat" with both "distinct" and user-defined separator? If we can, I will support it!


2. The case of the ORDER BY clause in "group_concat" function

My second concern is about the way to sort the items of the "group_concat" result, as is possible in some other DBMS. E.g., how to get such sorted lists in SQLite:

+-----+--------+
| Age | Cities |
+-----+--------+
| 32  | A,B    |
| 35  | A,B,B  |
+-----+--------+
or
+-----+--------+
| Age | Cities |
+-----+--------+
| 32  | A,B    |
| 35  | A,B    |
+-----+--------+

The usual technique (generally found on the web) to sort the lists is through a subquery "from" clause in which table P (or a projection of it) is pre-sorted on the elements to aggregate:

select Age,group_concat(City) as Cities
from (select Age,City
      from P
      order by City)
-- or (more coercive for the optimizer): order by Age,City)
group by Age;

This works fine in all the tests I have caried out.
However, this technique relies on the fact that the order is preserved and is exploited when executing the outer query.
Though this hypothesis seems quite natural, theoretically, it depends on multiple factors: size of the source data, indexes, other clauses of the source query that may require different row ordering, current strategies of the optimizer (that may evolve in next versions). In short, unless the hypothesis formulated above can be taken for certain and definitive in the long term, this technique may be felt unstable and unsecure.

My question
-----------------
Can we be made sure that this form will always, in all cases, produce the desired element ordering in the concatenated list?

Thanks for your comments.

Jean-Luc Hainaut

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users