Group contiguous rows (islands)

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Group contiguous rows (islands)

Rossel, Jonathan
Dear all,

I need to perform a kind of partial GROUP BY to determine the beginnings and ends of sets of identical data. I can't use a full GROUP BY because these sets can be repeated and their repetition must be conserved. Other database engines have solutions for this task (like windowing in postgre) but I wonder if there is an efficient recipe in SQLite.

Example:
=======

Table: mytable
--------

date     test
------     ------
1 clim
3 clim
7 amb
10 amb
13 clim
15 clim
20 clim
22 amb
25 amb

Desired result
---------------------

date_from    date_to    test
---------------    ------------   ------
1                 3              clim
7                 10            amb
13               15             clim
22               25             amb


(non optimal) solution found
=====================

CREATE VIEW mytablebydate
AS  -- Pre-order table to avoid ordering it twice in sub-queries
SELECT * FROM mytable ORDER BY date

CREATE VIEW mytablenext
AS
SELECT  date,
                test,
                (
                -- first row > this row
        SELECT date           -- NULL if not exists
                FROM mytablebydate
                WHERE date > MT.date
                LIMIT 1
                ) as date_next,
                (
                -- first row > this row
        SELECT test           -- NULL if not exists
                FROM mytablebydate
                WHERE date > MT.date
                LIMIT 1
                ) as test_next
FROM mytable MT
WHERE test != test_next

-- Get desired results
SELECT  (
                --  Date of the previous row
                SELECT MAX( date_next )
                        FROM mytablenext
                        WHERE date_next < mt.date
                ) AS date_from,
       
                date AS date_to,       -- this row
                test
FROM mytablenext mt


Comments
========

This method returns a Null for the first date_from and the last group is not returned. It is therefore incomplete. In addition, it involves quite a lot of subqueries. For completeness, it is inspired by http://stackoverflow.com/questions/30455227/date-range-for-set-of-same-data/30460263#30460263. So, is there a better / official way in SQLite ?

Any help will be welcome,

Jonathan
       
*******************************************************************************
This e-mail message is intended only for the addressee(s) and contains
information which may be confidential. If you are not the intended
recipient please do not read, save, forward, disclose or copy the contents
of this e-mail. If this e-mail has been sent to you in error, please delete this
e-mail and any copies or links to this e-mail completely and immediately
from your system. We also like to inform you that communication via e-mail
over the Internet is insecure because third parties may have the possibility
to access and manipulate e-mails.

Any views expressed in this message are those of the individual sender,
except where the sender specifically states them to be the views of
The Swatch Group Ltd.
*******************************************************************************
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Group contiguous rows (islands)

Clemens Ladisch
Rossel, Jonathan wrote:
> Other database engines have solutions for this task (like windowing in
> postgre) but I wonder if there is an efficient recipe in SQLite.

SQLite does not have windowing functions.  So the most efficient method
would be to read the data with a simple ORDER BY, and do the grouping
in your code.


Regards,
Clemens


> This e-mail message is intended only ...

This e-mail contains public information intended for any subscriber of
this mailing list and for anybody else who bothers to read it; it will
be copied, disclosed and distributed to the public.  If you think you
are not the intended recipient, please commit suicide immediately.
These terms apply also to any e-mails quoted in, referenced from, or
answering this e-mail, and supersede any confidentiality notices in
those e-mails.  Additionally, confidentiality notices in those e-mails
will incur legal processing fees of $42 per line; you have agreed to
this by reading this confidentiality notice.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Group contiguous rows (islands)

Jean-Luc Hainaut
In reply to this post by Rossel, Jonathan

You could try this, inspired by classic algorithms of temporal databases:

create table T(date integer,test char(12));
insert into T
values (1,'clim'),(3,'clim'),(7,'amb'),(10,'amb'),(12,'xxx'),
(13,'clim'),(15,'clim'),(20,'clim'),(22,'amb'),(25,'amb');

create table TT(seq integer not null primary key autoincrement,date
integer,test char(12));
insert into TT(date,test) select * from T order by date;

select T1.date, T3.date, T1.test
from   TT T1, TT T3
-- More efficient than "where  T1.date <= T3.date"
where  T1.seq <= T3.seq
and    T1.test = T3.test
and    not exists(select * from TT where seq = T1.seq-1 and test = T1.test)
and    not exists(select * from TT where seq = T3.seq+1 and test = T3.test)
and    not exists(select *
                  from   TT T2
                  -- More efficient than "where  T2.date between T1.date
and T3.date"
                  where  T2.seq between T1.seq and T3.seq
                  and    T2.test <> T1.test);

Result:

+------+------+------+
| date | date | test |
+------+------+------+
| 1    | 3    | clim |
| 7    | 10   | amb  |
| 12   | 12   | xxx  |
| 13   | 20   | clim |
| 22   | 25   | amb  |
+------+------+------+

Working table TT is recommended to create an ordered sequence of rows in
which "next" and "previous" rows are more easily described than in the
source table. Avoid "order by" on views. It works in SQLite but it
should not!

The idea is to identify maximal sequences of identical "test" values as
follow:
- T1 denotes the first row of a sequence
- T3 the last row
- T2 any "disturbing" row lying between T1 and T3 but with a different
value of "test"
- first "not exists" condition states that T1 must be the very first of
the sequence: it must not be immediately preceded by a row with same
value of "test"
- same for second "not exists" condition: T3 must be the last
- the third "not exists" condition states that there is no "disturbing"
row between T1 and T3.

Valid if maximal sequences do not overlap. This query also detects
single row sequences (e.g., 'xxx').
An index on TT.test may be useful to support T1*T3 join.

For large tables, an iterative procedure will be faster, though less
elegant!

Regards

Jean-Luc Hainaut

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Group contiguous rows (islands)

Petite Abeille-2
In reply to this post by Clemens Ladisch

> On Feb 15, 2017, at 11:16 AM, Clemens Ladisch <[hidden email]> wrote:
>
> SQLite does not have windowing functions.

A continuous/continual tragedy indeed :|

Still, worthwhile mentioning The Tabibitosan Method, for reference purpose:

http://www.orchestrapit.co.uk/?p=53
https://community.oracle.com/message/3991678

Rather nifty in its simplicity and power. Sadly, out of reach to SQLite dwellers.




_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Group contiguous rows (islands)

Simon Slavin-3

On 15 Feb 2017, at 11:58am, Petite Abeille <[hidden email]> wrote:

> On Feb 15, 2017, at 11:16 AM, Clemens Ladisch <[hidden email]> wrote:
>
>> SQLite does not have windowing functions.
>
> A continuous/continual tragedy indeed :|

Windowing breaks the philosophy behind SQL.  Rows are meant to be members of a set, and your operations on them are meant to be set operations.  There is no implicit order for set elements.  That’s why bare-bones SQL implementations don’t have cursors or windowing.

> Still, worthwhile mentioning The Tabibitosan Method, for reference purpose:
>
> http://www.orchestrapit.co.uk/?p=53
> https://community.oracle.com/message/3991678
>
> Rather nifty in its simplicity and power. Sadly, out of reach to SQLite dwellers.

Actually SQLite can do it, by iterating using recursive common table expressions:

<https://www.sqlite.org/lang_with.html>

.  It looks like an interesting programming exercise, though there’s a danger it will lead to unreadable code (a perpetual danger with WITH).  You would need a spare column in the table to store the 'group' values in.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Group contiguous rows (islands)

Rossel, Jonathan
In reply to this post by Rossel, Jonathan
@ Clemens, Petite Abeille,

Thanks, that's what I thought, but it's comforting to know for sure...

@ Jean-Luc,

Thanks a lot for the detailed answer, that's awesome ! I'll give it a try and see how it compares with an external "manual" grouping


Jonathan



------------------------------

Message: 79
Date: Wed, 15 Feb 2017 11:16:24 +0100
From: Clemens Ladisch <[hidden email]>
To: [hidden email]
Subject: Re: [sqlite] Group contiguous rows (islands)
Message-ID: <[hidden email]>
Content-Type: text/plain; charset=us-ascii

Rossel, Jonathan wrote:
> Other database engines have solutions for this task (like windowing in
> postgre) but I wonder if there is an efficient recipe in SQLite.

SQLite does not have windowing functions.  So the most efficient method
would be to read the data with a simple ORDER BY, and do the grouping
in your code.


Regards,
Clemens


------------------------------

Message: 83
Date: Wed, 15 Feb 2017 12:02:32 +0100
From: Jean-Luc Hainaut <[hidden email]>
To: [hidden email]
Subject: Re: [sqlite] Group contiguous rows (islands)
Message-ID: <[hidden email]>
Content-Type: text/plain; charset=UTF-8; format=flowed


You could try this, inspired by classic algorithms of temporal databases:

create table T(date integer,test char(12));
insert into T
values (1,'clim'),(3,'clim'),(7,'amb'),(10,'amb'),(12,'xxx'),
(13,'clim'),(15,'clim'),(20,'clim'),(22,'amb'),(25,'amb');

create table TT(seq integer not null primary key autoincrement,date
integer,test char(12));
insert into TT(date,test) select * from T order by date;

select T1.date, T3.date, T1.test
from   TT T1, TT T3
-- More efficient than "where  T1.date <= T3.date"
where  T1.seq <= T3.seq
and    T1.test = T3.test
and    not exists(select * from TT where seq = T1.seq-1 and test = T1.test)
and    not exists(select * from TT where seq = T3.seq+1 and test = T3.test)
and    not exists(select *
                  from   TT T2
                  -- More efficient than "where  T2.date between T1.date
and T3.date"
                  where  T2.seq between T1.seq and T3.seq
                  and    T2.test <> T1.test);

Result:

+------+------+------+
| date | date | test |
+------+------+------+
| 1    | 3    | clim |
| 7    | 10   | amb  |
| 12   | 12   | xxx  |
| 13   | 20   | clim |
| 22   | 25   | amb  |
+------+------+------+

Working table TT is recommended to create an ordered sequence of rows in
which "next" and "previous" rows are more easily described than in the
source table. Avoid "order by" on views. It works in SQLite but it
should not!

The idea is to identify maximal sequences of identical "test" values as
follow:
- T1 denotes the first row of a sequence
- T3 the last row
- T2 any "disturbing" row lying between T1 and T3 but with a different
value of "test"
- first "not exists" condition states that T1 must be the very first of
the sequence: it must not be immediately preceded by a row with same
value of "test"
- same for second "not exists" condition: T3 must be the last
- the third "not exists" condition states that there is no "disturbing"
row between T1 and T3.

Valid if maximal sequences do not overlap. This query also detects
single row sequences (e.g., 'xxx').
An index on TT.test may be useful to support T1*T3 join.

For large tables, an iterative procedure will be faster, though less
elegant!

Regards

Jean-Luc Hainaut


------------------------------

Message: 89
Date: Wed, 15 Feb 2017 12:58:07 +0100
From: Petite Abeille <[hidden email]>
To: SQLite mailing list <[hidden email]>
Subject: Re: [sqlite] Group contiguous rows (islands)
Message-ID: <[hidden email]>
Content-Type: text/plain; charset=us-ascii


> On Feb 15, 2017, at 11:16 AM, Clemens Ladisch <[hidden email]> wrote:
>
> SQLite does not have windowing functions.

A continuous/continual tragedy indeed :|

Still, worthwhile mentioning The Tabibitosan Method, for reference purpose:

http://www.orchestrapit.co.uk/?p=53
https://community.oracle.com/message/3991678

Rather nifty in its simplicity and power. Sadly, out of reach to SQLite dwellers.






------------------------------

Subject: Digest Footer

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


------------------------------

End of sqlite-users Digest, Vol 110, Issue 15
*********************************************
*******************************************************************************
This e-mail message is intended only for the addressee(s) and contains
information which may be confidential. If you are not the intended
recipient please do not read, save, forward, disclose or copy the contents
of this e-mail. If this e-mail has been sent to you in error, please delete this
e-mail and any copies or links to this e-mail completely and immediately
from your system. We also like to inform you that communication via e-mail
over the Internet is insecure because third parties may have the possibility
to access and manipulate e-mails.

Any views expressed in this message are those of the individual sender,
except where the sender specifically states them to be the views of
The Swatch Group Ltd.
*******************************************************************************
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Group contiguous rows (islands)

Rossel, Jonathan
In reply to this post by Rossel, Jonathan
@ Simon,

Thanks for the input ! I was afraid someone was going to mention the dreaded recursive CTEs.

Jonathan




*******************************************************************************
This e-mail message is intended only for the addressee(s) and contains
information which may be confidential. If you are not the intended
recipient please do not read, save, forward, disclose or copy the contents
of this e-mail. If this e-mail has been sent to you in error, please delete this
e-mail and any copies or links to this e-mail completely and immediately
from your system. We also like to inform you that communication via e-mail
over the Internet is insecure because third parties may have the possibility
to access and manipulate e-mails.

Any views expressed in this message are those of the individual sender,
except where the sender specifically states them to be the views of
The Swatch Group Ltd.
*******************************************************************************

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Group contiguous rows (islands)

E.Pasma
In reply to this post by Jean-Luc Hainaut
15 feb 2017, Jean-Luc Hainaut:

>
> You could try this, inspired by classic algorithms of temporal  
> databases:
>
> create table T(date integer,test char(12));
> insert into T
> values (1,'clim'),(3,'clim'),(7,'amb'),(10,'amb'),(12,'xxx'),
> (13,'clim'),(15,'clim'),(20,'clim'),(22,'amb'),(25,'amb');
>
> create table TT(seq integer not null primary key autoincrement,date  
> integer,test char(12));
> insert into TT(date,test) select * from T order by date;
>
> select T1.date, T3.date, T1.test
> from   TT T1, TT T3
> -- More efficient than "where  T1.date <= T3.date"
> where  T1.seq <= T3.seq
> and    T1.test = T3.test
> and    not exists(select * from TT where seq = T1.seq-1 and test =  
> T1.test)
> and    not exists(select * from TT where seq = T3.seq+1 and test =  
> T3.test)
> and    not exists(select *
>                 from   TT T2
>                 -- More efficient than "where  T2.date between  
> T1.date and T3.date"
>                 where  T2.seq between T1.seq and T3.seq
>                 and    T2.test <> T1.test);
>
> Result:
>
> +------+------+------+
> | date | date | test |
> +------+------+------+
> | 1    | 3    | clim |
> | 7    | 10   | amb  |
> | 12   | 12   | xxx  |
> | 13   | 20   | clim |
> | 22   | 25   | amb  |
> +------+------+------+
>
Hello,  the query below is simpler. May be slower. But looks pretty  
relational. Thanks, E Pasma.

create table T(date integer,test char(12));
insert into T
values (1,'clim'),(3,'clim'),(7,'amb'),(10,'amb'),(12,'xxx'),
(13,'clim'),(15,'clim'),(20,'clim'),(22,'amb'),(25,'amb');

select min(date) as fromdate, max(date) as enddate, test
from    (--get closest preceeding different key
     select t.*, max(t2.date) as key2
     from t
     left join t t2
     on t2.date<t.date and t2.test<>t.test
     group by t.date
         )
group by key2
;


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Group contiguous rows (islands)

Jean-Luc Hainaut
On 15/02/2017 18:34, E.Pasma wrote:

>
> Hello,  the query below is simpler. May be slower. But looks pretty
> relational. Thanks, E Pasma.
>
> create table T(date integer,test char(12));
> insert into T
> values (1,'clim'),(3,'clim'),(7,'amb'),(10,'amb'),(12,'xxx'),
> (13,'clim'),(15,'clim'),(20,'clim'),(22,'amb'),(25,'amb');
>
> select min(date) as fromdate, max(date) as enddate, test
> from    (--get closest preceeding different key
>     select t.*, max(t2.date) as key2
>     from t
>     left join t t2
>     on t2.date<t.date and t2.test<>t.test
>     group by t.date
>         )
> group by key2

Quite nice solution indeed!
For those who may feel uncomfortable with outer joins, the from clause
could be written as a subquery:

from (select date, test, (select  max(date)
                                           from    t t2
                                           where  t2.date < t.date
                                           and      t2.test <> t.test)
as key2)

Thanks

J-L

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Group contiguous rows (islands)

E.Pasma

Jean-Luc Hainaut:

> On 15/02/2017 18:34, E.Pasma wrote:
>>
>> Hello,  the query below is simpler. May be slower. But looks pretty  
>> relational. Thanks, E Pasma.
>>
>> create table T(date integer,test char(12));
>> insert into T
>> values (1,'clim'),(3,'clim'),(7,'amb'),(10,'amb'),(12,'xxx'),
>> (13,'clim'),(15,'clim'),(20,'clim'),(22,'amb'),(25,'amb');
>>
>> select min(date) as fromdate, max(date) as enddate, test
>> from    (--get closest preceeding different key
>>    select t.*, max(t2.date) as key2
>>    from t
>>    left join t t2
>>    on t2.date<t.date and t2.test<>t.test
>>    group by t.date
>>        )
>> group by key2
>
> Quite nice solution indeed!
> For those who may feel uncomfortable with outer joins, the from  
> clause could be written as a subquery:
>
> from (select date, test, (select  max(date)
>                                          from    t t2
>                                          where  t2.date < t.date
>                                          and      t2.test <> t.test)  
> as key2)
>
> Thanks
>
> J-L
>
this way you may also try to optimise speed by using ORDER BY & LIMIT  
1 instead of MAX

from (select date, test, (select t2.date
                                       from  t t2
                                       where t2.date < t.date
                                       and t2.test <>  t.test
                                       order by t2.date desc limit 1)  
as key2
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Group contiguous rows (islands)

Rossel, Jonathan
In reply to this post by Rossel, Jonathan
@ Pasma and Hainaut,

Thanks again, that looks promising !

Jonathan

Message: 42
Date: Wed, 15 Feb 2017 21:10:10 +0100
From: "E.Pasma" <[hidden email]>
To: SQLite mailing list <[hidden email]>
Subject: Re: [sqlite] Group contiguous rows (islands)
Message-ID: <[hidden email]>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes


Jean-Luc Hainaut:

> On 15/02/2017 18:34, E.Pasma wrote:
>>
>> Hello,  the query below is simpler. May be slower. But looks pretty  
>> relational. Thanks, E Pasma.
>>
>> create table T(date integer,test char(12));
>> insert into T
>> values (1,'clim'),(3,'clim'),(7,'amb'),(10,'amb'),(12,'xxx'),
>> (13,'clim'),(15,'clim'),(20,'clim'),(22,'amb'),(25,'amb');
>>
>> select min(date) as fromdate, max(date) as enddate, test
>> from    (--get closest preceeding different key
>>    select t.*, max(t2.date) as key2
>>    from t
>>    left join t t2
>>    on t2.date<t.date and t2.test<>t.test
>>    group by t.date
>>        )
>> group by key2
>
> Quite nice solution indeed!
> For those who may feel uncomfortable with outer joins, the from  
> clause could be written as a subquery:
>
> from (select date, test, (select  max(date)
>                                          from    t t2
>                                          where  t2.date < t.date
>                                          and      t2.test <> t.test)  
> as key2)
>
> Thanks
>
> J-L
>
this way you may also try to optimise speed by using ORDER BY & LIMIT  
1 instead of MAX

from (select date, test, (select t2.date
                                       from  t t2
                                       where t2.date < t.date
                                       and t2.test <>  t.test
                                       order by t2.date desc limit 1)  
as key2


*******************************************************************************
This e-mail message is intended only for the addressee(s) and contains
information which may be confidential. If you are not the intended
recipient please do not read, save, forward, disclose or copy the contents
of this e-mail. If this e-mail has been sent to you in error, please delete this
e-mail and any copies or links to this e-mail completely and immediately
from your system. We also like to inform you that communication via e-mail
over the Internet is insecure because third parties may have the possibility
to access and manipulate e-mails.

Any views expressed in this message are those of the individual sender,
except where the sender specifically states them to be the views of
The Swatch Group Ltd.
*******************************************************************************
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users