Efficient query to count number of leaves in a DAG.

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

Efficient query to count number of leaves in a DAG.

Shane Dev
Hi,

I want to the count the number of leaves (descendants without children) for
each node in a DAG

DAG definition -

CREATE TABLE nodes(id integer primary key, description text);
CREATE TABLE edges(parent not null references nodes, child not null
references nodes, primary key(parent, child));

My query -

CREATE VIEW v_count_leaves as with recursive r(id, top) as (
select id, id from nodes
union all
select t.id, top from nodes as t, edges as e, r where e.parent=r.id and t.id
=e.child)
select top, count(*) from r where top<>id and id not in (select parent from
edges where parent=id) group by top;

It seems to work but is complex to understand and debug despite my aim to
keep it simple as possible, but more importantly - it is very slow when
there are more than a few thousand nodes and edges.

It there a more efficient (and ideally simpler) way?
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Efficient query to count number of leaves in a DAG.

petern
Shane.  I sent you a query to work with the crippled schema and index you
proposed for TABLE edges.
Clemens then explicitly suggested you correct the schema to have use of
automatic covering index.

>CREATE TABLE edges(parent not null references nodes, child not null
>references nodes, primary key(parent, child));

Try your leaf counter again - after making the schema changes Clemens
suggested.

Peter


On Mon, Jan 1, 2018 at 8:13 AM, Shane Dev <[hidden email]> wrote:

> Hi,
>
> I want to the count the number of leaves (descendants without children) for
> each node in a DAG
>
> DAG definition -
>
> CREATE TABLE nodes(id integer primary key, description text);
> CREATE TABLE edges(parent not null references nodes, child not null
> references nodes, primary key(parent, child));
>
> My query -
>
> CREATE VIEW v_count_leaves as with recursive r(id, top) as (
> select id, id from nodes
> union all
> select t.id, top from nodes as t, edges as e, r where e.parent=r.id and
> t.id
> =e.child)
> select top, count(*) from r where top<>id and id not in (select parent from
> edges where parent=id) group by top;
>
> It seems to work but is complex to understand and debug despite my aim to
> keep it simple as possible, but more importantly - it is very slow when
> there are more than a few thousand nodes and edges.
>
> It there a more efficient (and ideally simpler) way?
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Efficient query to count number of leaves in a DAG.

Shane Dev
Hi Peter,

By "schema changes Clemens suggested" I assume you mean replacing the
constraint -

....not exists (select * from edges where child=nodes.id);

with

...where id not in (select child from edges);

For this leaf count query, I need to constrain the result set to exclude
nodes which are parents -

sqlite> CREATE VIEW v_count_leaves as with recursive r(id, top) as (select
id, id from nodes union all select t.id, top from nodes as t, edges as e, r
where e.parent=r.id and t.id=e.child) select top, count(*) from r where
top<>id and id not in (select parent from edges) group by top;
sqlite> .timer on
sqlite> select * from v_count_leaves where top=4465;
4465    1
Run Time: real 75.678 user 75.671875 sys 0.000000

sqlite> select count(*) from nodes;
10000
sqlite> select count(*) from edges;
9986

It is still very slow or did you mean something else?


On 1 January 2018 at 18:04, petern <[hidden email]> wrote:

> Shane.  I sent you a query to work with the crippled schema and index you
> proposed for TABLE edges.
> Clemens then explicitly suggested you correct the schema to have use of
> automatic covering index.
>
> >CREATE TABLE edges(parent not null references nodes, child not null
> >references nodes, primary key(parent, child));
>
> Try your leaf counter again - after making the schema changes Clemens
> suggested.
>
> Peter
>
>
> On Mon, Jan 1, 2018 at 8:13 AM, Shane Dev <[hidden email]> wrote:
>
> > Hi,
> >
> > I want to the count the number of leaves (descendants without children)
> for
> > each node in a DAG
> >
> > DAG definition -
> >
> > CREATE TABLE nodes(id integer primary key, description text);
> > CREATE TABLE edges(parent not null references nodes, child not null
> > references nodes, primary key(parent, child));
> >
> > My query -
> >
> > CREATE VIEW v_count_leaves as with recursive r(id, top) as (
> > select id, id from nodes
> > union all
> > select t.id, top from nodes as t, edges as e, r where e.parent=r.id and
> > t.id
> > =e.child)
> > select top, count(*) from r where top<>id and id not in (select parent
> from
> > edges where parent=id) group by top;
> >
> > It seems to work but is complex to understand and debug despite my aim to
> > keep it simple as possible, but more importantly - it is very slow when
> > there are more than a few thousand nodes and edges.
> >
> > It there a more efficient (and ideally simpler) way?
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Efficient query to count number of leaves in a DAG.

Dinu
In reply to this post by Shane Dev
If a different perspective may be helpful to you:
If moving overhead to writes is an option (ie you dont have many or time
critical writes), then the tree descendants problem can be sped up to
stellar speeds by using a path column.

IE.
add a column "path" in the nodes table that would contain something like
"1.2.3" for node 3 that is descendant of node 2 that is descendant of node
1.
Then querying 2's descendant would result in the following range:
"path">='1.2.' AND "path"<'1.2/' or you can try using LIKE semantics - both
can use an index if you are careful with collation; I found out the hard way
that the range queries are more resilient and portable, SQLite and others
have a pretty awkward way of plugging in the LIKE optimisations that may
result in the index being skipped for not-so-obvious reasons.

Inserting nodes is trivial, but moving edges requires an algorithm to update
paths (whenever a node's parent changes, all descendant's paths must be
updated). However, for most real-world ontology use scenarios, this
opperation happens very rarely and usually on the admin range of functions,
so you can afford this operation that can be pretty slow.



--
Sent from: http://sqlite.1065341.n5.nabble.com/
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Efficient query to count number of leaves in a DAG.

Lifepillar
On 02/01/2018 06:54, Dinu wrote:
> If a different perspective may be helpful to you:
> If moving overhead to writes is an option (ie you dont have many or time
> critical writes), then the tree descendants problem can be sped up to
> stellar speeds by using a path column.

In a more relational spirit, you might add a Leaves(node) table to
contain all the leaves (where 'node' is a foreign key to your Nodes
table). Query the leaves becomes trivial then.

Adding a new node will require two inserts, one into Nodes and one into
Leaves. When you add a new edge (u,v), where u and v are existing nodes,
you must check whether u is in Leaves and, if so, delete it from there.
Edge deletions and updates are a bit more involved, but manageable as
well.

In a similar vein, if you often need to query descendants/ancestors of a
node, you might as well maintain the (reflexive)/transitive closure of
your graph explicitly in a separate table TC(ancestor,descendant).
There is a SQLite3 extension called 'closure' which might help with
that as well (I haven't used it, though).

Another idea is to keep the out-degree of each node in the Nodes table,
and update it each time you add/remove edges, e.g., with triggers.

Hope this helps,
Life.

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Efficient query to count number of leaves in a DAG.

Dinu
Yes, Lifepillar's way is the more orthodox approach, however I always
preferred the path-based one because:
1) One seldom runs queries only based on the descendants map; there usually
is an "AND -some other conditions-" involved; thus the ability to have one
covering index of the condition comes in very handy
2) In ontologies you very often also require "structural" sorting - such as,
the chapters in a book may need to be read like 1<1.1<1.1.A<1.1.B<1.2 etc;
this is very likely to be needed at some point and only achievable using
some modified form of the path field (with alpa-sortable numbers). This
cannot be achieved with a normalized descendency map.



--
Sent from: http://sqlite.1065341.n5.nabble.com/
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Efficient query to count number of leaves in a DAG.

David Raymond
In reply to this post by Shane Dev
I think you need a union there instead of a union all. Otherwise you're double (or more) counting leaves where there is more than 1 path to get to the leaf.

I don't have a large dataset to test it on, but how about something like:

create table nodes
(
  id integer primary key,
  description text
);

create table edges
(
  parent int not null references nodes,
  child int not null references nodes,
  primary key (parent, child),
  check (parent != child)
) without rowid;
create index reverseEdges on edges (child, parent);

create view leafCounts as with recursive
leaves (id) as (
  select nodes.id
  from nodes left outer join edges
  on nodes.id = edges.parent
  where edges.parent is null
),
paths (parent, child) as (
  select parent, child from edges
  union
  select paths.parent, edges.child
  from paths inner join edges
  on paths.child = edges.parent
)
select parent, count(*) as leafCount
from paths
where child in leaves
group by parent;


-----Original Message-----
From: sqlite-users [mailto:[hidden email]] On Behalf Of Shane Dev
Sent: Monday, January 01, 2018 11:14 AM
To: SQLite mailing list
Subject: [sqlite] Efficient query to count number of leaves in a DAG.

Hi,

I want to the count the number of leaves (descendants without children) for
each node in a DAG

DAG definition -

CREATE TABLE nodes(id integer primary key, description text);
CREATE TABLE edges(parent not null references nodes, child not null
references nodes, primary key(parent, child));

My query -

CREATE VIEW v_count_leaves as with recursive r(id, top) as (
select id, id from nodes
union all
select t.id, top from nodes as t, edges as e, r where e.parent=r.id and t.id
=e.child)
select top, count(*) from r where top<>id and id not in (select parent from
edges where parent=id) group by top;

It seems to work but is complex to understand and debug despite my aim to
keep it simple as possible, but more importantly - it is very slow when
there are more than a few thousand nodes and edges.

It there a more efficient (and ideally simpler) way?
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Efficient query to count number of leaves in a DAG.

Shane Dev
Hi David,

Nice work! your query is far quicker than mine- even without the
reverseEdges index. I think you are right about the problem of potentially
double counting leaves. There weren't any multi-parent nodes in my test
data so I didn't notice this mistake.

Could you please explain why your query is so much faster?

On 2 January 2018 at 17:50, David Raymond <[hidden email]> wrote:

> I think you need a union there instead of a union all. Otherwise you're
> double (or more) counting leaves where there is more than 1 path to get to
> the leaf.
>
> I don't have a large dataset to test it on, but how about something like:
>
> create table nodes
> (
>   id integer primary key,
>   description text
> );
>
> create table edges
> (
>   parent int not null references nodes,
>   child int not null references nodes,
>   primary key (parent, child),
>   check (parent != child)
> ) without rowid;
> create index reverseEdges on edges (child, parent);
>
> create view leafCounts as with recursive
> leaves (id) as (
>   select nodes.id
>   from nodes left outer join edges
>   on nodes.id = edges.parent
>   where edges.parent is null
> ),
> paths (parent, child) as (
>   select parent, child from edges
>   union
>   select paths.parent, edges.child
>   from paths inner join edges
>   on paths.child = edges.parent
> )
> select parent, count(*) as leafCount
> from paths
> where child in leaves
> group by parent;
>
>
> -----Original Message-----
> From: sqlite-users [mailto:[hidden email]]
> On Behalf Of Shane Dev
> Sent: Monday, January 01, 2018 11:14 AM
> To: SQLite mailing list
> Subject: [sqlite] Efficient query to count number of leaves in a DAG.
>
> Hi,
>
> I want to the count the number of leaves (descendants without children) for
> each node in a DAG
>
> DAG definition -
>
> CREATE TABLE nodes(id integer primary key, description text);
> CREATE TABLE edges(parent not null references nodes, child not null
> references nodes, primary key(parent, child));
>
> My query -
>
> CREATE VIEW v_count_leaves as with recursive r(id, top) as (
> select id, id from nodes
> union all
> select t.id, top from nodes as t, edges as e, r where e.parent=r.id and
> t.id
> =e.child)
> select top, count(*) from r where top<>id and id not in (select parent from
> edges where parent=id) group by top;
>
> It seems to work but is complex to understand and debug despite my aim to
> keep it simple as possible, but more importantly - it is very slow when
> there are more than a few thousand nodes and edges.
>
> It there a more efficient (and ideally simpler) way?
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Efficient query to count number of leaves in a DAG.

David Raymond
A couple things...

I recommend using longer names than 1 letter for your aliases, what you save in typing you lose a couple times over again when wondering what "r" is or why "t" has anything to do with "nodes"

In your CTE you're doing a 3 table join. There's no need to include the nodes table in there at all, you can get the node ID from the edge table.
...union all select e.child, top from r, edges as e where e.parent = r.id)...

The big thing though is in the where clause.
where...and id not in (select parent from edges where parent = id)...

The "where parent = id" bit turns that into a <correlated> sub query, meaning it runs it once for each row.
It's kind of saying "where X is in the bucket of (X's that are in the bucket of Y's)"

If you get rid of that...
where...and id not in (select parent from edges)...
It becomes a non-correlated sub query which only has to be run once. "where X is in the bucket of Y's"


Making those 2 tweaks to your query let's look at the plans. You can see we drop the search of the nodes table, and subquery 4 is no longer correlated.

Old:
sqlite> explain query plan with recursive r (id, top) as (select id, id from nodes union all select t.id, top from nodes as t, edges as e, r where e.parent = r.id and t.id = e.child) select top, count(*) from r where top != id and id not in (select parent from edges where parent = id) group by top;
selectid|order|from|detail
2|0|0|SCAN TABLE nodes
3|0|1|SCAN TABLE edges AS e
3|1|0|SEARCH TABLE nodes AS t USING INTEGER PRIMARY KEY (rowid=?)
3|2|2|SCAN TABLE r
1|0|0|COMPOUND SUBQUERIES 0 AND 0 (UNION ALL)
0|0|0|SCAN SUBQUERY 1
0|0|0|EXECUTE CORRELATED LIST SUBQUERY 4
4|0|0|SCAN TABLE edges
0|0|0|USE TEMP B-TREE FOR GROUP BY


New:
sqlite> explain query plan with recursive r (id, top) as (select id, id from nodes union all select e.child, top from edges as e, r where e.parent = r.id) select top, count(*) from r where top != id and id not in (select parent from edges) group by top;
selectid|order|from|detail
2|0|0|SCAN TABLE nodes
3|0|0|SCAN TABLE edges AS e
3|1|1|SCAN TABLE r
1|0|0|COMPOUND SUBQUERIES 0 AND 0 (UNION ALL)
0|0|0|SCAN SUBQUERY 1
0|0|0|EXECUTE LIST SUBQUERY 4
4|0|0|SCAN TABLE edges
0|0|0|USE TEMP B-TREE FOR GROUP BY


Now give your modified query a go and let me know how it compares to what I came up with.


-----Original Message-----
From: sqlite-users [mailto:[hidden email]] On Behalf Of Shane Dev
Sent: Wednesday, January 03, 2018 12:45 AM
To: SQLite mailing list
Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.

Hi David,

Nice work! your query is far quicker than mine- even without the
reverseEdges index. I think you are right about the problem of potentially
double counting leaves. There weren't any multi-parent nodes in my test
data so I didn't notice this mistake.

Could you please explain why your query is so much faster?

On 2 January 2018 at 17:50, David Raymond <[hidden email]> wrote:

> I think you need a union there instead of a union all. Otherwise you're
> double (or more) counting leaves where there is more than 1 path to get to
> the leaf.
>
> I don't have a large dataset to test it on, but how about something like:
>
> create table nodes
> (
>   id integer primary key,
>   description text
> );
>
> create table edges
> (
>   parent int not null references nodes,
>   child int not null references nodes,
>   primary key (parent, child),
>   check (parent != child)
> ) without rowid;
> create index reverseEdges on edges (child, parent);
>
> create view leafCounts as with recursive
> leaves (id) as (
>   select nodes.id
>   from nodes left outer join edges
>   on nodes.id = edges.parent
>   where edges.parent is null
> ),
> paths (parent, child) as (
>   select parent, child from edges
>   union
>   select paths.parent, edges.child
>   from paths inner join edges
>   on paths.child = edges.parent
> )
> select parent, count(*) as leafCount
> from paths
> where child in leaves
> group by parent;
>
>
> -----Original Message-----
> From: sqlite-users [mailto:[hidden email]]
> On Behalf Of Shane Dev
> Sent: Monday, January 01, 2018 11:14 AM
> To: SQLite mailing list
> Subject: [sqlite] Efficient query to count number of leaves in a DAG.
>
> Hi,
>
> I want to the count the number of leaves (descendants without children) for
> each node in a DAG
>
> DAG definition -
>
> CREATE TABLE nodes(id integer primary key, description text);
> CREATE TABLE edges(parent not null references nodes, child not null
> references nodes, primary key(parent, child));
>
> My query -
>
> CREATE VIEW v_count_leaves as with recursive r(id, top) as (
> select id, id from nodes
> union all
> select t.id, top from nodes as t, edges as e, r where e.parent=r.id and
> t.id
> =e.child)
> select top, count(*) from r where top<>id and id not in (select parent from
> edges where parent=id) group by top;
>
> It seems to work but is complex to understand and debug despite my aim to
> keep it simple as possible, but more importantly - it is very slow when
> there are more than a few thousand nodes and edges.
>
> It there a more efficient (and ideally simpler) way?
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Efficient query to count number of leaves in a DAG.

Stephen Chrzanowski
Playing devils advocate, there may be a purpose to those aliases
(Application code requirement, which can't be changed), and/or, the single
letter aliases are documented and understood throughout the application.

But, you're absolutely correct.  The only time I personally use single
letter variables is when I'm doing tight loops.  Like "for x:=.." and "X"
doesn't go beyond what can be read in 10pt font on a 640x480 screen.  Its a
throw back to my C64 coding days. ;)

On Wed, Jan 3, 2018 at 9:55 AM, David Raymond <[hidden email]>
wrote:

> A couple things...
>
> I recommend using longer names than 1 letter for your aliases, what you
> save in typing you lose a couple times over again when wondering what "r"
> is or why "t" has anything to do with "nodes"
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Efficient query to count number of leaves in a DAG.

Shane Dev
In reply to this post by David Raymond
Hi David

I recommend using longer names than 1 letter for your aliases, what you
> save in typing you lose a couple times over again when wondering what "r"
> is or why "t" has anything to do with "nodes"
>

Fair enough. I tend to use shorts names to reduce the risk of typos. My
original node table was called "tasks". I tried to simplify the query for
this forum post but neglected to change the alias.

>
> In your CTE you're doing a 3 table join. There's no need to include the
> nodes table in there at all, you can get the node ID from the edge table.
> ...union all select e.child, top from r, edges as e where e.parent = r.id
> )...
>

You're right in this case. My original node table "tasks" had more columns
which I wanted in the final result set.

>
> The big thing though is in the where clause.
> where...and id not in (select parent from edges where parent = id)...
>

That was a sloppy mistake, I changed it to  ..and id not in (select parent
from edges)... but it was still very slow

>
> Old:
> sqlite> explain query plan with recursive r (id, top) as (select id, id
> from nodes union all select t.id, top from nodes as t, edges as e, r
> where e.parent = r.id and t.id = e.child) select top, count(*) from r
> where top != id and id not in (select parent from edges where parent = id)
> group by top;
>

CREATE VIEW v_count_leaves as with recursive r (id, top) as (select id, id
from nodes union all select t.id, top from nodes as t, edges as e, r where
e.parent = r.id and t.id = e.child) select top, count(*) from r where top
!= id and id not in (select parent from edges where parent = id) group by
top;

sqlite> select * from v_count_leaves where top=679;
top     count(*)
679     2
Run Time: real 73.365 user 73.328125 sys 0.000000


> New:
> sqlite> explain query plan with recursive r (id, top) as (select id, id
> from nodes union all select e.child, top from edges as e, r where e.parent
> = r.id) select top, count(*) from r where top != id and id not in (select
> parent from edges) group by top;
> Now give your modified query a go and let me know how it compares to what
> I came up with.
>

CREATE VIEW v_count_leaves_new as with recursive r (id, top) as (select id,
id from nodes union all select e.child, top from edges as e, r where
e.parent = r.id) select top, count(*) from r where top != id and id not in
(select parent from edges) group by top;

sqlite> select * from v_count_leaves_new where top=679;
top     count(*)
679     2
Run Time: real 45.099 user 45.093750 sys 0.000000

faster, but about 8 times slower than your query -

sqlite> select * from leafcounts where parent=679;
parent  leafCount
679     2
Run Time: real 5.639 user 5.640625 sys 0.000000

and that is without the reverseEdges index.

I still don't understand why "leafcounts" is so much faster than
"v_count_leaves_new"




> -----Original Message-----
> From: sqlite-users [mailto:[hidden email]]
> On Behalf Of Shane Dev
> Sent: Wednesday, January 03, 2018 12:45 AM
> To: SQLite mailing list
> Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.
>
> Hi David,
>
> Nice work! your query is far quicker than mine- even without the
> reverseEdges index. I think you are right about the problem of potentially
> double counting leaves. There weren't any multi-parent nodes in my test
> data so I didn't notice this mistake.
>
> Could you please explain why your query is so much faster?
>
> On 2 January 2018 at 17:50, David Raymond <[hidden email]>
> wrote:
>
> > I think you need a union there instead of a union all. Otherwise you're
> > double (or more) counting leaves where there is more than 1 path to get
> to
> > the leaf.
> >
> > I don't have a large dataset to test it on, but how about something like:
> >
> > create table nodes
> > (
> >   id integer primary key,
> >   description text
> > );
> >
> > create table edges
> > (
> >   parent int not null references nodes,
> >   child int not null references nodes,
> >   primary key (parent, child),
> >   check (parent != child)
> > ) without rowid;
> > create index reverseEdges on edges (child, parent);
> >
> > create view leafCounts as with recursive
> > leaves (id) as (
> >   select nodes.id
> >   from nodes left outer join edges
> >   on nodes.id = edges.parent
> >   where edges.parent is null
> > ),
> > paths (parent, child) as (
> >   select parent, child from edges
> >   union
> >   select paths.parent, edges.child
> >   from paths inner join edges
> >   on paths.child = edges.parent
> > )
> > select parent, count(*) as leafCount
> > from paths
> > where child in leaves
> > group by parent;
> >
> >
> > -----Original Message-----
> > From: sqlite-users [mailto:[hidden email]]
> > On Behalf Of Shane Dev
> > Sent: Monday, January 01, 2018 11:14 AM
> > To: SQLite mailing list
> > Subject: [sqlite] Efficient query to count number of leaves in a DAG.
> >
> > Hi,
> >
> > I want to the count the number of leaves (descendants without children)
> for
> > each node in a DAG
> >
> > DAG definition -
> >
> > CREATE TABLE nodes(id integer primary key, description text);
> > CREATE TABLE edges(parent not null references nodes, child not null
> > references nodes, primary key(parent, child));
> >
> > My query -
> >
> > CREATE VIEW v_count_leaves as with recursive r(id, top) as (
> > select id, id from nodes
> > union all
> > select t.id, top from nodes as t, edges as e, r where e.parent=r.id and
> > t.id
> > =e.child)
> > select top, count(*) from r where top<>id and id not in (select parent
> from
> > edges where parent=id) group by top;
> >
> > It seems to work but is complex to understand and debug despite my aim to
> > keep it simple as possible, but more importantly - it is very slow when
> > there are more than a few thousand nodes and edges.
> >
> > It there a more efficient (and ideally simpler) way?
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Efficient query to count number of leaves in a DAG.

David Raymond
Hmm. Maybe try yours with union instead of union all? Though if there's only 1 path between any pair of nodes that shouldn't make too much difference. Otherwise I'm getting low on ideas.

What're the record counts for nodes and edges?

-----Original Message-----
From: sqlite-users [mailto:[hidden email]] On Behalf Of Shane Dev
Sent: Thursday, January 04, 2018 5:20 PM
To: SQLite mailing list
Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.

Hi David

I recommend using longer names than 1 letter for your aliases, what you
> save in typing you lose a couple times over again when wondering what "r"
> is or why "t" has anything to do with "nodes"
>

Fair enough. I tend to use shorts names to reduce the risk of typos. My
original node table was called "tasks". I tried to simplify the query for
this forum post but neglected to change the alias.

>
> In your CTE you're doing a 3 table join. There's no need to include the
> nodes table in there at all, you can get the node ID from the edge table.
> ...union all select e.child, top from r, edges as e where e.parent = r.id
> )...
>

You're right in this case. My original node table "tasks" had more columns
which I wanted in the final result set.

>
> The big thing though is in the where clause.
> where...and id not in (select parent from edges where parent = id)...
>

That was a sloppy mistake, I changed it to  ..and id not in (select parent
from edges)... but it was still very slow

>
> Old:
> sqlite> explain query plan with recursive r (id, top) as (select id, id
> from nodes union all select t.id, top from nodes as t, edges as e, r
> where e.parent = r.id and t.id = e.child) select top, count(*) from r
> where top != id and id not in (select parent from edges where parent = id)
> group by top;
>

CREATE VIEW v_count_leaves as with recursive r (id, top) as (select id, id
from nodes union all select t.id, top from nodes as t, edges as e, r where
e.parent = r.id and t.id = e.child) select top, count(*) from r where top
!= id and id not in (select parent from edges where parent = id) group by
top;

sqlite> select * from v_count_leaves where top=679;
top     count(*)
679     2
Run Time: real 73.365 user 73.328125 sys 0.000000


> New:
> sqlite> explain query plan with recursive r (id, top) as (select id, id
> from nodes union all select e.child, top from edges as e, r where e.parent
> = r.id) select top, count(*) from r where top != id and id not in (select
> parent from edges) group by top;
> Now give your modified query a go and let me know how it compares to what
> I came up with.
>

CREATE VIEW v_count_leaves_new as with recursive r (id, top) as (select id,
id from nodes union all select e.child, top from edges as e, r where
e.parent = r.id) select top, count(*) from r where top != id and id not in
(select parent from edges) group by top;

sqlite> select * from v_count_leaves_new where top=679;
top     count(*)
679     2
Run Time: real 45.099 user 45.093750 sys 0.000000

faster, but about 8 times slower than your query -

sqlite> select * from leafcounts where parent=679;
parent  leafCount
679     2
Run Time: real 5.639 user 5.640625 sys 0.000000

and that is without the reverseEdges index.

I still don't understand why "leafcounts" is so much faster than
"v_count_leaves_new"




> -----Original Message-----
> From: sqlite-users [mailto:[hidden email]]
> On Behalf Of Shane Dev
> Sent: Wednesday, January 03, 2018 12:45 AM
> To: SQLite mailing list
> Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.
>
> Hi David,
>
> Nice work! your query is far quicker than mine- even without the
> reverseEdges index. I think you are right about the problem of potentially
> double counting leaves. There weren't any multi-parent nodes in my test
> data so I didn't notice this mistake.
>
> Could you please explain why your query is so much faster?
>
> On 2 January 2018 at 17:50, David Raymond <[hidden email]>
> wrote:
>
> > I think you need a union there instead of a union all. Otherwise you're
> > double (or more) counting leaves where there is more than 1 path to get
> to
> > the leaf.
> >
> > I don't have a large dataset to test it on, but how about something like:
> >
> > create table nodes
> > (
> >   id integer primary key,
> >   description text
> > );
> >
> > create table edges
> > (
> >   parent int not null references nodes,
> >   child int not null references nodes,
> >   primary key (parent, child),
> >   check (parent != child)
> > ) without rowid;
> > create index reverseEdges on edges (child, parent);
> >
> > create view leafCounts as with recursive
> > leaves (id) as (
> >   select nodes.id
> >   from nodes left outer join edges
> >   on nodes.id = edges.parent
> >   where edges.parent is null
> > ),
> > paths (parent, child) as (
> >   select parent, child from edges
> >   union
> >   select paths.parent, edges.child
> >   from paths inner join edges
> >   on paths.child = edges.parent
> > )
> > select parent, count(*) as leafCount
> > from paths
> > where child in leaves
> > group by parent;
> >
> >
> > -----Original Message-----
> > From: sqlite-users [mailto:[hidden email]]
> > On Behalf Of Shane Dev
> > Sent: Monday, January 01, 2018 11:14 AM
> > To: SQLite mailing list
> > Subject: [sqlite] Efficient query to count number of leaves in a DAG.
> >
> > Hi,
> >
> > I want to the count the number of leaves (descendants without children)
> for
> > each node in a DAG
> >
> > DAG definition -
> >
> > CREATE TABLE nodes(id integer primary key, description text);
> > CREATE TABLE edges(parent not null references nodes, child not null
> > references nodes, primary key(parent, child));
> >
> > My query -
> >
> > CREATE VIEW v_count_leaves as with recursive r(id, top) as (
> > select id, id from nodes
> > union all
> > select t.id, top from nodes as t, edges as e, r where e.parent=r.id and
> > t.id
> > =e.child)
> > select top, count(*) from r where top<>id and id not in (select parent
> from
> > edges where parent=id) group by top;
> >
> > It seems to work but is complex to understand and debug despite my aim to
> > keep it simple as possible, but more importantly - it is very slow when
> > there are more than a few thousand nodes and edges.
> >
> > It there a more efficient (and ideally simpler) way?
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Efficient query to count number of leaves in a DAG.

Shane Dev
Hi David,

According to https://sqlite.org/lang_with.html, "Optimization note: ...if
the example had used UNION instead of UNION ALL, then SQLite would have had
to keep around all previously generated content in order to check for
duplicates. For this reason, programmers should strive to use UNION ALL
instead of UNION when feasible."

Despite that, your RCTE with UNION is much faster than mine.

sqlite> select count(*) from nodes;
count(*)
10000
sqlite> select count(*) from edges;
count(*)
9990

Here is how create my test data -

sqlite> .sch v_generate_nodes
-- Generates an infinite series of x, 'nodex' records where x = 1, 2, 3 ...
CREATE VIEW v_generate_nodes as with recursive rcte(id, description) as
(select 1, 'node1' union all select id+1, 'node'||(id+1) from rcte) select
* from rcte;
sqlite> insert into nodes select from v_generate_nodes limit 10000;

sqlite> .sch v_generate_edges
-- Randomly generates edges between entries in the nodes table.
---Assumption : node ids are 1, 2, 3...n without gaps
-- Each node will have 0 or 1 parents and 0, 1, 2, ... children
CREATE VIEW v_generate_edges as with rcte(parent, child) as (select
cast(abs(random())/9223372036854775808 as integer), 1 union all select
cast(abs(random())/9223372036854775808*(child+1) as integer), child+1 from
rcte where child <= (select count(*) from nodes) limit (select count(*)
from nodes)) select * from rcte where parent>0;
sqlite> insert into edges select * from v_generate_edges;



On 5 January 2018 at 18:32, David Raymond <[hidden email]> wrote:

> Hmm. Maybe try yours with union instead of union all? Though if there's
> only 1 path between any pair of nodes that shouldn't make too much
> difference. Otherwise I'm getting low on ideas.
>
> What're the record counts for nodes and edges?
>
> -----Original Message-----
> From: sqlite-users [mailto:[hidden email]]
> On Behalf Of Shane Dev
> Sent: Thursday, January 04, 2018 5:20 PM
> To: SQLite mailing list
> Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.
>
> Hi David
>
> I recommend using longer names than 1 letter for your aliases, what you
> > save in typing you lose a couple times over again when wondering what "r"
> > is or why "t" has anything to do with "nodes"
> >
>
> Fair enough. I tend to use shorts names to reduce the risk of typos. My
> original node table was called "tasks". I tried to simplify the query for
> this forum post but neglected to change the alias.
>
> >
> > In your CTE you're doing a 3 table join. There's no need to include the
> > nodes table in there at all, you can get the node ID from the edge table.
> > ...union all select e.child, top from r, edges as e where e.parent =
> r.id
> > )...
> >
>
> You're right in this case. My original node table "tasks" had more columns
> which I wanted in the final result set.
>
> >
> > The big thing though is in the where clause.
> > where...and id not in (select parent from edges where parent = id)...
> >
>
> That was a sloppy mistake, I changed it to  ..and id not in (select parent
> from edges)... but it was still very slow
>
> >
> > Old:
> > sqlite> explain query plan with recursive r (id, top) as (select id, id
> > from nodes union all select t.id, top from nodes as t, edges as e, r
> > where e.parent = r.id and t.id = e.child) select top, count(*) from r
> > where top != id and id not in (select parent from edges where parent =
> id)
> > group by top;
> >
>
> CREATE VIEW v_count_leaves as with recursive r (id, top) as (select id, id
> from nodes union all select t.id, top from nodes as t, edges as e, r where
> e.parent = r.id and t.id = e.child) select top, count(*) from r where top
> != id and id not in (select parent from edges where parent = id) group by
> top;
>
> sqlite> select * from v_count_leaves where top=679;
> top     count(*)
> 679     2
> Run Time: real 73.365 user 73.328125 sys 0.000000
>
>
> > New:
> > sqlite> explain query plan with recursive r (id, top) as (select id, id
> > from nodes union all select e.child, top from edges as e, r where
> e.parent
> > = r.id) select top, count(*) from r where top != id and id not in
> (select
> > parent from edges) group by top;
> > Now give your modified query a go and let me know how it compares to what
> > I came up with.
> >
>
> CREATE VIEW v_count_leaves_new as with recursive r (id, top) as (select id,
> id from nodes union all select e.child, top from edges as e, r where
> e.parent = r.id) select top, count(*) from r where top != id and id not in
> (select parent from edges) group by top;
>
> sqlite> select * from v_count_leaves_new where top=679;
> top     count(*)
> 679     2
> Run Time: real 45.099 user 45.093750 sys 0.000000
>
> faster, but about 8 times slower than your query -
>
> sqlite> select * from leafcounts where parent=679;
> parent  leafCount
> 679     2
> Run Time: real 5.639 user 5.640625 sys 0.000000
>
> and that is without the reverseEdges index.
>
> I still don't understand why "leafcounts" is so much faster than
> "v_count_leaves_new"
>
>
>
>
> > -----Original Message-----
> > From: sqlite-users [mailto:[hidden email]]
> > On Behalf Of Shane Dev
> > Sent: Wednesday, January 03, 2018 12:45 AM
> > To: SQLite mailing list
> > Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.
> >
> > Hi David,
> >
> > Nice work! your query is far quicker than mine- even without the
> > reverseEdges index. I think you are right about the problem of
> potentially
> > double counting leaves. There weren't any multi-parent nodes in my test
> > data so I didn't notice this mistake.
> >
> > Could you please explain why your query is so much faster?
> >
> > On 2 January 2018 at 17:50, David Raymond <[hidden email]>
> > wrote:
> >
> > > I think you need a union there instead of a union all. Otherwise you're
> > > double (or more) counting leaves where there is more than 1 path to get
> > to
> > > the leaf.
> > >
> > > I don't have a large dataset to test it on, but how about something
> like:
> > >
> > > create table nodes
> > > (
> > >   id integer primary key,
> > >   description text
> > > );
> > >
> > > create table edges
> > > (
> > >   parent int not null references nodes,
> > >   child int not null references nodes,
> > >   primary key (parent, child),
> > >   check (parent != child)
> > > ) without rowid;
> > > create index reverseEdges on edges (child, parent);
> > >
> > > create view leafCounts as with recursive
> > > leaves (id) as (
> > >   select nodes.id
> > >   from nodes left outer join edges
> > >   on nodes.id = edges.parent
> > >   where edges.parent is null
> > > ),
> > > paths (parent, child) as (
> > >   select parent, child from edges
> > >   union
> > >   select paths.parent, edges.child
> > >   from paths inner join edges
> > >   on paths.child = edges.parent
> > > )
> > > select parent, count(*) as leafCount
> > > from paths
> > > where child in leaves
> > > group by parent;
> > >
> > >
> > > -----Original Message-----
> > > From: sqlite-users [mailto:sqlite-users-bounces@
> mailinglists.sqlite.org]
> > > On Behalf Of Shane Dev
> > > Sent: Monday, January 01, 2018 11:14 AM
> > > To: SQLite mailing list
> > > Subject: [sqlite] Efficient query to count number of leaves in a DAG.
> > >
> > > Hi,
> > >
> > > I want to the count the number of leaves (descendants without children)
> > for
> > > each node in a DAG
> > >
> > > DAG definition -
> > >
> > > CREATE TABLE nodes(id integer primary key, description text);
> > > CREATE TABLE edges(parent not null references nodes, child not null
> > > references nodes, primary key(parent, child));
> > >
> > > My query -
> > >
> > > CREATE VIEW v_count_leaves as with recursive r(id, top) as (
> > > select id, id from nodes
> > > union all
> > > select t.id, top from nodes as t, edges as e, r where e.parent=r.id
> and
> > > t.id
> > > =e.child)
> > > select top, count(*) from r where top<>id and id not in (select parent
> > from
> > > edges where parent=id) group by top;
> > >
> > > It seems to work but is complex to understand and debug despite my aim
> to
> > > keep it simple as possible, but more importantly - it is very slow when
> > > there are more than a few thousand nodes and edges.
> > >
> > > It there a more efficient (and ideally simpler) way?
> > > _______________________________________________
> > > sqlite-users mailing list
> > > [hidden email]
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > > _______________________________________________
> > > sqlite-users mailing list
> > > [hidden email]
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > >
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Efficient query to count number of leaves in a DAG.

David Raymond
Something is seriously funky here. I'm getting the opposite, where your query appears to be going faster than mine. I used your queries there to populate nodes and edges, based on 1,000,000 nodes. I even added in the extra index which turns out isn't used anyway. With it all in memory my version is taking 59 seconds, whereas your _new version is taking 28 seconds and the old version is only 34. So apparently I should be taking query advice from you.

If I change my union into a union all it goes down to 31 seconds, so closer to yours.
If I change your union all into a union the time jumps to 156 seconds.
I think I was thinking of a graph with possible loops or multiple paths to get from A to B, which is why I went with the union.

So my next question is: what SQLite version are you using, and what hardware are you on?

Are you query plans looking like what I'm seeing here?

sqlite> select * from v_count_leaves_new where top = 777;
--EQP-- 3,0,0,SCAN TABLE nodes
--EQP-- 4,0,1,SCAN TABLE r
--EQP-- 4,1,0,SEARCH TABLE edges AS e USING COVERING INDEX sqlite_autoindex_edges_1 (parent=?)
--EQP-- 2,0,0,COMPOUND SUBQUERIES 0 AND 0 (UNION ALL)
--EQP-- 1,0,0,SCAN SUBQUERY 2
--EQP-- 0,0,0,USING INDEX sqlite_autoindex_edges_1 FOR IN-OPERATOR
--EQP-- 0,0,0,SCAN SUBQUERY 1
top|count(*)
777|314
Run Time: real 28.502 user 28.454582 sys 0.000000

--now with union all
sqlite> select * from leafCounts2 where parent = 777;
--EQP-- 3,0,0,SCAN TABLE edges
--EQP-- 4,0,0,SCAN TABLE paths
--EQP-- 4,1,1,SEARCH TABLE edges USING COVERING INDEX sqlite_autoindex_edges_1 (parent=?)
--EQP-- 2,0,0,COMPOUND SUBQUERIES 0 AND 0 (UNION ALL)
--EQP-- 1,0,0,SCAN SUBQUERY 2
--EQP-- 1,0,0,EXECUTE LIST SUBQUERY 5
--EQP-- 5,0,0,SCAN TABLE nodes
--EQP-- 5,1,1,SEARCH TABLE edges USING COVERING INDEX sqlite_autoindex_edges_1 (parent=?)
--EQP-- 0,0,0,SCAN SUBQUERY 1
parent|leafCount
777|314
Run Time: real 31.590 user 31.434202 sys 0.000000


-----Original Message-----
From: sqlite-users [mailto:[hidden email]] On Behalf Of Shane Dev
Sent: Friday, January 05, 2018 4:57 PM
To: SQLite mailing list
Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.

Hi David,

According to https://sqlite.org/lang_with.html, "Optimization note: ...if
the example had used UNION instead of UNION ALL, then SQLite would have had
to keep around all previously generated content in order to check for
duplicates. For this reason, programmers should strive to use UNION ALL
instead of UNION when feasible."

Despite that, your RCTE with UNION is much faster than mine.

sqlite> select count(*) from nodes;
count(*)
10000
sqlite> select count(*) from edges;
count(*)
9990

Here is how create my test data -

sqlite> .sch v_generate_nodes
-- Generates an infinite series of x, 'nodex' records where x = 1, 2, 3 ...
CREATE VIEW v_generate_nodes as with recursive rcte(id, description) as
(select 1, 'node1' union all select id+1, 'node'||(id+1) from rcte) select
* from rcte;
sqlite> insert into nodes select from v_generate_nodes limit 10000;

sqlite> .sch v_generate_edges
-- Randomly generates edges between entries in the nodes table.
---Assumption : node ids are 1, 2, 3...n without gaps
-- Each node will have 0 or 1 parents and 0, 1, 2, ... children
CREATE VIEW v_generate_edges as with rcte(parent, child) as (select
cast(abs(random())/9223372036854775808 as integer), 1 union all select
cast(abs(random())/9223372036854775808*(child+1) as integer), child+1 from
rcte where child <= (select count(*) from nodes) limit (select count(*)
from nodes)) select * from rcte where parent>0;
sqlite> insert into edges select * from v_generate_edges;



On 5 January 2018 at 18:32, David Raymond <[hidden email]> wrote:

> Hmm. Maybe try yours with union instead of union all? Though if there's
> only 1 path between any pair of nodes that shouldn't make too much
> difference. Otherwise I'm getting low on ideas.
>
> What're the record counts for nodes and edges?
>
> -----Original Message-----
> From: sqlite-users [mailto:[hidden email]]
> On Behalf Of Shane Dev
> Sent: Thursday, January 04, 2018 5:20 PM
> To: SQLite mailing list
> Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.
>
> Hi David
>
> I recommend using longer names than 1 letter for your aliases, what you
> > save in typing you lose a couple times over again when wondering what "r"
> > is or why "t" has anything to do with "nodes"
> >
>
> Fair enough. I tend to use shorts names to reduce the risk of typos. My
> original node table was called "tasks". I tried to simplify the query for
> this forum post but neglected to change the alias.
>
> >
> > In your CTE you're doing a 3 table join. There's no need to include the
> > nodes table in there at all, you can get the node ID from the edge table.
> > ...union all select e.child, top from r, edges as e where e.parent =
> r.id
> > )...
> >
>
> You're right in this case. My original node table "tasks" had more columns
> which I wanted in the final result set.
>
> >
> > The big thing though is in the where clause.
> > where...and id not in (select parent from edges where parent = id)...
> >
>
> That was a sloppy mistake, I changed it to  ..and id not in (select parent
> from edges)... but it was still very slow
>
> >
> > Old:
> > sqlite> explain query plan with recursive r (id, top) as (select id, id
> > from nodes union all select t.id, top from nodes as t, edges as e, r
> > where e.parent = r.id and t.id = e.child) select top, count(*) from r
> > where top != id and id not in (select parent from edges where parent =
> id)
> > group by top;
> >
>
> CREATE VIEW v_count_leaves as with recursive r (id, top) as (select id, id
> from nodes union all select t.id, top from nodes as t, edges as e, r where
> e.parent = r.id and t.id = e.child) select top, count(*) from r where top
> != id and id not in (select parent from edges where parent = id) group by
> top;
>
> sqlite> select * from v_count_leaves where top=679;
> top     count(*)
> 679     2
> Run Time: real 73.365 user 73.328125 sys 0.000000
>
>
> > New:
> > sqlite> explain query plan with recursive r (id, top) as (select id, id
> > from nodes union all select e.child, top from edges as e, r where
> e.parent
> > = r.id) select top, count(*) from r where top != id and id not in
> (select
> > parent from edges) group by top;
> > Now give your modified query a go and let me know how it compares to what
> > I came up with.
> >
>
> CREATE VIEW v_count_leaves_new as with recursive r (id, top) as (select id,
> id from nodes union all select e.child, top from edges as e, r where
> e.parent = r.id) select top, count(*) from r where top != id and id not in
> (select parent from edges) group by top;
>
> sqlite> select * from v_count_leaves_new where top=679;
> top     count(*)
> 679     2
> Run Time: real 45.099 user 45.093750 sys 0.000000
>
> faster, but about 8 times slower than your query -
>
> sqlite> select * from leafcounts where parent=679;
> parent  leafCount
> 679     2
> Run Time: real 5.639 user 5.640625 sys 0.000000
>
> and that is without the reverseEdges index.
>
> I still don't understand why "leafcounts" is so much faster than
> "v_count_leaves_new"
>
>
>
>
> > -----Original Message-----
> > From: sqlite-users [mailto:[hidden email]]
> > On Behalf Of Shane Dev
> > Sent: Wednesday, January 03, 2018 12:45 AM
> > To: SQLite mailing list
> > Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.
> >
> > Hi David,
> >
> > Nice work! your query is far quicker than mine- even without the
> > reverseEdges index. I think you are right about the problem of
> potentially
> > double counting leaves. There weren't any multi-parent nodes in my test
> > data so I didn't notice this mistake.
> >
> > Could you please explain why your query is so much faster?
> >
> > On 2 January 2018 at 17:50, David Raymond <[hidden email]>
> > wrote:
> >
> > > I think you need a union there instead of a union all. Otherwise you're
> > > double (or more) counting leaves where there is more than 1 path to get
> > to
> > > the leaf.
> > >
> > > I don't have a large dataset to test it on, but how about something
> like:
> > >
> > > create table nodes
> > > (
> > >   id integer primary key,
> > >   description text
> > > );
> > >
> > > create table edges
> > > (
> > >   parent int not null references nodes,
> > >   child int not null references nodes,
> > >   primary key (parent, child),
> > >   check (parent != child)
> > > ) without rowid;
> > > create index reverseEdges on edges (child, parent);
> > >
> > > create view leafCounts as with recursive
> > > leaves (id) as (
> > >   select nodes.id
> > >   from nodes left outer join edges
> > >   on nodes.id = edges.parent
> > >   where edges.parent is null
> > > ),
> > > paths (parent, child) as (
> > >   select parent, child from edges
> > >   union
> > >   select paths.parent, edges.child
> > >   from paths inner join edges
> > >   on paths.child = edges.parent
> > > )
> > > select parent, count(*) as leafCount
> > > from paths
> > > where child in leaves
> > > group by parent;
> > >
> > >
> > > -----Original Message-----
> > > From: sqlite-users [mailto:sqlite-users-bounces@
> mailinglists.sqlite.org]
> > > On Behalf Of Shane Dev
> > > Sent: Monday, January 01, 2018 11:14 AM
> > > To: SQLite mailing list
> > > Subject: [sqlite] Efficient query to count number of leaves in a DAG.
> > >
> > > Hi,
> > >
> > > I want to the count the number of leaves (descendants without children)
> > for
> > > each node in a DAG
> > >
> > > DAG definition -
> > >
> > > CREATE TABLE nodes(id integer primary key, description text);
> > > CREATE TABLE edges(parent not null references nodes, child not null
> > > references nodes, primary key(parent, child));
> > >
> > > My query -
> > >
> > > CREATE VIEW v_count_leaves as with recursive r(id, top) as (
> > > select id, id from nodes
> > > union all
> > > select t.id, top from nodes as t, edges as e, r where e.parent=r.id
> and
> > > t.id
> > > =e.child)
> > > select top, count(*) from r where top<>id and id not in (select parent
> > from
> > > edges where parent=id) group by top;
> > >
> > > It seems to work but is complex to understand and debug despite my aim
> to
> > > keep it simple as possible, but more importantly - it is very slow when
> > > there are more than a few thousand nodes and edges.
> > >
> > > It there a more efficient (and ideally simpler) way?
> > > _______________________________________________
> > > sqlite-users mailing list
> > > [hidden email]
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > > _______________________________________________
> > > sqlite-users mailing list
> > > [hidden email]
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > >
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Efficient query to count number of leaves in a DAG.

Shane Dev
Hi David,

I started from scratch with a new database and confirmed your findings -
v_count_leaves_new is actually faster than  leafCounts.

My error in the original database was neglecting to specify type of the
columns in the EDGES table -

CREATE TABLE edges(parent not null references nodes, child not null
references nodes, primary key(parent, child)) without rowid;

I had assumed the EDGES columns would inherit the affinity of the
referenced table key (NODES.id). However after correcting the table
definition to

CREATE TABLE edges(parent *int* not null references nodes, child *int* not
null references nodes, primary key(parent, child)) without rowid;

queries over 10000 nodes and edges dropped from 45 to 0.1 sec. A valuable
lesson for me!

On 6 January 2018 at 00:28, David Raymond <[hidden email]> wrote:

> Something is seriously funky here. I'm getting the opposite, where your
> query appears to be going faster than mine. I used your queries there to
> populate nodes and edges, based on 1,000,000 nodes. I even added in the
> extra index which turns out isn't used anyway. With it all in memory my
> version is taking 59 seconds, whereas your _new version is taking 28
> seconds and the old version is only 34. So apparently I should be taking
> query advice from you.
>
> If I change my union into a union all it goes down to 31 seconds, so
> closer to yours.
> If I change your union all into a union the time jumps to 156 seconds.
> I think I was thinking of a graph with possible loops or multiple paths to
> get from A to B, which is why I went with the union.
>
> So my next question is: what SQLite version are you using, and what
> hardware are you on?
>
> Are you query plans looking like what I'm seeing here?
>
> sqlite> select * from v_count_leaves_new where top = 777;
> --EQP-- 3,0,0,SCAN TABLE nodes
> --EQP-- 4,0,1,SCAN TABLE r
> --EQP-- 4,1,0,SEARCH TABLE edges AS e USING COVERING INDEX
> sqlite_autoindex_edges_1 (parent=?)
> --EQP-- 2,0,0,COMPOUND SUBQUERIES 0 AND 0 (UNION ALL)
> --EQP-- 1,0,0,SCAN SUBQUERY 2
> --EQP-- 0,0,0,USING INDEX sqlite_autoindex_edges_1 FOR IN-OPERATOR
> --EQP-- 0,0,0,SCAN SUBQUERY 1
> top|count(*)
> 777|314
> Run Time: real 28.502 user 28.454582 sys 0.000000
>
> --now with union all
> sqlite> select * from leafCounts2 where parent = 777;
> --EQP-- 3,0,0,SCAN TABLE edges
> --EQP-- 4,0,0,SCAN TABLE paths
> --EQP-- 4,1,1,SEARCH TABLE edges USING COVERING INDEX
> sqlite_autoindex_edges_1 (parent=?)
> --EQP-- 2,0,0,COMPOUND SUBQUERIES 0 AND 0 (UNION ALL)
> --EQP-- 1,0,0,SCAN SUBQUERY 2
> --EQP-- 1,0,0,EXECUTE LIST SUBQUERY 5
> --EQP-- 5,0,0,SCAN TABLE nodes
> --EQP-- 5,1,1,SEARCH TABLE edges USING COVERING INDEX
> sqlite_autoindex_edges_1 (parent=?)
> --EQP-- 0,0,0,SCAN SUBQUERY 1
> parent|leafCount
> 777|314
> Run Time: real 31.590 user 31.434202 sys 0.000000
>
>
> -----Original Message-----
> From: sqlite-users [mailto:[hidden email]]
> On Behalf Of Shane Dev
> Sent: Friday, January 05, 2018 4:57 PM
> To: SQLite mailing list
> Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.
>
> Hi David,
>
> According to https://sqlite.org/lang_with.html, "Optimization note: ...if
> the example had used UNION instead of UNION ALL, then SQLite would have had
> to keep around all previously generated content in order to check for
> duplicates. For this reason, programmers should strive to use UNION ALL
> instead of UNION when feasible."
>
> Despite that, your RCTE with UNION is much faster than mine.
>
> sqlite> select count(*) from nodes;
> count(*)
> 10000
> sqlite> select count(*) from edges;
> count(*)
> 9990
>
> Here is how create my test data -
>
> sqlite> .sch v_generate_nodes
> -- Generates an infinite series of x, 'nodex' records where x = 1, 2, 3 ...
> CREATE VIEW v_generate_nodes as with recursive rcte(id, description) as
> (select 1, 'node1' union all select id+1, 'node'||(id+1) from rcte) select
> * from rcte;
> sqlite> insert into nodes select from v_generate_nodes limit 10000;
>
> sqlite> .sch v_generate_edges
> -- Randomly generates edges between entries in the nodes table.
> ---Assumption : node ids are 1, 2, 3...n without gaps
> -- Each node will have 0 or 1 parents and 0, 1, 2, ... children
> CREATE VIEW v_generate_edges as with rcte(parent, child) as (select
> cast(abs(random())/9223372036854775808 as integer), 1 union all select
> cast(abs(random())/9223372036854775808*(child+1) as integer), child+1 from
> rcte where child <= (select count(*) from nodes) limit (select count(*)
> from nodes)) select * from rcte where parent>0;
> sqlite> insert into edges select * from v_generate_edges;
>
>
>
> On 5 January 2018 at 18:32, David Raymond <[hidden email]>
> wrote:
>
> > Hmm. Maybe try yours with union instead of union all? Though if there's
> > only 1 path between any pair of nodes that shouldn't make too much
> > difference. Otherwise I'm getting low on ideas.
> >
> > What're the record counts for nodes and edges?
> >
> > -----Original Message-----
> > From: sqlite-users [mailto:[hidden email]]
> > On Behalf Of Shane Dev
> > Sent: Thursday, January 04, 2018 5:20 PM
> > To: SQLite mailing list
> > Subject: Re: [sqlite] Efficient query to count number of leaves in a DAG.
> >
> > Hi David
> >
> > I recommend using longer names than 1 letter for your aliases, what you
> > > save in typing you lose a couple times over again when wondering what
> "r"
> > > is or why "t" has anything to do with "nodes"
> > >
> >
> > Fair enough. I tend to use shorts names to reduce the risk of typos. My
> > original node table was called "tasks". I tried to simplify the query for
> > this forum post but neglected to change the alias.
> >
> > >
> > > In your CTE you're doing a 3 table join. There's no need to include the
> > > nodes table in there at all, you can get the node ID from the edge
> table.
> > > ...union all select e.child, top from r, edges as e where e.parent =
> > r.id
> > > )...
> > >
> >
> > You're right in this case. My original node table "tasks" had more
> columns
> > which I wanted in the final result set.
> >
> > >
> > > The big thing though is in the where clause.
> > > where...and id not in (select parent from edges where parent = id)...
> > >
> >
> > That was a sloppy mistake, I changed it to  ..and id not in (select
> parent
> > from edges)... but it was still very slow
> >
> > >
> > > Old:
> > > sqlite> explain query plan with recursive r (id, top) as (select id, id
> > > from nodes union all select t.id, top from nodes as t, edges as e, r
> > > where e.parent = r.id and t.id = e.child) select top, count(*) from r
> > > where top != id and id not in (select parent from edges where parent =
> > id)
> > > group by top;
> > >
> >
> > CREATE VIEW v_count_leaves as with recursive r (id, top) as (select id,
> id
> > from nodes union all select t.id, top from nodes as t, edges as e, r
> where
> > e.parent = r.id and t.id = e.child) select top, count(*) from r where
> top
> > != id and id not in (select parent from edges where parent = id) group by
> > top;
> >
> > sqlite> select * from v_count_leaves where top=679;
> > top     count(*)
> > 679     2
> > Run Time: real 73.365 user 73.328125 sys 0.000000
> >
> >
> > > New:
> > > sqlite> explain query plan with recursive r (id, top) as (select id, id
> > > from nodes union all select e.child, top from edges as e, r where
> > e.parent
> > > = r.id) select top, count(*) from r where top != id and id not in
> > (select
> > > parent from edges) group by top;
> > > Now give your modified query a go and let me know how it compares to
> what
> > > I came up with.
> > >
> >
> > CREATE VIEW v_count_leaves_new as with recursive r (id, top) as (select
> id,
> > id from nodes union all select e.child, top from edges as e, r where
> > e.parent = r.id) select top, count(*) from r where top != id and id not
> in
> > (select parent from edges) group by top;
> >
> > sqlite> select * from v_count_leaves_new where top=679;
> > top     count(*)
> > 679     2
> > Run Time: real 45.099 user 45.093750 sys 0.000000
> >
> > faster, but about 8 times slower than your query -
> >
> > sqlite> select * from leafcounts where parent=679;
> > parent  leafCount
> > 679     2
> > Run Time: real 5.639 user 5.640625 sys 0.000000
> >
> > and that is without the reverseEdges index.
> >
> > I still don't understand why "leafcounts" is so much faster than
> > "v_count_leaves_new"
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: sqlite-users [mailto:sqlite-users-bounces@
> mailinglists.sqlite.org]
> > > On Behalf Of Shane Dev
> > > Sent: Wednesday, January 03, 2018 12:45 AM
> > > To: SQLite mailing list
> > > Subject: Re: [sqlite] Efficient query to count number of leaves in a
> DAG.
> > >
> > > Hi David,
> > >
> > > Nice work! your query is far quicker than mine- even without the
> > > reverseEdges index. I think you are right about the problem of
> > potentially
> > > double counting leaves. There weren't any multi-parent nodes in my test
> > > data so I didn't notice this mistake.
> > >
> > > Could you please explain why your query is so much faster?
> > >
> > > On 2 January 2018 at 17:50, David Raymond <[hidden email]>
> > > wrote:
> > >
> > > > I think you need a union there instead of a union all. Otherwise
> you're
> > > > double (or more) counting leaves where there is more than 1 path to
> get
> > > to
> > > > the leaf.
> > > >
> > > > I don't have a large dataset to test it on, but how about something
> > like:
> > > >
> > > > create table nodes
> > > > (
> > > >   id integer primary key,
> > > >   description text
> > > > );
> > > >
> > > > create table edges
> > > > (
> > > >   parent int not null references nodes,
> > > >   child int not null references nodes,
> > > >   primary key (parent, child),
> > > >   check (parent != child)
> > > > ) without rowid;
> > > > create index reverseEdges on edges (child, parent);
> > > >
> > > > create view leafCounts as with recursive
> > > > leaves (id) as (
> > > >   select nodes.id
> > > >   from nodes left outer join edges
> > > >   on nodes.id = edges.parent
> > > >   where edges.parent is null
> > > > ),
> > > > paths (parent, child) as (
> > > >   select parent, child from edges
> > > >   union
> > > >   select paths.parent, edges.child
> > > >   from paths inner join edges
> > > >   on paths.child = edges.parent
> > > > )
> > > > select parent, count(*) as leafCount
> > > > from paths
> > > > where child in leaves
> > > > group by parent;
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: sqlite-users [mailto:sqlite-users-bounces@
> > mailinglists.sqlite.org]
> > > > On Behalf Of Shane Dev
> > > > Sent: Monday, January 01, 2018 11:14 AM
> > > > To: SQLite mailing list
> > > > Subject: [sqlite] Efficient query to count number of leaves in a DAG.
> > > >
> > > > Hi,
> > > >
> > > > I want to the count the number of leaves (descendants without
> children)
> > > for
> > > > each node in a DAG
> > > >
> > > > DAG definition -
> > > >
> > > > CREATE TABLE nodes(id integer primary key, description text);
> > > > CREATE TABLE edges(parent not null references nodes, child not null
> > > > references nodes, primary key(parent, child));
> > > >
> > > > My query -
> > > >
> > > > CREATE VIEW v_count_leaves as with recursive r(id, top) as (
> > > > select id, id from nodes
> > > > union all
> > > > select t.id, top from nodes as t, edges as e, r where e.parent=r.id
> > and
> > > > t.id
> > > > =e.child)
> > > > select top, count(*) from r where top<>id and id not in (select
> parent
> > > from
> > > > edges where parent=id) group by top;
> > > >
> > > > It seems to work but is complex to understand and debug despite my
> aim
> > to
> > > > keep it simple as possible, but more importantly - it is very slow
> when
> > > > there are more than a few thousand nodes and edges.
> > > >
> > > > It there a more efficient (and ideally simpler) way?
> > > > _______________________________________________
> > > > sqlite-users mailing list
> > > > [hidden email]
> > > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > > > _______________________________________________
> > > > sqlite-users mailing list
> > > > [hidden email]
> > > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > > >
> > > _______________________________________________
> > > sqlite-users mailing list
> > > [hidden email]
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > > _______________________________________________
> > > sqlite-users mailing list
> > > [hidden email]
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > >
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users