Can select * from table where not exists (subquery) be optimized?

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

Can select * from table where not exists (subquery) be optimized?

Shane Dev
Hello,

I have a directed acyclic graph defined as follows -

sqlite> .sch
CREATE TABLE nodes(id integer primary key, description text);
CREATE TABLE edges(parent not null references nodes, child not null
references nodes, primary key(parent, child));

Now I want to find the "roots" of the graph - i.e nodes which are not
children of other nodes -

select * from nodes where not exists (select * from edges where child=
nodes.id);

This works but is very slow when there are a million nodes and edges.

Looking at the query plan -

sqlite> explain query plan select * from nodes where not exists (select *
from edges where child=nodes.id);
selectid        order   from    detail
0       0       0       SCAN TABLE nodes
0       0       0       EXECUTE CORRELATED SCALAR SUBQUERY 1
1       0       0       SCAN TABLE edges

I thought an index on edges(child) might help

sqlite> CREATE INDEX iedges on edges(child);

but it didn't -

sqlite> explain query plan select * from nodes where not exists (select *
from edges where child=nodes.id);
selectid        order   from    detail
0       0       0       SCAN TABLE nodes
0       0       0       EXECUTE CORRELATED SCALAR SUBQUERY 1
1       0       0       SCAN TABLE edges

Is there any way to speed it up?
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Can select * from table where not exists (subquery) be optimized?

petern
This query will use the index you proposed.

SELECT * FROM nodes NATURAL JOIN (SELECT parent AS id FROM edges WHERE
parent NOT IN (SELECT child FROM edges));

Peter

On Sun, Dec 31, 2017 at 6:14 PM, Shane Dev <[hidden email]> wrote:

> Hello,
>
> I have a directed acyclic graph defined as follows -
>
> sqlite> .sch
> CREATE TABLE nodes(id integer primary key, description text);
> CREATE TABLE edges(parent not null references nodes, child not null
> references nodes, primary key(parent, child));
>
> Now I want to find the "roots" of the graph - i.e nodes which are not
> children of other nodes -
>
> select * from nodes where not exists (select * from edges where child=
> nodes.id);
>
> This works but is very slow when there are a million nodes and edges.
>
> Looking at the query plan -
>
> sqlite> explain query plan select * from nodes where not exists (select *
> from edges where child=nodes.id);
> selectid        order   from    detail
> 0       0       0       SCAN TABLE nodes
> 0       0       0       EXECUTE CORRELATED SCALAR SUBQUERY 1
> 1       0       0       SCAN TABLE edges
>
> I thought an index on edges(child) might help
>
> sqlite> CREATE INDEX iedges on edges(child);
>
> but it didn't -
>
> sqlite> explain query plan select * from nodes where not exists (select *
> from edges where child=nodes.id);
> selectid        order   from    detail
> 0       0       0       SCAN TABLE nodes
> 0       0       0       EXECUTE CORRELATED SCALAR SUBQUERY 1
> 1       0       0       SCAN TABLE edges
>
> Is there any way to speed it up?
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Can select * from table where not exists (subquery) be optimized?

Clemens Ladisch
In reply to this post by Shane Dev
Shane Dev wrote:
> CREATE TABLE nodes(id integer primary key, description text);
> CREATE TABLE edges(parent not null references nodes, child not null references nodes, primary key(parent, child));
>
> select * from nodes where not exists (select * from edges where child=nodes.id);
>
> This works but is very slow when there are a million nodes and edges.
> I thought an index on edges(child) might help
> but it didn't

The index has the wrong affinity; the edges.child column must be integer.

The following query would be simpler, and does not need the index (because SQLite
always creates a temporary index for the lookup anyway):

  select * from nodes where id not in (select child from edges);


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Can select * from table where not exists (subquery) be optimized?

Luuk
In reply to this post by Shane Dev
On 01-01-18 03:14, Shane Dev wrote:

> Hello,
>
> I have a directed acyclic graph defined as follows -
>
> sqlite> .sch
> CREATE TABLE nodes(id integer primary key, description text);
> CREATE TABLE edges(parent not null references nodes, child not null
> references nodes, primary key(parent, child));
>
> Now I want to find the "roots" of the graph - i.e nodes which are not
> children of other nodes -
>
> select * from nodes where not exists (select * from edges where child=
> nodes.id);
>
> This works but is very slow when there are a million nodes and edges.

Changing this to:

select * from nodes where not exists (select 1 from edges where child=
nodes.id);

saved in my test about 10% of time

> Is there any way to speed it up?

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Can select * from table where not exists (subquery) be optimized?

Luuk


On 01-01-18 12:18, Luuk wrote:

> On 01-01-18 03:14, Shane Dev wrote:
>> Hello,
>>
>> I have a directed acyclic graph defined as follows -
>>
>> sqlite> .sch
>> CREATE TABLE nodes(id integer primary key, description text);
>> CREATE TABLE edges(parent not null references nodes, child not null
>> references nodes, primary key(parent, child));
>>
>> Now I want to find the "roots" of the graph - i.e nodes which are not
>> children of other nodes -
>>
>> select * from nodes where not exists (select * from edges where child=
>> nodes.id);
>>
>> This works but is very slow when there are a million nodes and edges.
> Changing this to:
>
> select * from nodes where not exists (select 1 from edges where child=
> nodes.id);
>
> saved in my test about 10% of time
>
>> Is there any way to speed it up?

Above is the same as what Clemens already mentioned ...

sqlite> explain query plan select * from nodes where not exists (select
* from edges where child=nodes.id);
0|0|0|SCAN TABLE nodes
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SCAN TABLE edges
Run Time: real 0.009 user 0.000000 sys 0.000000
sqlite>
sqlite> explain query plan select * from nodes where not exists (select
1 from edges where child=nodes.id);
0|0|0|SCAN TABLE nodes
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 1
1|0|0|SCAN TABLE edges USING COVERING INDEX iedges
Run Time: real 0.003 user 0.000000 sys 0.000000
sqlite>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Can select * from table where not exists (subquery) be optimized?

Shane Dev
In reply to this post by Clemens Ladisch
Hi Clemens,

Your query is much faster on my system - thanks!

Apart from visual inspection and testing, is there anyway to be sure your
query selects the same results as my query?

From https://sqlite.org/queryplanner.html "When programming in SQL you tell
the system what you want to compute, not how to compute it". Is this an
exception to the rule where the query planner must be told how to compute
the result?




On 1 January 2018 at 10:58, Clemens Ladisch <[hidden email]> wrote:

> Shane Dev wrote:
> > CREATE TABLE nodes(id integer primary key, description text);
> > CREATE TABLE edges(parent not null references nodes, child not null
> references nodes, primary key(parent, child));
> >
> > select * from nodes where not exists (select * from edges where child=
> nodes.id);
> >
> > This works but is very slow when there are a million nodes and edges.
> > I thought an index on edges(child) might help
> > but it didn't
>
> The index has the wrong affinity; the edges.child column must be integer.
>
> The following query would be simpler, and does not need the index (because
> SQLite
> always creates a temporary index for the lookup anyway):
>
>   select * from nodes where id not in (select child from edges);
>
>
> Regards,
> Clemens
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Can select * from table where not exists (subquery) be optimized?

Clemens Ladisch
In reply to this post by Luuk
Luuk wrote:
> On 01-01-18 03:14, Shane Dev wrote:
>> select * from nodes where not exists (select * from edges where child=nodes.id);
>
> Changing this to:
>
>  select * from nodes where not exists (select 1 from edges where child=nodes.id);
>
> saved in my test about 10% of time

Then I have to doubt your test; the generated code (see the EXPLAIN
output) is exactly the same.


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Can select * from table where not exists (subquery) be optimized?

Clemens Ladisch
In reply to this post by Shane Dev
Shane Dev wrote:
> Apart from visual inspection and testing, is there anyway to be sure your
> query selects the same results as my query?

Can I interest you in things like relational algebra or tuple calculus?
;-)

>>> select * from nodes where not exists (select * from edges where child=nodes.id);
>>  select * from nodes where id not in (select child from edges);
>
> From https://sqlite.org/queryplanner.html "When programming in SQL you tell
> the system what you want to compute, not how to compute it". Is this an
> exception to the rule where the query planner must be told how to compute
> the result?

Arguable, the second query is on a higher abstraction level, i.e., the
_first_ one tells the system how to compute it.  (AFAIK SQLite is able
to transform an IN lookup into a correlated subquery, if worthwhile, but
not in the opposite direction.)


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Can select * from table where not exists (subquery) be optimized?

E.Pasma
In reply to this post by Shane Dev
Shane Dev wrote:

> Hi Clemens,
>
> Your query is much faster on my system - thanks!
>
> Apart from visual inspection and testing, is there anyway to be sure  
> your
> query selects the same results as my query?
>
> From https://sqlite.org/queryplanner.html "When programming in SQL  
> you tell
> the system what you want to compute, not how to compute it". Is this  
> an
> exception to the rule where the query planner must be told how to  
> compute
> the result?

>>> select * from nodes where not exists (select * from edges where  
>>> child=nodes.id);
>>
>>> select * from nodes where not exists (select 1 from edges where  
>>> child=nodes.id);
>>
>>  select * from nodes where id not in (select child from edges);


The first two queries look more 'procedural' than the last. So this  
may confirm "When programming in SQL you tell the system what you want  
to compute, not how to compute it"

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Can select * from table where not exists (subquery) be optimized?

E.Pasma
In reply to this post by Clemens Ladisch
Clemens Ladisch wrote:

> Luuk wrote:
>> On 01-01-18 03:14, Shane Dev wrote:
>>> select * from nodes where not exists (select * from edges where  
>>> child=nodes.id);
>>
>> Changing this to:
>>
>> select * from nodes where not exists (select 1 from edges where  
>> child=nodes.id);
>>
>> saved in my test about 10% of time
>
> Then I have to doubt your test; the generated code (see the EXPLAIN
> output) is exactly the same.


Yes, the execution plans are the same. Still the second is faster!

Another query that uses EXCEPT instead of NOT IN and that has yet  
another execution plan:

select id from nodes except select child from edges;
 
             
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Can select * from table where not exists (subquery) be optimized?

Luuk
On 01-01-18 16:52, E.Pasma wrote:

> Clemens Ladisch wrote:
>
>> Luuk wrote:
>>> On 01-01-18 03:14, Shane Dev wrote:
>>>> select * from nodes where not exists (select * from edges where
>>>> child=nodes.id);
>>>
>>> Changing this to:
>>>
>>> select * from nodes where not exists (select 1 from edges where
>>> child=nodes.id);
>>>
>>> saved in my test about 10% of time
>>
>> Then I have to doubt your test; the generated code (see the EXPLAIN
>> output) is exactly the same.
>
>

the 3rd step of EXPLAIN changed from:

1|0|0|SCAN TABLE edges
to:
1|0|0|SCAN TABLE edges USING COVERING INDEX iedges


luuk@opensuse:~/tmp> sqlite3 nodes.db
SQLite version 3.8.10.2 2015-05-20 18:17:19
Enter ".help" for usage hints.
sqlite> .timer on
sqlite> select * from nodes where not exists (select * from edges where
child=nodes.id) and id<100000;
Run Time: real 254.400 user 254.007998 sys 0.119976
sqlite> select * from nodes where not exists (select * from edges where
child=nodes.id) and id<100000;
Run Time: real 234.342 user 233.348752 sys 0.491637
sqlite> select * from nodes where not exists (select 1 from edges where
child=nodes.id) and id<100000;
Run Time: real 219.904 user 218.968920 sys 0.651569
sqlite> select * from nodes where not exists (select 1 from edges where
child=nodes.id) and id<100000;
Run Time: real 219.929 user 219.562780 sys 0.127948
sqlite> select * from nodes where not exists (select * from edges where
child=nodes.id) and id<100000;
Run Time: real 236.423 user 234.648774 sys 1.622957




_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Can select * from table where not exists (subquery) be optimized?

Clemens Ladisch
Luuk wrote:

>> Clemens Ladisch wrote:
>>> Luuk wrote:
>>>> On 01-01-18 03:14, Shane Dev wrote:
>>>>> select * from nodes where not exists (select * from edges where
>>>>> child=nodes.id);
>>>>
>>>> Changing this to:
>>>>
>>>> select * from nodes where not exists (select 1 from edges where
>>>> child=nodes.id);
>>>>
>>>> saved in my test about 10% of time
>>>
>>> Then I have to doubt your test; the generated code (see the EXPLAIN
>>> output) is exactly the same.
>
> the 3rd step of EXPLAIN changed from:
>
> 1|0|0|SCAN TABLE edges
> to:
> 1|0|0|SCAN TABLE edges USING COVERING INDEX iedges

Indeed it does; my own test was without the index.


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users