slow join, fast subselect

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

slow join, fast subselect

Poor Yorick
I've used the following two test queries in a version of sqlite built against a
recent checkout of trunk, and also another recent version of sqlite.  a.ref is
indexed.  The subselect query is faster than the join query -- two orders of
magnitude faster on a larger dataset.  Is sqlite missing some easy optimisation
opportunity here?


select a.rowid
from a join b on a.rowid = b.rowid
where a.ref = $x


select a.rowid
from a,b
where a.ref = $x and a.rowid in (select rowid from b)


--
Poor Yorick

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: slow join, fast subselect

R Smith-2
On 2019/04/17 10:55 AM, Poor Yorick wrote:

> I've used the following two test queries in a version of sqlite built against a
> recent checkout of trunk, and also another recent version of sqlite.  a.ref is
> indexed.  The subselect query is faster than the join query -- two orders of
> magnitude faster on a larger dataset.  Is sqlite missing some easy optimisation
> opportunity here?
>
>
> select a.rowid
> from a join b on a.rowid = b.rowid
> where a.ref = $x
>
>
> select a.rowid
> from a,b
> where a.ref = $x and a.rowid in (select rowid from b)


These queries have vastly different meanings and as such must have
vastly different execution plans. I won't go into that because it's
obvious, so I will just mention the assumptions.

For the query engine to optimize this according to your suggestion (i.e.
for it to really do the second thing when you've asked for the first
thing) it has to assume the following:

- 1. You do not need any of the fields from b in any part of the query
(which IS the case here and can be deduced, but is an extremely small
likelihood for JOINs in general queries),
- 2. there are no special Collations on either of the columns (again,
this can be deduced here), and
- 3. it must assume that every b.rowid appears ONLY once in table b (I
know this can be assumed for rowid's, but is something that cannot be
easily known for any other field, where it could be true with or without
a unique constraint).

And, these are only the obvious outsider-viewpoint assumptions, I have
no insight in what other checks might be needed internally from the QP.
This means that this optimization carries a cost deficit in the general
case. i.e. Wasting CPU cycles "looking" for this optimization
opportunity in the general case, considering the high unlikelihood of
finding it viable, will slow down many more queries than it will help.

That said, this is the very kind of thing where we as engineering
programmers or database queryers could optimize by asking our questions
right, mind you - I am not advocating thinking for the Query Planner,
but do state your question optimally. If you really want to know how
many rabbits have been eaten by foxes, do not ask for a list of all
rabbit names alphabetically and exclude every one that was NOT eaten by
a fox, simply ask for the count of rabbits where eaten by fox = true.

In your above example you really wish to know all the a's which have an
entry in b. The first query asks to join and list all b's found for
every a (which works mathematically in this case by virtue of rowid
uniqueness, but isn't the real question and forces a lot of "join"
algorithm checking on the QP), the correct question is the second query:
Show every a which can also be found in b. It releases the QP of a lot
of responsibility and let's it follow a plan that is much faster.


Hope that makes sense :)

Ryan


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] slow join, fast subselect

Hick Gunter
In reply to this post by Poor Yorick
Try EXPLAIN QUERY PLAN <stmt> or even EXPLAIN <stmt> to see what is going on in each case.

-----Urspr√ľngliche Nachricht-----
Von: sqlite-users [mailto:[hidden email]] Im Auftrag von Poor Yorick
Gesendet: Mittwoch, 17. April 2019 10:56
An: [hidden email]
Betreff: [EXTERNAL] [sqlite] slow join, fast subselect

I've used the following two test queries in a version of sqlite built against a recent checkout of trunk, and also another recent version of sqlite.  a.ref is indexed.  The subselect query is faster than the join query -- two orders of magnitude faster on a larger dataset.  Is sqlite missing some easy optimisation opportunity here?


select a.rowid
from a join b on a.rowid = b.rowid
where a.ref = $x


select a.rowid
from a,b
where a.ref = $x and a.rowid in (select rowid from b)


--
Poor Yorick

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___________________________________________
 Gunter Hick | Software Engineer | Scientific Games International GmbH | Klitschgasse 2-4, A-1130 Vienna | FN 157284 a, HG Wien, DVR: 0430013 | (O) +43 1 80100 - 0

May be privileged. May be confidential. Please delete if not the addressee.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] slow join, fast subselect

Poor Yorick

On Wed, Apr 17, 2019 at 10:15:31AM +0000, Hick Gunter wrote:
> Try EXPLAIN QUERY PLAN <stmt> or even EXPLAIN <stmt> to see what is going on in each case.


I already have, of course.  The question is, how much effort would it be to get
sqlite choose the better query plan in the "join" case as well?

--
Poor Yorick
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: slow join, fast subselect

Poor Yorick
In reply to this post by R Smith-2
On Wed, Apr 17, 2019 at 11:36:18AM +0200, R Smith wrote:
> On 2019/04/17 10:55 AM, Poor Yorick wrote:
[SNIP]

>
> In your above example you really wish to know all the a's which have an
> entry in b. The first query asks to join and list all b's found for every a
> (which works mathematically in this case by virtue of rowid uniqueness, but
> isn't the real question and forces a lot of "join" algorithm checking on the
> QP), the correct question is the second query: Show every a which can also
> be found in b. It releases the QP of a lot of responsibility and let's it
> follow a plan that is much faster.
>
>
> Hope that makes sense :)
>
> Ryan
>

That's an apt and accessible description of the issue, but at the denotational
level the meanings of the queries are in fact identical under the conditions
you enumerated.  Ideally sqlite would notice and adjust its query plan
accordingly.  If the cost of doing so doesn't justify the effort, that could be
documented.  As good as the sqlite documentation is, it currently lacks this
sort of higher-level guidance.

--
Poor Yorick
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: slow join, fast subselect

R Smith-2
On 2019/04/17 1:23 PM, Poor Yorick wrote:
>
> That's an apt and accessible description of the issue, but at the denotational
> level the meanings of the queries are in fact identical under the conditions
> you enumerated.  Ideally sqlite would notice and adjust its query plan
> accordingly.

Ideally yes, but the heart of the matter, I would like to reiterate, is
that the query in question has to be just so for the optimization to
work, anything you change (using b.anything anywhere, having duplicate b
values or in any way not using rowids or at least unique columns of some
flavour, using a collation, etc.) would render the optimization
opportunity void. The opportunity hangs by the tiniest of threads, or
put another way: the decision tree + checks to arrive at this
optimization opportunity is very long, and I doubt a similar query is
encountered by the QP more than one in a few million times (with a
strong possibility that that should read "in a few billion times") ever
- and when it does come up, the programmer most probably would opt for
the latter example since it more clearly describes the plot. (The fact
that it is also faster is just lucky).

If on the other hand a situation presents itself where the obvious
better plot description (i.e. the second query) was /slower/  - then the
optimization (to rather do it like the faster first query) would be a
more worthy cause.

The fact that we can engineer a query that happens to be equivalent
mathematically and is slower does not sufficiently call for an effort to
improve the planner. In fact, we can engineer many such queries, and
these kinds of things come up quite regularly on the forum. To be sure,
they often do get optimized IF the slower query is the one that is
semantically more sensible and the effort to implement plus the added
code-weight and cpu cycles are justified by the gain in the general case.


>    If the cost of doing so doesn't justify the effort, that could be
> documented.  As good as the sqlite documentation is, it currently lacks this
> sort of higher-level guidance.

I agree, perhaps some general description might be useful, though it's
hard to imagine it mentioning specific scenarios. Consider that the
amount of semantically-different-but-functionally-equivalent-yet-slower
queries that can be engineered must be legion (the forum produces new
ones almost monthly). It's hard to fathom documenting all of them - plus
some of them disappear with improvements over time.

Maybe you could volunteer a paragraph of documentation that would have
adequately satisfied your question in this regard and in general - the
devs often do amend documentation based on suggestions here.


Cheers,
Ryan

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: slow join, fast subselect

R Smith-2
In reply to this post by Poor Yorick
Also, let me just add (in case it sounded different) - We definitely do
not wish to document the "why an optimization opportunity might not be
feasible" in any way that would discourage anyone from submitting such a
possible optimization opportunity. That would work against the axiomatic
premise of this forum, sqlite and open source communities in general.


On 2019/04/17 1:23 PM, Poor Yorick wrote:
>
>
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] slow join, fast subselect

David Raymond
In reply to this post by Poor Yorick
Would you post what those explain query plans results are? All the other replies not withstanding I'm still curious as to why #2 would be faster (assuming "rowid" is indeed the actual rowid anyway)

Also, is that a typo in #2, if you're not using b, why would you include it in the from clause? Wouldn't that introduce a whole bunch of duplicates? As in a copy of a.rowid for every single record in b? (Maybe my brain just hasn't finished warming up this morning)

#1
select a.rowid
from a join b on a.rowid = b.rowid
where a.ref = $x

#2
select a.rowid
from a,b
where a.ref = $x and a.rowid in (select rowid from b)



-----Original Message-----
From: sqlite-users [mailto:[hidden email]] On Behalf Of Poor Yorick
Sent: Wednesday, April 17, 2019 6:32 AM
To: SQLite mailing list
Subject: Re: [sqlite] [EXTERNAL] slow join, fast subselect


On Wed, Apr 17, 2019 at 10:15:31AM +0000, Hick Gunter wrote:
> Try EXPLAIN QUERY PLAN <stmt> or even EXPLAIN <stmt> to see what is going on in each case.


I already have, of course.  The question is, how much effort would it be to get
sqlite choose the better query plan in the "join" case as well?

--
Poor Yorick
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: slow join, fast subselect

Radovan Antloga
In reply to this post by Poor Yorick
Second select is not correct. You should have "from a" only not "from a,b".

R.Antloga

On 17.04.2019 10:55, Poor Yorick wrote:

> I've used the following two test queries in a version of sqlite built against a
> recent checkout of trunk, and also another recent version of sqlite.  a.ref is
> indexed.  The subselect query is faster than the join query -- two orders of
> magnitude faster on a larger dataset.  Is sqlite missing some easy optimisation
> opportunity here?
>
>
> select a.rowid
> from a join b on a.rowid = b.rowid
> where a.ref = $x
>
>
> select a.rowid
> from a,b
> where a.ref = $x and a.rowid in (select rowid from b)
>
>

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] slow join, fast subselect

Poor Yorick
In reply to this post by David Raymond
On Wed, Apr 17, 2019 at 01:24:11PM +0000, David Raymond wrote:

> Would you post what those explain query plans results are? All the other replies not withstanding I'm still curious as to why #2 would be faster (assuming "rowid" is indeed the actual rowid anyway)
>
> Also, is that a typo in #2, if you're not using b, why would you include it in the from clause? Wouldn't that introduce a whole bunch of duplicates? As in a copy of a.rowid for every single record in b? (Maybe my brain just hasn't finished warming up this morning)
>
> #1
> select a.rowid
> from a join b on a.rowid = b.rowid
> where a.ref = $x
>
> #2
> select a.rowid
> from a,b
> where a.ref = $x and a.rowid in (select rowid from b)
>
>

3 0 0 {SEARCH TABLE a USING COVERING INDEX idx_ref (ref=?)}
8 0 0 {SEARCH TABLE b USING INTEGER PRIMARY KEY (rowid=?)}

2 0 0 {SEARCH TABLE a USING COVERING INDEX idx_ref (ref=? AND rowid=?)}
7 0 0 {USING ROWID SEARCH ON TABLE b FOR IN-OPERATOR}

--
Poor Yorick
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] slow join, fast subselect

Keith Medcalf

Your made up plans are intriguing.  The plan you show for the latter query omit to join a and b.  Are you just making things up?

sqlite> select a.rowid from a, b where a.ref=7 and a.rowid in (select rowid from b);
QUERY PLAN
|--SEARCH TABLE a USING COVERING INDEX aa (ref=? AND rowid=?) (~8 rows)
|--USING ROWID SEARCH ON TABLE b FOR IN-OPERATOR
`--SCAN TABLE b (~1048576 rows)

However, assuming that you typo'd the query (please learn about this new-fangled thing called cut-n-paste to avoid that error in the future) you get this if you specify "FROM a" rather than "FROM a, b".

sqlite> select rowid from a where ref=7 and rowid in (select rowid from b);
QUERY PLAN
|--SEARCH TABLE a USING COVERING INDEX aa (ref=? AND rowid=?) (~8 rows)
`--USING ROWID SEARCH ON TABLE b FOR IN-OPERATOR

In any case, upon closer examination the VDBE code for the two queries "solves" the query quite differently:

select a.rowid from a, b where a.rowid == b.rowid and a.ref == ?

places the index constrained by ref = ? in the outer loop, and then if a.rowid exists in table b, returns a result row.  This means that the number of times the outer loop is executed is dependent on the constraint ref == ? on the index.  Since there can be no possible statistic for the a.rowid == b.rowid condition, table a (or rather the index) will and must always be in the outer loop, even if the constraint ref == ? selects all rows in table a and there is only one row in table b ...

select rowid from a where ref == ? and rowid in (select rowid from b)

however scans table b in the outer loop and then does an index lookup on (ref, rowid) into the index and returns a result row whenever it is found.  This means that the number of times the outer loop is executed is dependent on the number of rows in table b (only).  Since the QP can determine the number of rows in a where ref == 7 and the number of rows in b, the plan will probably be optimized by having statistics available.  Do you have statistics available?  Or do you just by happenstance have less rows in b than in a constrained by ref == ?.  Do you have statistics?  Have you run ANALYZE?

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.


>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>[hidden email]] On Behalf Of Poor Yorick
>Sent: Wednesday, 17 April, 2019 08:48
>To: SQLite mailing list
>Subject: Re: [sqlite] [EXTERNAL] slow join, fast subselect
>
>On Wed, Apr 17, 2019 at 01:24:11PM +0000, David Raymond wrote:
>> Would you post what those explain query plans results are? All the
>other replies not withstanding I'm still curious as to why #2 would
>be faster (assuming "rowid" is indeed the actual rowid anyway)
>>
>> Also, is that a typo in #2, if you're not using b, why would you
>include it in the from clause? Wouldn't that introduce a whole bunch
>of duplicates? As in a copy of a.rowid for every single record in b?
>(Maybe my brain just hasn't finished warming up this morning)
>>
>> #1
>> select a.rowid
>> from a join b on a.rowid = b.rowid
>> where a.ref = $x
>>
>> #2
>> select a.rowid
>> from a,b
>> where a.ref = $x and a.rowid in (select rowid from b)
>>
>>
>
>3 0 0 {SEARCH TABLE a USING COVERING INDEX idx_ref (ref=?)}
>8 0 0 {SEARCH TABLE b USING INTEGER PRIMARY KEY (rowid=?)}
>
>2 0 0 {SEARCH TABLE a USING COVERING INDEX idx_ref (ref=? AND
>rowid=?)}
>7 0 0 {USING ROWID SEARCH ON TABLE b FOR IN-OPERATOR}
>
>--
>Poor Yorick
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] slow join, fast subselect

Poor Yorick
On Wed, Apr 17, 2019 at 11:43:13AM -0600, Keith Medcalf wrote:
>
> Your made up plans are intriguing.  The plan you show for the latter query omit to join a and b.  Are you just making things up?

The query plans were cut and pasted from the terminal.  It's easy enough to
deduce where these plans came from:  As someone else pointed out, the ",b" in
the second query shouldn't be there, so I removed it before generating the
query plans.  That step of the query plan is irrelevant anyway.  The point is
that in the subselect variant the query the planner chooses this
                                                                                                                                                                                                                                 
  7 0 0 {USING ROWID SEARCH ON TABLE b FOR IN-OPERATOR}                                                                                                                                                                          

which, given the conditions, is a far better choise than what the planner
chooses in the "join" variant:
                                                                                                                                                                                                                                 
  8 0 0 {SEARCH TABLE b USING INTEGER PRIMARY KEY (rowid=?)}                                                                                                                                                                      

It would be easy enough again for the planner to deduce this, but as Ryan
Smith described, may not be worth doing in the general case.  I don't know.
I'm just reporting in from the field.


--
Poor Yorick

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users