Quantcast

Query plan gone haywire lays waste to my library's performance

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Query plan gone haywire lays waste to my library's performance

Jens Alfke-2
I’ve got an urgent-to-me issue wherein a customer says that our library has slowed down by several orders of magnitude in iOS 10.3 compared to iOS 9. I ran a test case they provided and found that some queries that should be fast are taking a very long time (~500ms) and generating huge numbers of reads. Using EXPLAIN QUERY PLAN I found that the optimizer has generated a horrifically bad plan, essentially an inside-out join, that results in linear scans of two large tables. (Presumably this is due to some change in the optimizer between the older and newer versions of SQLite in iOS.)

Now I need to figure out how to fix this ASAP. Our library periodically runs ANALYZE, so sqlite might not even start out using the bad query plan; it might happen only with large databases or with some specific data sets.

The query, with some extraneous details removed, looks like:
        SELECT docs.doc_id, revs.sequence, docs.docid, rev.revid, rev.json, […] FROM revs, docs
        WHERE revs.sequence>? AND revs.current!=0 AND revs.doc_id = docs.doc_id
        ORDER BY revs.doc_id, revs.deleted, revs.revid DESC

The “revs” table has “sequence” as its integer primary key, and a foreign key “doc_id” referencing the “docs” table. (Plus many other columns.)
The “docs” table has “doc_id” as its primary key and “docid” as a text column.

Basically this query is meant to find the latest revs, including their “docid” strings joined from the “docs” table. It should just be using the “revs” table’s primary index starting from the “?” parameter, then looking up the corresponding doc for each revision. Instead it does this:

selectid = 0
   order = 0
    from = 1
  detail = SCAN TABLE docs USING COVERING INDEX docs_docid

selectid = 0
   order = 1
    from = 0
  detail = SEARCH TABLE revs USING INDEX revs_current (doc_id=?)

selectid = 0
   order = 0
    from = 0
  detail = USE TEMP B-TREE FOR ORDER BY

It’s scanning through all 50,000+ rows in “docs”, and then for each document looking up all the rows in “revs” that reference it (“revs” has 175,000 rows), and comparing their sequences to the parameter.

(I’ve pasted the relevant parts of the db schema at the bottom of this email if you want to refer to them.)

My first priority is to find some way to restate this query such that it will always use the revs.sequence primary key index.

A lower priority is to know whether there’s a valid reason the query planner came up with this, or if it’s a bug in the optimizer or the ANALYZE command? In the planner’s defense, it probably doesn’t know that this query is almost always run with the starting sequence (“?”) relatively close to the maximum sequence.  If low numbers were frequently given for “?”, it’s possible that its current plan would be more efficient…

Thanks,

—Jens

CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, docid TEXT UNIQUE NOT NULL, expiry_timestamp INTEGER);
CREATE INDEX docs_docid ON docs(docid);
CREATE TABLE revs (sequence INTEGER PRIMARY KEY AUTOINCREMENT, doc_id INTEGER NOT NULL REFERENCES docs(doc_id) ON DELETE CASCADE, revid TEXT NOT NULL COLLATE REVID, parent INTEGER REFERENCES revs(sequence) ON DELETE SET NULL,  current BOOLEAN,                deleted BOOLEAN DEFAULT 0, json BLOB, no_attachments BOOLEAN, doc_type TEXT, UNIQUE (doc_id, revid));
CREATE INDEX revs_parent ON revs(parent);
CREATE INDEX revs_by_docid_revid ON revs(doc_id, revid desc, current, deleted);
CREATE INDEX revs_current ON revs(doc_id, current desc, deleted, revid desc);

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Query plan gone haywire lays waste to my library's performance

Jens Alfke-2
Sorry for the quick follow-up, but I’ve just run a quick test that does an EXPLAIN QUERY PLAN for every statement that gets compiled, in an empty database, and confirmed that the query plan comes out wrong right from the start:

SELECT revs.doc_id, sequence, docid, revid, json, deleted FROM revs, docs WHERE sequence>? AND current!=0 AND deleted=0 AND revs.doc_id = docs.doc_id ORDER BY revs.doc_id, deleted, revid DESC
0 0 0 SCAN TABLE revs USING INDEX revs_by_docid_revid
0 1 1 SEARCH TABLE docs USING INTEGER PRIMARY KEY (rowid=?)

Another query gets it wrong too:

SELECT sequence, revs.doc_id, docid, revid, deleted  FROM revs, docs WHERE sequence > ? AND current=1 AND revs.doc_id = docs.doc_id ORDER BY revs.doc_id, deleted, revid DESC

0 0 0 SCAN TABLE revs USING COVERING INDEX revs_current
0 1 1 SEARCH TABLE docs USING INTEGER PRIMARY KEY (rowid=?)
0 0 0 USE TEMP B-TREE FOR RIGHT PART OF ORDER BY

This is with SQLite 3.16.0.

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Query plan gone haywire lays waste to my library's performance

Simon Slavin-3
On 28 Apr 2017, at 9:49pm, Jens Alfke <[hidden email]> wrote:

> SELECT revs.doc_id, sequence, docid, revid, json, deleted FROM revs, docs WHERE sequence>? AND current!=0 AND deleted=0 AND revs.doc_id = docs.doc_id ORDER BY revs.doc_id, deleted, revid DESC
> 0 0 0 SCAN TABLE revs USING INDEX revs_by_docid_revid
> 0 1 1 SEARCH TABLE docs USING INTEGER PRIMARY KEY (rowid=?)
>
> Another query gets it wrong too:
>
> SELECT sequence, revs.doc_id, docid, revid, deleted  FROM revs, docs WHERE sequence > ? AND current=1 AND revs.doc_id = docs.doc_id ORDER BY revs.doc_id, deleted, revid DESC
>
> 0 0 0 SCAN TABLE revs USING COVERING INDEX revs_current
> 0 1 1 SEARCH TABLE docs USING INTEGER PRIMARY KEY (rowid=?)
> 0 0 0 USE TEMP B-TREE FOR RIGHT PART OF ORDER BY

What indexes do you have on these two tables ?  I can’t recommend one without knowing which columns belong to which tables.

Once you have the indexes you want, run ANALYZE.  Or rather, make sure that the customers’ database files have had ANALYZE run on them while they contain typical data.

First query rewritten for clarity:

SELECT revs.doc_id, sequence, docid, revid, json, deleted FROM revs, docs
        WHERE sequence>? AND current!=0 AND deleted=0
        AND revs.doc_id = docs.doc_id
        ORDER BY revs.doc_id, deleted, revid DESC

Should be more like

SELECT revs.doc_id, sequence, docid, revid, json, deleted FROM revs
        JOIN docs ON docs.doc_id = revs.doc_id
        WHERE sequence>? AND current!=0 AND deleted=0
        ORDER BY revs.doc_id, deleted, revid DESC

Second query should be more like:

SELECT revs.doc_id, sequence, docid, revid, deleted FROM revs
        JOIN docs ON revs.doc_id = docs.doc_id
        WHERE sequence > ? AND current=1
        ORDER BY revs.doc_id, deleted, revid DESC
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Query plan gone haywire lays waste to my library's performance

Jens Alfke-2

> On Apr 28, 2017, at 2:00 PM, Simon Slavin <[hidden email]> wrote:
>
> What indexes do you have on these two tables ?  I can’t recommend one without knowing which columns belong to which tables.

I showed them at the end of my first message.

> First query rewritten for clarity:
> […] Should be more like
>
> SELECT revs.doc_id, sequence, docid, revid, json, deleted FROM revs
> JOIN docs ON docs.doc_id = revs.doc_id
> WHERE sequence>? AND current!=0 AND deleted=0
> ORDER BY revs.doc_id, deleted, revid DESC

The only difference is the explicit JOIN statement. I was under the impression that using this, vs. the way I wrote it, is a matter of taste that doesn’t affect the execution of the query. (I just tried changing some queries to use JOIN syntax and it didn’t affect their query plans.)

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Query plan gone haywire lays waste to my library's performance

Simon Slavin-3

On 28 Apr 2017, at 10:23pm, Jens Alfke <[hidden email]> wrote:

> On Apr 28, 2017, at 2:00 PM, Simon Slavin <[hidden email]> wrote:
>
>> What indexes do you have on these two tables ?  I can’t recommend one without knowing which columns belong to which tables.
>
> I showed them at the end of my first message.

Sorry, I missed that.

> CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, docid TEXT UNIQUE NOT NULL, expiry_timestamp INTEGER);
> CREATE INDEX docs_docid ON docs(docid);

> CREATE TABLE revs (sequence INTEGER PRIMARY KEY AUTOINCREMENT, doc_id INTEGER NOT NULL REFERENCES docs(doc_id) ON DELETE CASCADE, revid TEXT NOT NULL COLLATE REVID, parent INTEGER REFERENCES revs(sequence) ON DELETE SET NULL,  current BOOLEAN,                deleted BOOLEAN DEFAULT 0, json BLOB, no_attachments BOOLEAN, doc_type TEXT, UNIQUE (doc_id, revid));
> CREATE INDEX revs_parent ON revs(parent);
> CREATE INDEX revs_by_docid_revid ON revs(doc_id, revid desc, current, deleted);
> CREATE INDEX revs_current ON revs(doc_id, current desc, deleted, revid desc);

This search

SELECT revs.doc_id, sequence, docid, revid, json, deleted FROM revs
        JOIN docs ON docs.doc_id = revs.doc_id
        WHERE revs.sequence>? AND revs.current!=0 AND revs.deleted=0
        ORDER BY revs.doc_id, revs.deleted, revs.revid DESC

can be improved if you know that revs.current is never negative.  If that’s the case then rephrase it

SELECT revs.doc_id, sequence, docid, revid, json, deleted FROM revs
        JOIN docs ON docs.doc_id = revs.doc_id
        WHERE revs.sequence>? AND revs.current>0 AND revs.deleted=0
        ORDER BY revs.doc_id, revs.deleted, revs.revid DESC

Either way, the search could benefit from a better index on revs.  Depending on your data it would be one of these four:

CREATE INDEX revs_xxxx1 ON revs(deleted,current,sequence,doc_id,revid);
CREATE INDEX revs_xxxx2 ON revs(deleted,current,doc_id,sequence,revid);
CREATE INDEX revs_xxxx3 ON revs(deleted,sequence,current,doc_id,revid);
CREATE INDEX revs_xxxx4 ON revs(deleted,sequence,doc_id,current,revid);

Create those four in a database with typical data in it.  Then do ANALYZE.  Then run the query and see if time has improved.  Once you see which index SQLite decided to use, you can delete the other ones.

> The only difference is the explicit JOIN statement. I was under the impression that using this, vs. the way I wrote it, is a matter of taste that doesn’t affect the execution of the query.


SQLite computes two-table searches using nested loops.  Providing the JOIN you want tells SQLite which order you think the loops should be nested in.  Expressing the searches with JOINs is also helps me figure out what SQLite is actually doing.  Before I did that it wasn’t obvious to me how little the table "docs" mattered for this query.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Query plan gone haywire lays waste to my library's performance

Igor Tandetnik-2
In reply to this post by Jens Alfke-2
On 4/28/2017 4:37 PM, Jens Alfke wrote:
> CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, docid TEXT UNIQUE NOT NULL, expiry_timestamp INTEGER);
> CREATE INDEX docs_docid ON docs(docid);

For the record, this index is redundant. There's already an
automatically created index on docs(docid), thanks to UNIQUE clause.
This might be what confuses SQLite. See whether the behavior changes if
you drop it.
--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Query plan gone haywire lays waste to my library's performance

Keith Medcalf
In reply to this post by Simon Slavin-3

The syntactic sugar of the ON clause does nothing for equijoins.  Only for outer joins.  Please RTFM:


See https://sqlite.org/queryplanner.html
And https://sqlite.org/optoverview.html
And https://sqlite.org/queryplanner-ng.html


--
˙uʍop-ǝpısdn sı ɹoʇıuoɯ ɹnoʎ 'sıɥʇ pɐǝɹ uɐɔ noʎ ɟı

> -----Original Message-----
> From: sqlite-users [mailto:[hidden email]]
> On Behalf Of Simon Slavin
> Sent: Friday, 28 April, 2017 15:00
> To: SQLite mailing list
> Subject: Re: [sqlite] Query plan gone haywire lays waste to my library's
> performance
>
> On 28 Apr 2017, at 9:49pm, Jens Alfke <[hidden email]> wrote:
>
> > SELECT revs.doc_id, sequence, docid, revid, json, deleted FROM revs,
> docs WHERE sequence>? AND current!=0 AND deleted=0 AND revs.doc_id =
> docs.doc_id ORDER BY revs.doc_id, deleted, revid DESC
> > 0 0 0 SCAN TABLE revs USING INDEX revs_by_docid_revid
> > 0 1 1 SEARCH TABLE docs USING INTEGER PRIMARY KEY (rowid=?)
> >
> > Another query gets it wrong too:
> >
> > SELECT sequence, revs.doc_id, docid, revid, deleted  FROM revs, docs
> WHERE sequence > ? AND current=1 AND revs.doc_id = docs.doc_id ORDER BY
> revs.doc_id, deleted, revid DESC
> >
> > 0 0 0 SCAN TABLE revs USING COVERING INDEX revs_current
> > 0 1 1 SEARCH TABLE docs USING INTEGER PRIMARY KEY (rowid=?)
> > 0 0 0 USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
>
> What indexes do you have on these two tables ?  I can’t recommend one
> without knowing which columns belong to which tables.
>
> Once you have the indexes you want, run ANALYZE.  Or rather, make sure
> that the customers’ database files have had ANALYZE run on them while they
> contain typical data.
>
> First query rewritten for clarity:
>
> SELECT revs.doc_id, sequence, docid, revid, json, deleted FROM revs, docs
> WHERE sequence>? AND current!=0 AND deleted=0
> AND revs.doc_id = docs.doc_id
> ORDER BY revs.doc_id, deleted, revid DESC
>
> Should be more like
>
> SELECT revs.doc_id, sequence, docid, revid, json, deleted FROM revs
> JOIN docs ON docs.doc_id = revs.doc_id
> WHERE sequence>? AND current!=0 AND deleted=0
> ORDER BY revs.doc_id, deleted, revid DESC
>
> Second query should be more like:
>
> SELECT revs.doc_id, sequence, docid, revid, deleted FROM revs
> JOIN docs ON revs.doc_id = docs.doc_id
> WHERE sequence > ? AND current=1
> ORDER BY revs.doc_id, deleted, revid DESC
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Query plan gone haywire lays waste to my library's performance

Keith Medcalf
In reply to this post by Simon Slavin-3
On Friday, 28 April, 2017 15:55, Simon Slavin <[hidden email]> wrote:

> > The only difference is the explicit JOIN statement. I was under the
> > impression that using this, vs. the way I wrote it, is a matter of taste
> > that doesn’t affect the execution of the query.
 
> SQLite computes two-table searches using nested loops.  Providing the JOIN
> you want tells SQLite which order you think the loops should be nested in.
> Expressing the searches with JOINs is also helps me figure out what SQLite
> is actually doing.  Before I did that it wasn’t obvious to me how little
> the table "docs" mattered for this query.

NO IT DOES NOT.

FROM <table1> JOIN <<table2> ON <expression>

IS EXACTLY THE SAME AS

FROM <table1>, <table2> WHERE <expression>

THERE IS NOT DIFFERENCE WHATSOEVER.  EVER.  PERIOD.

It is also exactly the same as

FROM <table1> JOIN <table2> WHERE <expression>
FROM <table2> JOIN <table1> WHERE <expression>
FROM <table2> JOIN <table1> ON <expression>
FROM <table2>, <table1> WHERE <expression>

The *ONLY TIME EVER THAT "ON <expression>" is not IDENTICAL to leaving out the JOIN keyword and putting the <expression> in the WHERE clause is the case of an OUTER (LEFT) JOIN.  In all other cases they are just different spellings of exactly the same thing.  The other somewhat limited exception is the CROSS JOIN which specifies the visitation order where the LHS table is the outer loop and the RHS table is the inner loop, however, in that case the ON clause is identical to a WHERE clause and is nothing more than alternate spelling (just like we spell colour properly here, but the Yangs spell it color).





_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Loading...