Slow SQL statements

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Slow SQL statements

Ludvig Strigeus-2
Hello,

Can someone come up with some slow SQL statements (that execute in
like 2 seconds) that are not disk bound but CPU bound. Preferably
single liners.

I'm playing around with a profiler and trying to find bottlenecks in
sqlite and optimizing them.

/Ludvig
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Slow SQL statements

Al Danial
I created an SQLite database of baseball stats (based on
data pulled from http://baseball1.info/) to show off SQLite
at a presentation yesterday.  There's enough data (over
340,000 total rows in 21 tables) to make for some pretty
slow queries.  Download the SQL statements and set up
the database with

wget http://danial.org/sqlite/lampsig/baseball.sql.bz2
bunzip2 -dc baseball.sql.bz2 | sqlite baseball.db
sqlite baseball.db 'create index i1 on master(playerid);'
sqlite baseball.db 'create index i2 on fielding(playerid);'
sqlite baseball.db 'create index i3 on batting(playerid);'
sqlite baseball.db 'create index i4 on fielding(lgid);'

then try queries such as

sqlite baseball.db 'select playerid,sb from batting where sb = (select
max(sb) from batting where playerid in (select playerid from fielding
where pos = "3B" and lgid = "NL"));'

which tries to answer the question "which National League third
baseman had the most stolen bases in a season?"  Then again,
the query could be slow because I didn't formulate it correctly.     -- Al


On 5/22/05, Ludvig Strigeus <[hidden email]> wrote:

> Hello,
>
> Can someone come up with some slow SQL statements (that execute in
> like 2 seconds) that are not disk bound but CPU bound. Preferably
> single liners.
>
> I'm playing around with a profiler and trying to find bottlenecks in
> sqlite and optimizing them.
>
> /Ludvig
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Slow SQL statements

D. Richard Hipp
On Sun, 2005-05-22 at 19:42 -0400, Al Danial wrote:

> then try queries such as
>
> sqlite baseball.db 'select playerid,sb from batting where sb = (select
> max(sb) from batting where playerid in (select playerid from fielding
> where pos = "3B" and lgid = "NL"));'
>
> which tries to answer the question "which National League third
> baseman had the most stolen bases in a season?"  Then again,
> the query could be slow because I didn't formulate it correctly.     -- Al
>

The use of double-quotes instead of single-quotes is confusing
the optimizer somehow.  (I'm still looking why this is.)  SQL
(not just SQLite but the SQL language in general) really wants
you to use single quotes.  If you change "NL" to 'NL' the query
will go very quickly, even without creating any indices.
--
D. Richard Hipp <[hidden email]>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Slow SQL statements

D. Richard Hipp
On Sun, 2005-05-22 at 22:02 -0400, D. Richard Hipp wrote:
> On Sun, 2005-05-22 at 19:42 -0400, Al Danial wrote:
> >
> > sqlite baseball.db 'select playerid,sb from batting where sb = (select
> > max(sb) from batting where playerid in (select playerid from fielding
> > where pos = "3B" and lgid = "NL"));'
> >
>
> The use of double-quotes instead of single-quotes is confusing
> the optimizer somehow.  (I'm still looking why this is.)

I think I see what is going on now.  For clarity of presentation,
allow me to reformat the query:

   SELECT playerid, sb FROM batting WHERE sb =
     (SELECT max(sb) FROM batting WHERE playerid IN
        (SELECT player FROM fielding WHERE pos="3B" AND lgid="NL"));

There are three nested queries.  The outer query uses a subquery in
its WHERE clause.  The WHERE clause of the outer query is
essentially "sb = <subquery>".  This subquery in turn as a subquery
of its WHERE clause.

The problem arises when SQLite tries to resolve the name "NL".
The rules of SQL require that a string in double-quotes should
first try to resolve to a column name.  If that fails, then treat
then string as a string literal.  This is what SQLite does:

   (1) First it tries to resolve the name "NL" in the context
       of the inner-most query.  But NL is not a valid identifier
       there so it fails.
   (2) Next it tries to resolve the name NL in the middle query.
       Once again, NL is not a valid identifier there so resolution
       fails again.
   (3) Then it tries to resolve the name NL in the outer query.
       As before, this fails.
   (4) Having not been able to resolve the name in any context,
       SQLite uses "NL" as a string literal.

Because of steps 2 and 3 - because SQLite had to look into outer
queries trying to resolve a name in the inner-most subquery, it
assumes that the inner-most subquery was a correlated subquery
and would hence need to be reevaluated once for each row in the
middle and outer queries.  This results in O(N^3) performance,
regardless of what indices you use.

By changing "NL" and "3B" to 'NL' and '3B', steps 1 through 3
in the name resolution process are avoided.  SQLite immediately
recognizes that the inner subquery is fixed, evaluates it exactly
once and saves the result.  This results in O(N) runtime performance.
Much, much faster.
--
D. Richard Hipp <[hidden email]>

Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Slow SQL statements

Ludvig Strigeus-2
Why doesn't SQL provide a utility function: maxv

Then you could (almost) write it like this:
SELECT maxv(sb, playerid) FROM batting WHERE playerid IN
       (SELECT player FROM fielding WHERE pos='3B' AND lgid='NL'));

The semantics of maxv(arg, value) would be that it finds the maximum
of arg, but instead of returning arg, it returns value for the record
with the maximum arg.

Maybe it could be possible to make it a multi argument function:
maxv(arg, value, value2)
which would return 2 columns, value and value2.

Or alternatively, make a function maxid(arg) that returns the rowid of
the maximum arg. (But this can be implemented as maxv(arg, rowid) )

How about adding such a function to SQLite? It seems useful.

/Ludvig


On 5/23/05, D. Richard Hipp <[hidden email]> wrote:

> On Sun, 2005-05-22 at 22:02 -0400, D. Richard Hipp wrote:
> > On Sun, 2005-05-22 at 19:42 -0400, Al Danial wrote:
> > >
> > > sqlite baseball.db 'select playerid,sb from batting where sb = (select
> > > max(sb) from batting where playerid in (select playerid from fielding
> > > where pos = "3B" and lgid = "NL"));'
> > >
> >
> > The use of double-quotes instead of single-quotes is confusing
> > the optimizer somehow.  (I'm still looking why this is.)
>
> I think I see what is going on now.  For clarity of presentation,
> allow me to reformat the query:
>
>    SELECT playerid, sb FROM batting WHERE sb =
>      (SELECT max(sb) FROM batting WHERE playerid IN
>         (SELECT player FROM fielding WHERE pos="3B" AND lgid="NL"));
>
> There are three nested queries.  The outer query uses a subquery in
> its WHERE clause.  The WHERE clause of the outer query is
> essentially "sb = <subquery>".  This subquery in turn as a subquery
> of its WHERE clause.
>
> The problem arises when SQLite tries to resolve the name "NL".
> The rules of SQL require that a string in double-quotes should
> first try to resolve to a column name.  If that fails, then treat
> then string as a string literal.  This is what SQLite does:
>
>    (1) First it tries to resolve the name "NL" in the context
>        of the inner-most query.  But NL is not a valid identifier
>        there so it fails.
>    (2) Next it tries to resolve the name NL in the middle query.
>        Once again, NL is not a valid identifier there so resolution
>        fails again.
>    (3) Then it tries to resolve the name NL in the outer query.
>        As before, this fails.
>    (4) Having not been able to resolve the name in any context,
>        SQLite uses "NL" as a string literal.
>
> Because of steps 2 and 3 - because SQLite had to look into outer
> queries trying to resolve a name in the inner-most subquery, it
> assumes that the inner-most subquery was a correlated subquery
> and would hence need to be reevaluated once for each row in the
> middle and outer queries.  This results in O(N^3) performance,
> regardless of what indices you use.
>
> By changing "NL" and "3B" to 'NL' and '3B', steps 1 through 3
> in the name resolution process are avoided.  SQLite immediately
> recognizes that the inner subquery is fixed, evaluates it exactly
> once and saves the result.  This results in O(N) runtime performance.
> Much, much faster.
> --
> D. Richard Hipp <[hidden email]>
>
>
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Slow SQL statements

Andrew Piskorski
On Mon, May 23, 2005 at 08:00:22AM +0200, Ludvig Strigeus wrote:
> Why doesn't SQL provide a utility function: maxv
>
> Then you could (almost) write it like this:
> SELECT maxv(sb, playerid) FROM batting WHERE playerid IN
>        (SELECT player FROM fielding WHERE pos='3B' AND lgid='NL'));
>
> The semantics of maxv(arg, value) would be that it finds the maximum
> of arg, but instead of returning arg, it returns value for the record
> with the maximum arg.

I always thought of that feature as "look-aside", as in, "Find the max
of foo, but don't give me that, instead look aside and return me the
corresponding value of bar next to it instead."  I've wanted that
feature for years, and then not long ago discovered, buried in the
Oracle docs, that it exists in Oracle 9i and up!  Since you want the
playerid with the most sb (stolen bases), in Oracle 9.2.x it would be
something like this:

select
  max(playerid) keep (dense_rank last order by sb) as top_stealer
from batting

I don't know whether the SQL Standard includes any such feature, but
it sure is fabulously useful.  With a simple queries like these you
can make it work without that feature, but when you're in the middle
of a complex query that's already a page or two long, without this
sort of feature, there's often no sane way to do it at all.

--
Andrew Piskorski <[hidden email]>
http://www.piskorski.com/
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Slow SQL statements

D. Richard Hipp
In reply to this post by D. Richard Hipp
On Sun, 2005-05-22 at 22:02 -0400, D. Richard Hipp wrote:

> On Sun, 2005-05-22 at 19:42 -0400, Al Danial wrote:
> > then try queries such as
> >
> > sqlite baseball.db 'select playerid,sb from batting where sb = (select
> > max(sb) from batting where playerid in (select playerid from fielding
> > where pos = "3B" and lgid = "NL"));'
> >
> > which tries to answer the question "which National League third
> > baseman had the most stolen bases in a season?"  Then again,
> > the query could be slow because I didn't formulate it correctly.     -- Al
> >
>
> The use of double-quotes instead of single-quotes is confusing
> the optimizer somehow.  (I'm still looking why this is.)  

This issue has now been resolved.  Using the code in CVS, the
query above should be quite fast, even without changing the
"NL" to 'NL'.
--
D. Richard Hipp <[hidden email]>

Loading...