Optimising query with aggregate in subselect.


Andy Bennett
Hi,

I'm trying to implement a "streaming" version of the classic "select the
latest version of a record" query.


By "streaming" I mean a query that executes by streaming what it needs out
of tables and indexes as it needs it rather than using temporary b-trees or
materializing anything up front.

I'm looking for a query that I can run and then just consume as many
results as I want without worrying about the size of the entire result set.


Here's the schema I'm working with:

-----
CREATE TABLE "entrys" ("log-id" INTEGER NOT NULL , "entry-number" INTEGER
NOT NULL , "region" TEXT NOT NULL , "key" TEXT NOT NULL , "timestamp"
INTEGER NOT NULL , PRIMARY KEY ("log-id", "entry-number"))

CREATE UNIQUE INDEX "entrys-log-id-region-key-entry-number" ON "entrys" (
"log-id" ASC, "region" ASC, "key" ASC, "entry-number" ASC)
-----

There are only a couple of million rows in "entrys", yet my query takes 2
or 3 seconds of startup time before the first row is returned.



Here's my query:

-----
-- explain query plan
SELECT
"entrys"."log-id"       AS "log-id",
"entrys"."entry-number" AS "entry-number",
"entrys"."region"       AS "region",
"entrys"."key"          AS "key",
"entrys"."timestamp"    AS "timestamp"

FROM
        (SELECT
        MAX("entry-number") AS "entry-number",
        "key"
        FROM "entrys"
        WHERE
        "log-id" = 1 AND
        "region" = "user" AND
        "entry-number" <= 1700108
        AND key > "G"
        GROUP BY "key"
        ORDER BY "key" DESC
        limit 20 -- (1)
        ) AS "specific-entrys"

INNER JOIN "entrys"
ON
1 = "entrys"."log-id" AND
"specific-entrys"."key" = "entrys"."key" AND
"user" = "entrys"."region" AND
"specific-entrys"."entry-number" = "entrys"."entry-number"
AND "entrys"."key" > "G"

WHERE
"entrys"."log-id" = 1

ORDER BY "key" ASC
;
-----

...which has this query plan in SQLite version 3.31.0:

-----
QUERY PLAN
|--MATERIALIZE 1
|  `--SEARCH TABLE entrys USING COVERING INDEX
entrys-log-id-region-key-entry-number (log-id=? AND region=? AND key<?)
|--SEARCH TABLE entrys USING INDEX entrys-log-id-region-key-entry-number
(log-id=? AND region=? AND key<?)
`--SEARCH SUBQUERY 1 AS specific-entrys USING AUTOMATIC COVERING INDEX
(key=?)
-----


My problem is with the MATERIALIZE.

The query produces just shy of 2 million rows.

It takes several seconds to start up but is then pretty quick when fetching
each row.

What I want to do is get rid of the startup costs so that I can paginate it
efficiently.

If I run the subselect on its own then there is no startup cost. The
results just get streamed straight out of the index.
Its query plan is

-----
QUERY PLAN
`--SEARCH TABLE entrys USING COVERING INDEX
entrys-log-id-region-key-entry-number (log-id=? AND region=? AND key<?)
-----


Is there anything I can do to make the original version of the query stream
the results in this way?

The best I have come up with is to insert a LIMIT clause at the point
denoted with "-- (1)". This keeps the subselect small, so materialising it
and generating the automatic covering index becomes cheap.

For pagination I then feed in the key from the last row of the previous
batch at the points denoted with "-- (2)" and "-- (3)".

If I do this then it seems equally cheap to access batches at the start and
at the end of the complete result set. The per-query cost is determined by
the batch size as set by the LIMIT clause.
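
For concreteness, a sketch of one batch of the subselect with the pagination
parameter filled in (:lastkey is an illustrative parameter name for the last
key of the previous batch, and the ordering is ASC per the follow-up
correction):

-----
SELECT
MAX("entry-number") AS "entry-number",
"key"
FROM "entrys"
WHERE
"log-id" = 1 AND
"region" = 'user' AND
"entry-number" <= 1700108 AND
"key" > :lastkey -- (2): last key of the previous batch
GROUP BY "key"
ORDER BY "key" ASC
LIMIT 20 -- (1): batch size
;
-----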


However, I have been under the impression that LIMIT is supposed to be a
"debugging" extension to the language and not recommended for use in
queries that end up in one's program. LIMIT also only hides the latency; it
amortises it over each batch, but I still end up with memory requirements
in each "client" thread that are larger than I'd like; ideally I'd just
store the current row.



Thanks for any tips!




Best wishes,
@ndy


Re: Optimising query with aggregate in subselect.

Andy Bennett
Hi,

> ORDER BY "key" DESC

This should be ASC, not DESC: I've been working on versions of the query
that can go forwards and backwards and made an editing snafu when writing
the email.



Best wishes,
@ndy


Re: Optimising query with aggregate in subselect.

Simon Slavin-3
In reply to this post by Andy Bennett
On 20 Nov 2019, at 4:49pm, Andy Bennett <[hidden email]> wrote:

> INNER JOIN "entrys"
> ON
> 1 = "entrys"."log-id" AND
> "specific-entrys"."key" = "entrys"."key" AND
> "user" = "entrys"."region" AND
> "specific-entrys"."entry-number" = "entrys"."entry-number"
> AND "entrys"."key" > "G"

I can't solve your problem, but the PRIMARY KEY for "entrys" is ("log-id", "entry-number"). You shouldn't need to match so many different fields when just two of them, which you already have values for, narrow your search down to a single row. Though I may be missing something.

Re: Optimising query with aggregate in subselect.

Andy Bennett
Hi,

>> INNER JOIN "entrys"
>> ON
>> 1 = "entrys"."log-id" AND
>> "specific-entrys"."key" = "entrys"."key" AND
>> "user" = "entrys"."region" AND
>> "specific-entrys"."entry-number" = "entrys"."entry-number"
>> AND "entrys"."key" > "G"
>
> I can't solve your problem, but the PRIMARY KEY for "entrys" is
>
> ("log-id", "entry-number")
>
> .  You shouldn't need to match so many different fields when
> just two of them, which you already have values for, narrow your
> search down to a single row.  Though I may be missing something.

In past attempts at improving query performance these have been added to
encourage it to use an index that it can do a SCAN thru' rather than the
table that it would need to do a SEARCH thru'.

I'm pretty happy with the indexes it's currently choosing (apart from the
MATERIALIZE). Adding a covering index on timestamp theoretically improves
things but doesn't seem to make a (measurable) difference in practice with
current data sizes.
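
A sketch of the kind of covering index meant here (the name and exact column
order are assumptions, not taken from the thread):

-----
CREATE INDEX "entrys-log-id-region-key-entry-number-timestamp" ON "entrys" (
"log-id" ASC, "region" ASC, "key" ASC, "entry-number" ASC, "timestamp" ASC)
-----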




Best wishes,
@ndy


Re: Optimising query with aggregate in subselect.

Simon Slavin-3
On 20 Nov 2019, at 6:11pm, Andy Bennett <[hidden email]> wrote:

> In past attempts at improving query performance these have been added to encourage it to use an index that it can do a SCAN thru' rather than the table that it would need to do a SEARCH thru'.

SQLite is not using the PRIMARY INDEX to immediately locate the appropriate row, but is actually faster when you fake it into using a longer index? That's weird.

Re: Optimising query with aggregate in subselect.

Keith Medcalf
In reply to this post by Andy Bennett

Did you try retrieving the data "directly" or do you need the subselect in order to maintain compatibility with other SQL dialects that are no longer able to retrieve data from the row on which the max was found?

CREATE TABLE entrys
(
    logid       INTEGER NOT NULL,
    entrynumber INTEGER NOT NULL,
    region      TEXT NOT NULL,
    key         TEXT NOT NULL,
    timestamp   INTEGER NOT NULL,
    PRIMARY KEY (logid, entrynumber)
);

CREATE INDEX a on entrys (region, logid, key, entrynumber);

  SELECT entrys.logid            AS logid,
         max(entrys.entrynumber) AS entrynumber,
         entrys.region           AS region,
         entrys.key              AS key,
         entrys.timestamp        AS timestamp
    FROM entrys
   WHERE entrys.region = ?
     AND entrys.key > ?
     AND entrys.logid = ?
GROUP BY key
;

NB: I changed the ill-conceived column names to ones that do not require quoting, and replaced the identifier-quoted items that are not column names with parameter markers.
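
For the pagination scheme described upthread, the same direct form can carry
the extra bounds. A sketch (:lastkey and :maxentry are placeholder parameter
names; the ORDER BY and LIMIT come from the batching scheme in the original
post):

  SELECT entrys.logid            AS logid,
         max(entrys.entrynumber) AS entrynumber,
         entrys.region           AS region,
         entrys.key              AS key,
         entrys.timestamp        AS timestamp
    FROM entrys
   WHERE entrys.region = :region
     AND entrys.key > :lastkey
     AND entrys.logid = :logid
     AND entrys.entrynumber <= :maxentry
GROUP BY key
ORDER BY key ASC
   LIMIT 20
;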

Re: Optimising query with aggregate in subselect.

Andy Bennett
Hi,

> Did you try retrieving the data "directly" or do you need the
> subselect in order to maintain compatibility with other SQL
> dialects that are no longer able to retrieve data from the row
> on which the max was found?

Thanks Keith!

I understood that selecting other columns during an aggregate leads to
ill-specified or undefined values in those columns. Does SQLite make more
guarantees than the SQL standard here? Do you have a pointer to the docs,
as I tried and failed to find it in there?



>
> CREATE TABLE entrys
> (
>     logid       INTEGER NOT NULL,
>     entrynumber INTEGER NOT NULL,
>     region      TEXT NOT NULL,
>     key         TEXT NOT NULL,
>     timestamp   INTEGER NOT NULL,
>     PRIMARY KEY (logid, entrynumber)
> );
>
> CREATE INDEX a on entrys (region, logid, key, entrynumber);
>
>   SELECT entrys.logid            AS logid,
>          max(entrys.entrynumber) AS entrynumber,
>          entrys.region           AS region,
>          entrys.key              AS key,
>          entrys.timestamp        AS timestamp
>     FROM entrys
>    WHERE entrys.region = ?
>      AND entrys.key > ?
>      AND entrys.logid = ?
> GROUP BY key
> ;
>
> NB:  I changed the ill-conceived column names to ones that do
> not require quoting and the identifier quoted items that are not
> column names with parameter markers.





Best wishes,
@ndy


Re: Optimising query with aggregate in subselect.

Andy Bennett
In reply to this post by Simon Slavin-3
Hi,

>> In past attempts at improving query performance these have
>> been added to encourage it to use an index that it can do a
>> SCAN thru' rather than the table that it would need to do a
>> SEARCH thru'.
>
> SQLite is not using the PRIMARY INDEX to immediately locate the
> appropriate row, but is actually faster when you fake it into
> using a longer index ?  That's weird.

There is more than one row returned. Potentially several million.

...so it's quicker to do a linear scan thru' something that's in the same
order (table or index) rather than a series of tree accesses for each and
every row.



Best wishes,
@ndy


Re: Optimising query with aggregate in subselect.

Merijn Verstraaten
In reply to this post by Andy Bennett


> On 20 Nov 2019, at 20:37, Andy Bennett <[hidden email]> wrote:
>
> Hi,
>
>> Did you try retrieving the data "directly" or do you need the subselect in order to maintain compatibility with other SQL dialects that are no longer able to retrieve data from the row on which the max was found?
>
> Thanks Keith!
>
> I understood that selecting other columns during an aggregate lead to ill-specific or undefined values in those columns. Does SQLite make more guarantees than the SQL standard here? Do you have a pointer to the docs as I tried and failed to find it in there.

There's a small sidenote (that I'm too lazy to find right now) in the select docs that mentions that, in case of using min or max as aggregate, the non-aggregate columns will come from the row that held the min/max value.

Kind regards,
Merijn


Re: Optimising query with aggregate in subselect.

David Raymond
"There's a small sidenote (that I'm too lazy too find right now) in the select docs that mentions that, in case of using min or max as aggregate, the non-aggregate columns will come from the row that held the min/max value."


Look in
https://www.sqlite.org/quirks.html
under "6. Aggregate Queries Can Contain Non-Aggregate Result Columns That Are Not In The GROUP BY Clause"

and also in
https://www.sqlite.org/lang_select.html
In section 3 search for: "Side note: Bare columns in an aggregate queries."

Re: Optimising query with aggregate in subselect.

Keith Medcalf
In reply to this post by Andy Bennett

Yes.  See under item #3 in the Side note on https://sqlite.org/lang_select.html

Special processing occurs when the aggregate function is either min() or max(). Example:

    SELECT a, b, max(c) FROM tab1 GROUP BY a;

When the min() or max() aggregate functions are used in an aggregate query, all bare columns in the result set take values from the input row which also contains the minimum or maximum. So in the query above, the value of the "b" column in the output will be the value of the "b" column in the input row that has the largest "c" value. There is still an ambiguity if two or more of the input rows have the same minimum or maximum value or if the query contains more than one min() and/or max() aggregate function. Only the built-in min() and max() functions work this way.

Many other RDBMSs used to do this (e.g. Sybase, which became MS SQL Server, amongst others). This was "done away with" in most other SQL implementations and is indeed "non-standard" behaviour. The standard generally requires that items in the select list either be aggregates or appear in the GROUP BY list.

This is a specific implementation detail/feature of SQLite3 and will (probably) not be portable elsewhere ...


Re: Optimising query with aggregate in subselect.

Richard Damon
In reply to this post by Simon Slavin-3
On 11/20/19 1:26 PM, Simon Slavin wrote:
> On 20 Nov 2019, at 6:11pm, Andy Bennett <[hidden email]> wrote:
>
>> In past attempts at improving query performance these have been added to encourage it to use an index that it can do a SCAN thru' rather than the table that it would need to do a SEARCH thru'.
> SQLite is not using the PRIMARY INDEX to immediately locate the appropriate row, but is actually faster when you fake it into using a longer index ?  That's weird.

I think the issue is that the 'Primary Index' isn't really the primary index: since the table isn't WITHOUT ROWID and the key isn't an INTEGER PRIMARY KEY, the implied ROWID is the real primary key. Thus a lookup of a row by the declared primary key is a two-step lookup: find the key in the unique index, then look up the specified record by ROWID. If there is an index with the needed data, the second lookup isn't needed.
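
A sketch of the WITHOUT ROWID alternative being described here (the original
schema with only WITHOUT ROWID added): the declared PRIMARY KEY then becomes
the table's storage order, so the second lookup disappears.

-----
CREATE TABLE "entrys" ("log-id" INTEGER NOT NULL , "entry-number" INTEGER
NOT NULL , "region" TEXT NOT NULL , "key" TEXT NOT NULL , "timestamp"
INTEGER NOT NULL , PRIMARY KEY ("log-id", "entry-number")) WITHOUT ROWID
-----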

--
Richard Damon


Re: Optimising query with aggregate in subselect.

Andy Bennett
In reply to this post by David Raymond
Hi,

Thanks to everyone who helped with this!

I'll try some stuff out and see if I can get things efficient, fast *and*
simple.

:-)


> "There's a small sidenote (that I'm too lazy too find right
> now) in the select docs that mentions that, in case of using min
> or max as aggregate, the non-aggregate columns will come from
> the row that held the min/max value."
>
>
> Look in
> https://www.sqlite.org/quirks.html
> under "6. Aggregate Queries Can Contain Non-Aggregate Result
> Columns That Are Not In The GROUP BY Clause"
>
> and also in
> https://www.sqlite.org/lang_select.html
> In section 3 search for: "Side note: Bare columns in an aggregate queries."

--
Best wishes,
@ndy
