Skip-scan optimization

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Skip-scan optimization

Gerlando Falauto
Hi,

I read about the skip-scan optimization:

https://www.sqlite.org/optoverview.html#skipscan

is there a way to check whether it is being used for a given query, or not?

Explain query plan does not seem to give any insight...
I tried both before and after running ANALYZE; / DROP TABLE sqlite_stat1;
but it doesn't seem to make that much of a difference.

Thank you!
Gerlando
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Skip-scan optimization

Richard Hipp-3
On 1/24/19, Gerlando Falauto <[hidden email]> wrote:
> Hi,
>
> I read about the skip-scan optimization:
>
> https://www.sqlite.org/optoverview.html#skipscan
>
> is there a way to check whether it is being used for a given query, or not?
>
> Explain query plan does not seem to give any insight...

In the output from the script below, the "ANY(a)" in the EXPLAIN QUERY
PLAN output indicates that column "a" of the index is being skipped by
the skip-scan.

CREATE TABLE t1(a TEXT, b INT, c INT, d INT);
CREATE INDEX t1abc ON t1(a,b,c);
INSERT INTO t1 VALUES('abc',123,4,5);
INSERT INTO t1 VALUES('abc',234,5,6);
INSERT INTO t1 VALUES('abc',234,6,7);
INSERT INTO t1 VALUES('abc',345,7,8);
INSERT INTO t1 VALUES('def',567,8,9);
INSERT INTO t1 VALUES('def',345,9,10);
INSERT INTO t1 VALUES('bcd',100,6,11);
/* Fake the sqlite_stat1 table so that the query planner believes
** the table contains thousands of rows and that the first few
** columns are not selective. */
ANALYZE;
DELETE FROM sqlite_stat1;
INSERT INTO sqlite_stat1 VALUES('t1','t1abc','10000 5000 2000 10');
ANALYZE sqlite_master;

EXPLAIN QUERY PLAN
SELECT a,b,c,d,'|' FROM t1 WHERE d<>99 AND b=345 ORDER BY a;




> I tried both before and after running ANALYZE; / DROP TABLE sqlite_stat1;
> but it doesn't seem to make that much of a difference.
>
> Thank you!
> Gerlando
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Skip-scan optimization

Gerlando Falauto
Thank you!
There's one thing I don't understand though:
What is the purpose of ANALYZE sqlite_master; ?

Thank you!
Gerlando


On Thu, Jan 24, 2019 at 4:07 PM Richard Hipp <[hidden email]> wrote:

> On 1/24/19, Gerlando Falauto <[hidden email]> wrote:
> > Hi,
> >
> > I read about the skip-scan optimization:
> >
> > https://www.sqlite.org/optoverview.html#skipscan
> >
> > is there a way to check whether it is being used for a given query, or
> not?
> >
> > Explain query plan does not seem to give any insight...
>
> In the output from the script below, the "ANY(a)" in the EXPLAIN QUERY
> PLAN output indicates that column "a" of the index is being skipped by
> the skip-scan.
>
> CREATE TABLE t1(a TEXT, b INT, c INT, d INT);
> CREATE INDEX t1abc ON t1(a,b,c);
> INSERT INTO t1 VALUES('abc',123,4,5);
> INSERT INTO t1 VALUES('abc',234,5,6);
> INSERT INTO t1 VALUES('abc',234,6,7);
> INSERT INTO t1 VALUES('abc',345,7,8);
> INSERT INTO t1 VALUES('def',567,8,9);
> INSERT INTO t1 VALUES('def',345,9,10);
> INSERT INTO t1 VALUES('bcd',100,6,11);
> /* Fake the sqlite_stat1 table so that the query planner believes
> ** the table contains thousands of rows and that the first few
> ** columns are not selective. */
> ANALYZE;
> DELETE FROM sqlite_stat1;
> INSERT INTO sqlite_stat1 VALUES('t1','t1abc','10000 5000 2000 10');
> ANALYZE sqlite_master;
>
> EXPLAIN QUERY PLAN
> SELECT a,b,c,d,'|' FROM t1 WHERE d<>99 AND b=345 ORDER BY a;
>
>
>
>
> > I tried both before and after running ANALYZE; / DROP TABLE sqlite_stat1;
> > but it doesn't seem to make that much of a difference.
> >
> > Thank you!
> > Gerlando
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
>
>
> --
> D. Richard Hipp
> [hidden email]
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Skip-scan optimization

Richard Hipp-3
On 1/24/19, Gerlando Falauto <[hidden email]> wrote:
> What is the purpose of ANALYZE sqlite_master; ?

Causes the content of sqlite_stat1 to be reloaded into the query
planner after making out-of-band changes using UPDATE.
--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Skip-scan optimization

Gerlando Falauto
Hi,

I see, thanks for the explanation.
I still don't understand how the whole skip-scan optimization works though.
My use-case involves a table pretty much like this one:

CREATE TABLE `rolling` (
    `source1`    TEXT NOT NULL,
    `source2`    TEXT NOT NULL,
    `ts`    INTEGER NOT NULL,
    `value`    TEXT
);

CREATE INDEX `sources` ON `rolling` (
    `source1`,
    `source2`,
    `ts`
);

INSERT INTO rolling
    WITH RECURSIVE
      src1( source1 ) AS ( VALUES("aaa") UNION ALL VALUES("bbb") ),
      src2( source2 ) AS ( VALUES("X1") UNION ALL VALUES("X2") UNION ALL
VALUES("X3") UNION ALL VALUES("X4") ),
      cnt( ts, value) AS (
      VALUES( 0, "ZZZZ")
        UNION ALL
      SELECT ts+1, value FROM cnt LIMIT 1000000)

    select src1.source1, src2.source2, cnt.* from src1, src2, cnt;


So we have 2*4*1M = 8M rows, all indexed.
The first two columns (source1, source2) just serve the purpose of
identifying the data source and therefore have a very limited set of
possible values.
That's a good case for skip-scan, isn't it?

Here's a session where I try to run some benchmarks (notice this is running
on a Raspberry pi with USB storage).

SQLite version 3.25.0 2018-09-15 04:01:47
Enter ".help" for usage hints.
sqlite> .eqp on
sqlite> .timer on
sqlite> select distinct source1,source2 from rolling;
QUERY PLAN
`--SCAN TABLE rolling USING COVERING INDEX sources
aaa|X1
aaa|X2
aaa|X3
aaa|X4
bbb|X1
bbb|X2
bbb|X3
bbb|X4
Run Time: real 6.767 user 6.200000 sys 0.390000

OK, so that's pretty slow.
So I run analyze:

sqlite> analyze;
Run Time: real 12.572 user 10.890000 sys 1.400000
sqlite> select distinct source1,source2 from rolling;
QUERY PLAN
`--SCAN TABLE rolling USING COVERING INDEX sources
aaa|X1
aaa|X2
aaa|X3
aaa|X4
bbb|X1
bbb|X2
bbb|X3
bbb|X4
Run Time: real 0.001 user 0.000000 sys 0.000000
sqlite>

That's quite an improvement! There's two things I don't understand though:
1) Why that needs the DB to be ANALYZEd first. That's just the first 2
columns of an existing index.
This optimization should always pay off, no matter how big the tables
2) Why the two query plans look identical (no indication of any skip-scan
kicking in)

Now I just want to get the first ts value for each (source1,source2) pair:

sqlite> select distinct source1,source2,min(ts) from rolling group by
source1, source2;
QUERY PLAN
|--SCAN TABLE rolling USING COVERING INDEX sources
`--USE TEMP B-TREE FOR DISTINCT
aaa|X1|0
aaa|X2|0
aaa|X3|0
aaa|X4|0
bbb|X1|0
bbb|X2|0
bbb|X3|0
bbb|X4|0
Run Time: real 11.394 user 10.650000 sys 0.630000
sqlite>

Here I would expect to query to be as fast as the previous one. After all,
ts is also part of the index, so min(ts) should just be the first value.
OK, so perhaps it's the distinct clause?

sqlite> select source1,source2,min(ts) from rolling group by source1,
source2;
QUERY PLAN
`--SCAN TABLE rolling USING COVERING INDEX sources
aaa|X1|0
aaa|X2|0
aaa|X3|0
aaa|X4|0
bbb|X1|0
bbb|X2|0
bbb|X3|0
bbb|X4|0
Run Time: real 10.760 user 10.360000 sys 0.330000

Not much of an improvement indeed. Perhaps ordering?

sqlite> select source1,source2,min(ts) from rolling group by source1,
source2 order by source1, source2;
QUERY PLAN
`--SCAN TABLE rolling USING COVERING INDEX sources
aaa|X1|0
aaa|X2|0
aaa|X3|0
aaa|X4|0
bbb|X1|0
bbb|X2|0
bbb|X3|0
bbb|X4|0
Run Time: real 10.699 user 10.410000 sys 0.260000

Nope. What if I try some seek?

sqlite> select source1,source2,min(ts) from rolling where ts>=999999 group
by source1, source2;
QUERY PLAN
`--SEARCH TABLE rolling USING COVERING INDEX sources (ANY(source1) AND
ANY(source2) AND ts>?)
aaa|X1|999999
aaa|X2|999999
aaa|X3|999999
aaa|X4|999999
bbb|X1|999999
bbb|X2|999999
bbb|X3|999999
bbb|X4|999999
Run Time: real 0.001 user 0.000000 sys 0.000000

YES! That's skip-scan kicking in, and it's finally telling me with the
ANY() clauses.
However, I still can't seem to get the min(ts) query to be optimized.
Perhaps I need some sub-query?
Any ideas?

Thank you!
Gerlando

On Thu, Jan 24, 2019 at 4:47 PM Richard Hipp <[hidden email]> wrote:

> On 1/24/19, Gerlando Falauto <[hidden email]> wrote:
> > What is the purpose of ANALYZE sqlite_master; ?
>
> Causes the content of sqlite_stat1 to be reloaded into the query
> planner after making out-of-band changes using UPDATE.
> --
> D. Richard Hipp
> [hidden email]
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users