Fastest way to SELECT on a set of keys?

classic Classic list List threaded Threaded
19 messages Options
Reply | Threaded
Open this post in threaded view
|

Fastest way to SELECT on a set of keys?

Jens Alfke-2
If I have a set of primary keys (let's say a few hundred) and need to fetch data from the table rows with those keys, what's the fastest way to do so? The options seem to be:

(a) Execute "SELECT … FROM table WHERE key=?", once for each key.
(b) Execute "SELECT key, … FROM table WHERE key IN (…)", including all of the key strings.

If I do (a), I can pre-prepare the statement and save the overhead of compilation. But SQLite has to go through the rest of its work (starting the virtual machine, b-tree lookup, etc.) once for each key.

If I do (b), SQLite has less setup work to do, and it could potentially optimize the b-tree lookup. On the downside, I have to prepare a statement every time since the RHS of an "IN" isn't substitutable.

Does anyone have intuition or actual knowledge about which approach is better? Or know of a 3rd better approach?

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Fastest way to SELECT on a set of keys?

Simon Slavin-3
On 13 Sep 2019, at 5:38pm, Jens Alfke <[hidden email]> wrote:

> Does anyone have intuition or actual knowledge about which approach is better? Or know of a 3rd better approach?

My guess is (b), but it will depend on your particular setup.  Depends on cache size, storage speed, whether your OS is real or virtualized, etc..  I don't think the overhead of preparation will cause much of a delay.

Solution (b) will require more memory than (a) since it has to keep the array of all keys in memory until the command is finished.

There is, of course, solution (c): read every row and check in your software whether it has one of the keys you want.  This requires preparing and executing one statement.  If your list of keys covers most of the rows this may be fastest.  And it uses the least memory.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Fastest way to SELECT on a set of keys?

Jose Isaias Cabrera-4
In reply to this post by Jens Alfke-2

Jens Alfke, on Friday, September 13, 2019 12:38 PM, wrote...

> (a) Execute "SELECT … FROM table WHERE key=?", once for each key.
> (b) Execute "SELECT key, … FROM table WHERE key IN (…)", including all of the key strings.

I have found that the ... IN ... has provided a much faster result than the previous one.  But, that is in my case.

josé
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Fastest way to SELECT on a set of keys?

Richard Hipp-3
In reply to this post by Jens Alfke-2
On 9/13/19, Jens Alfke <[hidden email]> wrote:

> If I have a set of primary keys (let's say a few hundred) and need to fetch
> data from the table rows with those keys, what's the fastest way to do so?
> The options seem to be:
>
> (a) Execute "SELECT … FROM table WHERE key=?", once for each key.
> (b) Execute "SELECT key, … FROM table WHERE key IN (…)", including all of
> the key strings.
>
> If I do (a), I can pre-prepare the statement and save the overhead of
> compilation. But SQLite has to go through the rest of its work (starting the
> virtual machine, b-tree lookup, etc.) once for each key.
>
> If I do (b), SQLite has less setup work to do, and it could potentially
> optimize the b-tree lookup. On the downside, I have to prepare a statement
> every time since the RHS of an "IN" isn't substitutable.
>
> Does anyone have intuition or actual knowledge about which approach is
> better? Or know of a 3rd better approach?

A third option is to use the carray-extension to create an IN query
that is substitutable.

https://www.sqlite.org/src/file/ext/misc/carray.c

--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Fastest way to SELECT on a set of keys?

Graham Holden
In reply to this post by Jens Alfke-2
Another possibility... INSERT the keys in a temporary table and do an appropriate JOIN.Sent from my Samsung Galaxy S7 - powered by Three
-------- Original message --------From: Simon Slavin <[hidden email]> Date: 13/09/2019  17:51  (GMT+00:00) To: SQLite mailing list <[hidden email]> Subject: Re: [sqlite] Fastest way to SELECT on a set of keys? On 13 Sep 2019, at 5:38pm, Jens Alfke <[hidden email]> wrote:> Does anyone have intuition or actual knowledge about which approach is better? Or know of a 3rd better approach?My guess is (b), but it will depend on your particular setup.  Depends on cache size, storage speed, whether your OS is real or virtualized, etc..  I don't think the overhead of preparation will cause much of a delay.Solution (b) will require more memory than (a) since it has to keep the array of all keys in memory until the command is finished.There is, of course, solution (c): read every row and check in your software whether it has one of the keys you want.  This requires preparing and executing one statement.  If your list of keys covers most of the rows this may be fastest.  And it uses the least memory._______________________________________________sqlite-users mailing [hidden email]://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Fastest way to SELECT on a set of keys?

Hick Gunter
In reply to this post by Jens Alfke-2
WITH list (key) AS (VALUES (<value),...) SELECT table.key, ... FROM list cross join table on (list.key = table.key);

This forces SQLite to use the list as the outer loop and perform a key lookup. This is faster if the number of keys in the list is small relative to the number of records in the table

If the number of keys is similar to the number of records in the table, then a simple full table scan may be faster.

-----Ursprüngliche Nachricht-----
Von: sqlite-users [mailto:[hidden email]] Im Auftrag von Jens Alfke
Gesendet: Freitag, 13. September 2019 18:39
An: SQLite mailing list <[hidden email]>
Betreff: [EXTERNAL] [sqlite] Fastest way to SELECT on a set of keys?

If I have a set of primary keys (let's say a few hundred) and need to fetch data from the table rows with those keys, what's the fastest way to do so? The options seem to be:

(a) Execute "SELECT … FROM table WHERE key=?", once for each key.
(b) Execute "SELECT key, … FROM table WHERE key IN (…)", including all of the key strings.

If I do (a), I can pre-prepare the statement and save the overhead of compilation. But SQLite has to go through the rest of its work (starting the virtual machine, b-tree lookup, etc.) once for each key.

If I do (b), SQLite has less setup work to do, and it could potentially optimize the b-tree lookup. On the downside, I have to prepare a statement every time since the RHS of an "IN" isn't substitutable.

Does anyone have intuition or actual knowledge about which approach is better? Or know of a 3rd better approach?

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___________________________________________
 Gunter Hick | Software Engineer | Scientific Games International GmbH | Klitschgasse 2-4, A-1130 Vienna | FN 157284 a, HG Wien, DVR: 0430013 | (O) +43 1 80100 - 0

May be privileged. May be confidential. Please delete if not the addressee.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Fastest way to SELECT on a set of keys?

Keith Medcalf
In reply to this post by Jens Alfke-2

That depends greatly on the overhead you have for executing each select statement.  So I wrote a little test that uses my customized apsw library from Python 3.  It also works using the as-distributed sqlite3 wrapper (except for the carray interface, which requires my customized apsw to be able to build and pass the object).  The overheads associated with each method are included in the elapsed time.  The only thing that is clear is that where the overhead of executing each select is significant it is clearly better to execute fewer of them.

>st 1
Method 1: Retrieve Individual Row 00:00:00.103779
Method 2: Individual Row (Sorted) 00:00:00.109945
Method 3: using dynamic in        00:00:00.137431
Method 4: using sorted dynamic in 00:00:00.110824
Method 5: using in carray         00:00:00.171037
Method 5: using in carray sorted  00:00:00.165992

>st 10
Method 1: Retrieve Individual Row 00:00:01.023160
Method 2: Individual Row (Sorted) 00:00:01.187180
Method 3: using dynamic in        00:00:00.159182
Method 4: using sorted dynamic in 00:00:00.175053
Method 5: using in carray         00:00:00.192246
Method 5: using in carray sorted  00:00:00.154138

>st 100
Method 1: Retrieve Individual Row 00:00:10.543783
Method 2: Individual Row (Sorted) 00:00:10.305251
Method 3: using dynamic in        00:00:00.196502
Method 4: using sorted dynamic in 00:00:00.176414
Method 5: using in carray         00:00:00.203340
Method 5: using in carray sorted  00:00:00.191570

>st 1000
Method 1: Retrieve Individual Row 00:01:40.558009
Method 2: Individual Row (Sorted) 00:01:42.051622
Method 3: using dynamic in        00:00:00.246542
Method 4: using sorted dynamic in 00:00:00.238268
Method 5: using in carray         00:00:00.249394
Method 5: using in carray sorted  00:00:00.243244

>st 10000
Method 3: using dynamic in        00:00:00.277059
Method 4: using sorted dynamic in 00:00:00.296931
Method 5: using in carray         00:00:00.297005
Method 5: using in carray sorted  00:00:00.322317

>st 100000
Method 3: using dynamic in        00:00:00.761905
Method 4: using sorted dynamic in 00:00:00.765864
Method 5: using in carray         00:00:00.757057
Method 5: using in carray sorted  00:00:00.691111

>st 1000000
Method 3: using dynamic in        00:00:04.129529
Method 4: using sorted dynamic in 00:00:04.301129
Method 5: using in carray         00:00:04.114985
Method 5: using in carray sorted  00:00:04.417498


And the code:

#! python3

import apsw
import datetime
import random
import sqlite3
import sys
import time

datasize = 1000000
rows = int(sys.argv[1])

elapsed = lambda st, et: datetime.datetime.utcfromtimestamp((et - st)).time()

db = apsw.Connection('')
#db = sqlite3.connect('', isolation_level=None)

db.executescript('''
create table x
(
    id      integer primay key,
    data    blob
);
insert into x select value, randomblob(500) from generate_series where start=1 and stop=%d;
''' % (datasize,))

rowset = [random.randint(1, datasize) for i in range(rows)]

if rows <= 1000:
    print('Method 1: Retrieve Individual Row', end=' ', flush=True)
    st = time.time()
    db.executescript('BEGIN')
    for key in rowset:
        for row in db.execute('select * from x where id=?', (key,)):
            pass
    db.commit()
    print(elapsed(st, time.time()))

    print('Method 2: Individual Row (Sorted)', end=' ', flush=True)
    st = time.time()
    db.executescript('BEGIN')
    for key in sorted(rowset):
        for row in db.execute('select * from x where id=?', (key,)):
            pass
    db.commit()
    print(elapsed(st, time.time()))

print('Method 3: using dynamic in       ', end=' ', flush=True)
st = time.time()
for row in db.execute('select * from x where id in (' + ','.join(map(str, rowset)) + ')'):
    pass
print(elapsed(st, time.time()))

print('Method 4: using sorted dynamic in', end=' ', flush=True)
st = time.time()
for row in db.execute('select * from x where id in (' + ','.join(map(str, sorted(rowset))) + ')'):
    pass
print(elapsed(st, time.time()))

print('Method 5: using in carray        ', end=' ', flush=True)
st = time.time()
for row in db.execute('select * from x where id in carray(:l_address, :l_length, :l_type)', apsw.carray('l', rowset)):
    pass
print(elapsed(st, time.time()))

print('Method 5: using in carray sorted ', end=' ', flush=True)
st = time.time()
for row in db.execute('select * from x where id in carray(:l_address, :l_length, :l_type)', apsw.carray('l', sorted(rowset))):
    pass
print(elapsed(st, time.time()))

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users <[hidden email]> On
>Behalf Of Jens Alfke
>Sent: Friday, 13 September, 2019 10:39
>To: SQLite mailing list <[hidden email]>
>Subject: [sqlite] Fastest way to SELECT on a set of keys?
>
>If I have a set of primary keys (let's say a few hundred) and need to
>fetch data from the table rows with those keys, what's the fastest way to
>do so? The options seem to be:
>
>(a) Execute "SELECT … FROM table WHERE key=?", once for each key.
>(b) Execute "SELECT key, … FROM table WHERE key IN (…)", including all of
>the key strings.
>
>If I do (a), I can pre-prepare the statement and save the overhead of
>compilation. But SQLite has to go through the rest of its work (starting
>the virtual machine, b-tree lookup, etc.) once for each key.
>
>If I do (b), SQLite has less setup work to do, and it could potentially
>optimize the b-tree lookup. On the downside, I have to prepare a
>statement every time since the RHS of an "IN" isn't substitutable.
>
>Does anyone have intuition or actual knowledge about which approach is
>better? Or know of a 3rd better approach?
>
>—Jens
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Fastest way to SELECT on a set of keys?

Doug
I blows me away that you are able to produce such things as this at the drop of a hat!
Thanks for your insights and ingenuity and completeness!
Doug

> -----Original Message-----
> From: sqlite-users <[hidden email]>
> On Behalf Of Keith Medcalf
> Sent: Friday, September 13, 2019 1:30 PM
> To: SQLite mailing list <[hidden email]>
> Subject: Re: [sqlite] Fastest way to SELECT on a set of keys?
>
>
> That depends greatly on the overhead you have for executing each
> select statement.  So I wrote a little test that uses my
> customized apsw library from Python 3.  It also works using the
> as-distributed sqlite3 wrapper (except for the carray interface,
> which requires my customized apsw to be able to build and pass the
> object).  The overheads associated with each method are included
> in the elapsed time.  The only thing that is clear is that where
> the overhead of executing each select is significant it is clearly
> better to execute fewer of them.
>
> >st 1
> Method 1: Retrieve Individual Row 00:00:00.103779
> Method 2: Individual Row (Sorted) 00:00:00.109945
> Method 3: using dynamic in        00:00:00.137431
> Method 4: using sorted dynamic in 00:00:00.110824
> Method 5: using in carray         00:00:00.171037
> Method 5: using in carray sorted  00:00:00.165992
>
> >st 10
> Method 1: Retrieve Individual Row 00:00:01.023160
> Method 2: Individual Row (Sorted) 00:00:01.187180
> Method 3: using dynamic in        00:00:00.159182
> Method 4: using sorted dynamic in 00:00:00.175053
> Method 5: using in carray         00:00:00.192246
> Method 5: using in carray sorted  00:00:00.154138
>
> >st 100
> Method 1: Retrieve Individual Row 00:00:10.543783
> Method 2: Individual Row (Sorted) 00:00:10.305251
> Method 3: using dynamic in        00:00:00.196502
> Method 4: using sorted dynamic in 00:00:00.176414
> Method 5: using in carray         00:00:00.203340
> Method 5: using in carray sorted  00:00:00.191570
>
> >st 1000
> Method 1: Retrieve Individual Row 00:01:40.558009
> Method 2: Individual Row (Sorted) 00:01:42.051622
> Method 3: using dynamic in        00:00:00.246542
> Method 4: using sorted dynamic in 00:00:00.238268
> Method 5: using in carray         00:00:00.249394
> Method 5: using in carray sorted  00:00:00.243244
>
> >st 10000
> Method 3: using dynamic in        00:00:00.277059
> Method 4: using sorted dynamic in 00:00:00.296931
> Method 5: using in carray         00:00:00.297005
> Method 5: using in carray sorted  00:00:00.322317
>
> >st 100000
> Method 3: using dynamic in        00:00:00.761905
> Method 4: using sorted dynamic in 00:00:00.765864
> Method 5: using in carray         00:00:00.757057
> Method 5: using in carray sorted  00:00:00.691111
>
> >st 1000000
> Method 3: using dynamic in        00:00:04.129529
> Method 4: using sorted dynamic in 00:00:04.301129
> Method 5: using in carray         00:00:04.114985
> Method 5: using in carray sorted  00:00:04.417498
>
>
> And the code:
>
> #! python3
>
> import apsw
> import datetime
> import random
> import sqlite3
> import sys
> import time
>
> datasize = 1000000
> rows = int(sys.argv[1])
>
> elapsed = lambda st, et: datetime.datetime.utcfromtimestamp((et -
> st)).time()
>
> db = apsw.Connection('')
> #db = sqlite3.connect('', isolation_level=None)
>
> db.executescript('''
> create table x
> (
>     id      integer primay key,
>     data    blob
> );
> insert into x select value, randomblob(500) from generate_series
> where start=1 and stop=%d;
> ''' % (datasize,))
>
> rowset = [random.randint(1, datasize) for i in range(rows)]
>
> if rows <= 1000:
>     print('Method 1: Retrieve Individual Row', end=' ',
> flush=True)
>     st = time.time()
>     db.executescript('BEGIN')
>     for key in rowset:
>         for row in db.execute('select * from x where id=?',
> (key,)):
>             pass
>     db.commit()
>     print(elapsed(st, time.time()))
>
>     print('Method 2: Individual Row (Sorted)', end=' ',
> flush=True)
>     st = time.time()
>     db.executescript('BEGIN')
>     for key in sorted(rowset):
>         for row in db.execute('select * from x where id=?',
> (key,)):
>             pass
>     db.commit()
>     print(elapsed(st, time.time()))
>
> print('Method 3: using dynamic in       ', end=' ', flush=True)
> st = time.time()
> for row in db.execute('select * from x where id in (' +
> ','.join(map(str, rowset)) + ')'):
>     pass
> print(elapsed(st, time.time()))
>
> print('Method 4: using sorted dynamic in', end=' ', flush=True)
> st = time.time()
> for row in db.execute('select * from x where id in (' +
> ','.join(map(str, sorted(rowset))) + ')'):
>     pass
> print(elapsed(st, time.time()))
>
> print('Method 5: using in carray        ', end=' ', flush=True)
> st = time.time()
> for row in db.execute('select * from x where id in
> carray(:l_address, :l_length, :l_type)', apsw.carray('l',
> rowset)):
>     pass
> print(elapsed(st, time.time()))
>
> print('Method 5: using in carray sorted ', end=' ', flush=True)
> st = time.time()
> for row in db.execute('select * from x where id in
> carray(:l_address, :l_length, :l_type)', apsw.carray('l',
> sorted(rowset))):
>     pass
> print(elapsed(st, time.time()))
>
> --
> The fact that there's a Highway to Hell but only a Stairway to
> Heaven says a lot about anticipated traffic volume.
>
> >-----Original Message-----
> >From: sqlite-users <[hidden email]>
> On
> >Behalf Of Jens Alfke
> >Sent: Friday, 13 September, 2019 10:39
> >To: SQLite mailing list <[hidden email]>
> >Subject: [sqlite] Fastest way to SELECT on a set of keys?
> >
> >If I have a set of primary keys (let's say a few hundred) and
> need to
> >fetch data from the table rows with those keys, what's the
> fastest way to
> >do so? The options seem to be:
> >
> >(a) Execute "SELECT … FROM table WHERE key=?", once for each key.
> >(b) Execute "SELECT key, … FROM table WHERE key IN (…)",
> including all of
> >the key strings.
> >
> >If I do (a), I can pre-prepare the statement and save the
> overhead of
> >compilation. But SQLite has to go through the rest of its work
> (starting
> >the virtual machine, b-tree lookup, etc.) once for each key.
> >
> >If I do (b), SQLite has less setup work to do, and it could
> potentially
> >optimize the b-tree lookup. On the downside, I have to prepare a
> >statement every time since the RHS of an "IN" isn't
> substitutable.
> >
> >Does anyone have intuition or actual knowledge about which
> approach is
> >better? Or know of a 3rd better approach?
> >
> >—Jens
> >_______________________________________________
> >sqlite-users mailing list
> >[hidden email]
> >http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-
> users
>
>
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-
> users

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Fastest way to SELECT on a set of keys?

Jens Alfke-2
In reply to this post by Keith Medcalf


> On Sep 13, 2019, at 1:30 PM, Keith Medcalf <[hidden email]> wrote:
>
>  The only thing that is clear is that where the overhead of executing each select is significant it is clearly better to execute fewer of them.

Thanks for the research, Keith!

In my case the per-query overhead is lower since I'm using the C interface, but I'm betting that it's still nontrivial, so I've decided to go with the "dynamic in" approach. From your measurements it doesn't look like further optimizations like carray are worth it.

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Fastest way to SELECT on a set of keys?

Jens Alfke-2
In reply to this post by Hick Gunter


> On Sep 13, 2019, at 10:57 AM, Hick Gunter <[hidden email]> wrote:
>
> This is faster if the number of keys in the list is small relative to the number of records in the table.
> If the number of keys is similar to the number of records in the table, then a simple full table scan may be faster.

Experimentally, the optimizer seems to choose an index search even with the simpler query. I ran this on a test database with about 30k rows.

> explain query plan select * from kv_default where key in ('a','b','c')

3|0|0|SEARCH TABLE kv_default USING INDEX sqlite_autoindex_kv_default_1 (key=?)

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Fastest way to SELECT on a set of keys?

Simon Slavin-3
On 16 Sep 2019, at 5:58pm, Jens Alfke <[hidden email]> wrote:

> Experimentally, the optimizer seems to choose an index search even with the simpler query. I ran this on a test database with about 30k rows.

In case you forgot I'm just reminding you to run ANALYZE after putting your data and indexes in.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Fastest way to SELECT on a set of keys?

Keith Medcalf
In reply to this post by Jens Alfke-2

It will, but that depends how many rows there are.  

That is, the statement:  SELECT * FROM t1 WHERE id IN (1,2,3,4,5,6)

Is equivalent to

CREATE TEMPORARY TABLE keyset (key PRIMARY KEY);
INSERT OR IGNORE INTO keyset VALUES (1), (2), (3), (4), (5), (6);
SELECT * FROM t1 WHERE id IN keyset;
DROP TABLE keyset;

without the overhead of parsing the extra create/insert/drop statements.  And of course the part:

SELECT * FROM t1 WHERE id IN keyset;

is really

SELECT * FROM t1 JOIN keyset ON t1.id == keyset.key;

which is just sugar for:

SELECT * FROM t1, keyset where t1.id == keyset.key;

That means that the query planner may place the keyset in the outer loop, or it may place t1 in the outer loop, depending on which is "better" and what else the query is doing.  In other words the list (...) just becomes an index containing the values that is subsequently treated as a table (and for that purpose the list (1,3,5,7,9) is the same as list (9,1,7,3,5,1,9,3,7) -- since it is sorted and unique).  Note that the list may contain NULLs but they are ignored.
 
--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users <[hidden email]> On
>Behalf Of Jens Alfke
>Sent: Monday, 16 September, 2019 10:58
>To: SQLite mailing list <[hidden email]>
>Subject: Re: [sqlite] [EXTERNAL] Fastest way to SELECT on a set of keys?
>
>
>
>> On Sep 13, 2019, at 10:57 AM, Hick Gunter <[hidden email]> wrote:
>>
>> This is faster if the number of keys in the list is small relative to
>the number of records in the table.
>> If the number of keys is similar to the number of records in the table,
>then a simple full table scan may be faster.
>
>Experimentally, the optimizer seems to choose an index search even with
>the simpler query. I ran this on a test database with about 30k rows.
>
>> explain query plan select * from kv_default where key in ('a','b','c')
>
>3|0|0|SEARCH TABLE kv_default USING INDEX sqlite_autoindex_kv_default_1
>(key=?)
>
>—Jens
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Fastest way to SELECT on a set of keys?

Jose Isaias Cabrera-4

Keith Medcalf, on Monday, September 16, 2019 01:33 PM, wrote...

>
> It will, but that depends how many rows there are.
>
> That is, the statement:  SELECT * FROM t1 WHERE id IN (1,2,3,4,5,6)
>
> Is equivalent to
>
> CREATE TEMPORARY TABLE keyset (key PRIMARY KEY);
> INSERT OR IGNORE INTO keyset VALUES (1), (2), (3), (4), (5), (6);
> SELECT * FROM t1 WHERE id IN keyset;
> DROP TABLE keyset;
>
> without the overhead of parsing the extra create/insert/drop statements.
>  And of course the part:
>
> SELECT * FROM t1 WHERE id IN keyset;
>
> is really
>
> SELECT * FROM t1 JOIN keyset ON t1.id == keyset.key;
>
> which is just sugar for:
>
> SELECT * FROM t1, keyset where t1.id == keyset.key;
>
> That means that the query planner may place the keyset in the outer loop,
> or it may place t1 in the outer loop, depending on which is "better" and
> what else the query is doing.  In other words the list (...) just becomes
> an index containing the values that is subsequently treated as a table (and
> for that purpose the list (1,3,5,7,9) is the same as list
> (9,1,7,3,5,1,9,3,7) -- since it is sorted and unique).  Note that the list
> may contain NULLs but they are ignored.

Wow!  I learned a tone in this beauty!  Saving it! :-)

josé
Thanks
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Fastest way to SELECT on a set of keys?

E.Pasma
In reply to this post by Keith Medcalf
Stop stop stop

> create table x
> (
>    id      integer primay key,
>    data    blob
> );

I did not see this until searching for the word PRIMARY and not finding it. Thus id is not a primary key at all. Probably it is a good habit to always add WITHOUT ROWID when there is an explicit primary key. The SQL parser would then have reported an error.  

The tests with individual rows must definitely be repeated.
 
In my tests the results are closer together:

Method 1: Retrieve Individual Row 00:00:00.081151 10000
Method 3: using dynamic in        00:00:00.060368 9995
Method 5: using in carray         00:00:00.050884 9995
Method 5: using carray join       00:00:00.043127 10000
Method 6: Using temp table        00:00:00.060808 9995

(for oarameter = 10000)

I tuned the Python script, using fetchone() in the individual row test.
And I added a temp table test. In Python this just uses executemany() to insert the rowset.

Thanks, E. Pasma
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Fastest way to SELECT on a set of keys?

Keith Medcalf

On Monday, 16 September, 2019 14:22, E.Pasma <[hidden email]> wrote:

>Stop stop stop

>> create table x
>> (
>>    id      integer primay key,
>>    data    blob
>> );

>I did not see this until searching for the word PRIMARY and not finding
>it. Thus id is not a primary key at all. Probably it is a good habit to
>always add WITHOUT ROWID when there is an explicit primary key. The SQL
>parser would then have reported an error.

>The tests with individual rows must definitely be repeated.

>In my tests the results are closer together:

>Method 1: Retrieve Individual Row 00:00:00.081151 10000
>Method 3: using dynamic in        00:00:00.060368 9995
>Method 5: using in carray         00:00:00.050884 9995
>Method 5: using carray join       00:00:00.043127 10000
>Method 6: Using temp table        00:00:00.060808 9995

>(for oarameter = 10000)

>I tuned the Python script, using fetchone() in the individual row test.
>And I added a temp table test. In Python this just uses executemany() to
>insert the rowset.

You are right.  What a difference a spelling error makes ... No wonder it took so long as it was doing table scans -- and the optimizer was doing a jolly job in the other cases in dealing with it.

Note that the sqlite3 wrapper cannot do .executemany() with SELECT statements ... but it will do them with INSERT statements.  Nevertheless, the results are reasonably similar to these obtained with APSW ...

>st 1
Creating db and sample keys: 1000000 rows; 1 keys
Method 1: Individual Row          00:00:00.000150
Method 2: Individual Row (Sorted) 00:00:00.000079
Method 3: Rows with ExecMany      00:00:00.000063
Method 3: Rows with ExecMany Sort 00:00:00.000061
Method 4: Using IN temp           00:00:00.000323
Method 5: Using IN temp (sorted)  00:00:00.000266
Method 6: Using IN temp no rowid  00:00:00.000284
Method 7: Using IN (dynamic)      00:00:00.000078
Method 8: Using IN (sorted)       00:00:00.000061
Method 9: Using IN CArray         00:00:00.000195
Method A: Using IN CArray sorted  00:00:00.000075

>st 10
Creating db and sample keys: 1000000 rows; 10 keys
Method 1: Individual Row          00:00:00.000366
Method 2: Individual Row (Sorted) 00:00:00.000309
Method 3: Rows with ExecMany      00:00:00.000293
Method 3: Rows with ExecMany Sort 00:00:00.000305
Method 4: Using IN temp           00:00:00.000418
Method 5: Using IN temp (sorted)  00:00:00.000359
Method 6: Using IN temp no rowid  00:00:00.000430
Method 7: Using IN (dynamic)      00:00:00.000147
Method 8: Using IN (sorted)       00:00:00.000141
Method 9: Using IN CArray         00:00:00.000372
Method A: Using IN CArray sorted  00:00:00.000132

>st 100
Creating db and sample keys: 1000000 rows; 100 keys
Method 1: Individual Row          00:00:00.002314
Method 2: Individual Row (Sorted) 00:00:00.002144
Method 3: Rows with ExecMany      00:00:00.002010
Method 3: Rows with ExecMany Sort 00:00:00.001807
Method 4: Using IN temp           00:00:00.001042
Method 5: Using IN temp (sorted)  00:00:00.000963
Method 6: Using IN temp no rowid  00:00:00.001004
Method 7: Using IN (dynamic)      00:00:00.000573
Method 8: Using IN (sorted)       00:00:00.000548
Method 9: Using IN CArray         00:00:00.000671
Method A: Using IN CArray sorted  00:00:00.000588

>st 1000
Creating db and sample keys: 1000000 rows; 1000 keys
Method 1: Individual Row          00:00:00.019247
Method 2: Individual Row (Sorted) 00:00:00.017748
Method 3: Rows with ExecMany      00:00:00.016084
Method 3: Rows with ExecMany Sort 00:00:00.015766
Method 4: Using IN temp           00:00:00.007528
Method 5: Using IN temp (sorted)  00:00:00.007821
Method 6: Using IN temp no rowid  00:00:00.007600
Method 7: Using IN (dynamic)      00:00:00.005317
Method 8: Using IN (sorted)       00:00:00.004884
Method 9: Using IN CArray         00:00:00.005081
Method A: Using IN CArray sorted  00:00:00.005190

>st 10000
Creating db and sample keys: 1000000 rows; 10000 keys
Method 1: Individual Row          00:00:00.178937
Method 2: Individual Row (Sorted) 00:00:00.180979
Method 3: Rows with ExecMany      00:00:00.165302
Method 3: Rows with ExecMany Sort 00:00:00.163846
Method 4: Using IN temp           00:00:00.076111
Method 5: Using IN temp (sorted)  00:00:00.076974
Method 6: Using IN temp no rowid  00:00:00.077122
Method 7: Using IN (dynamic)      00:00:00.049132
Method 8: Using IN (sorted)       00:00:00.050656
Method 9: Using IN CArray         00:00:00.052837
Method A: Using IN CArray sorted  00:00:00.050192

>st 100000
Creating db and sample keys: 1000000 rows; 100000 keys
Method 1: Individual Row          00:00:01.777458
Method 2: Individual Row (Sorted) 00:00:01.708890
Method 3: Rows with ExecMany      00:00:01.676193
Method 3: Rows with ExecMany Sort 00:00:01.639589
Method 4: Using IN temp           00:00:00.756932
Method 5: Using IN temp (sorted)  00:00:00.742670
Method 6: Using IN temp no rowid  00:00:00.786706
Method 7: Using IN (dynamic)      00:00:00.504242
Method 8: Using IN (sorted)       00:00:00.503634
Method 9: Using IN CArray         00:00:00.546801
Method A: Using IN CArray sorted  00:00:00.510597

>st 1000000
Creating db and sample keys: 1000000 rows; 1000000 keys
Method 1: Individual Row          00:00:17.748779
Method 2: Individual Row (Sorted) 00:00:17.859159
Method 3: Rows with ExecMany      00:00:16.959563
Method 3: Rows with ExecMany Sort 00:00:16.261775
Method 4: Using IN temp           00:00:07.299388
Method 5: Using IN temp (sorted)  00:00:07.380643
Method 6: Using IN temp no rowid  00:00:07.543117
Method 7: Using IN (dynamic)      00:00:05.441127
Method 8: Using IN (sorted)       00:00:05.403368
Method 9: Using IN CArray         00:00:05.326564
Method A: Using IN CArray sorted  00:00:04.763342


Using this code:

#! python3

import apsw
import datetime
import random
import sqlite3
import sys
import time

datasize = 1000000
rows = int(sys.argv[1])

elapsed = lambda st, et: datetime.datetime.utcfromtimestamp((et - st)).time()
tuplize = lambda x: (x,)

db = apsw.Connection(':memory:')
#db = sqlite3.connect(':memory:', isolation_level=None)

print('Creating db and sample keys:', end=' ', flush=True)
db.executescript('''
create table x
(
    id      integer primary key,
    data    blob
);
insert into x
with a(x) as (
        select 1
     union all
        select x + 1
          from a
         where x < %d
             )
select x, randomblob(30)
  from a;
analyze;
''' % (datasize,))
print(db.execute('select count(*) from x').fetchone()[0], 'rows;', end=' ')

rowset = [i for i in range(datasize)]
random.shuffle(rowset)
rowset = rowset[:rows]
print(len(rowset), 'keys')

print('Method 1: Individual Row         ', end=' ', flush=True)
st = time.time()
db.executescript('BEGIN')
for key in rowset:
    row = db.execute('select * from x where id=?', (key,)).fetchone()
db.commit()
print(elapsed(st, time.time()))

print('Method 2: Individual Row (Sorted)', end=' ', flush=True)
st = time.time()
db.executescript('BEGIN')
for key in sorted(rowset):
    row = db.execute('select * from x where id=?', (key,)).fetchone()
db.commit()
print(elapsed(st, time.time()))

print('Method 3: Rows with ExecMany     ', end=' ', flush=True)
st = time.time()
db.executescript('BEGIN')
for row in db.executemany('select * from x where id=?', list(map(tuplize, rowset))):
    pass
db.commit()
print(elapsed(st, time.time()))

print('Method 3: Rows with ExecMany Sort', end=' ', flush=True)
st = time.time()
db.executescript('BEGIN')
for row in db.executemany('select * from x where id=?', list(map(tuplize, sorted(rowset)))):
    pass
db.commit()
print(elapsed(st, time.time()))

print('Method 4: Using IN temp          ', end=' ', flush=True)
st = time.time()
db.executescript('BEGIN')
db.executescript('create temporary table keys (key)')
db.executemany('insert into keys values (?)', list(map(tuplize, sorted(rowset))))
for row in db.execute('select * from x where id in temp.keys'):
    pass
db.executescript('drop table temp.keys')
db.commit()
print(elapsed(st, time.time()))

print('Method 5: Using IN temp (sorted) ', end=' ', flush=True)
st = time.time()
db.executescript('BEGIN')
db.executescript('create temporary table keys (key)')
db.executemany('insert into keys values (?)', list(map(tuplize, sorted(rowset))))
for row in db.execute('select * from x where id in temp.keys'):
    pass
db.executescript('drop table temp.keys')
db.commit()
print(elapsed(st, time.time()))

print('Method 6: Using IN temp no rowid ', end=' ', flush=True)
st = time.time()
db.executescript('BEGIN')
db.executescript('create temporary table keys (key primary key) without rowid')
db.executemany('insert or ignore into keys values (?)', list(map(tuplize, sorted(rowset))))
for row in db.execute('select * from x where id in temp.keys'):
    pass
db.executescript('drop table temp.keys')
db.commit()
print(elapsed(st, time.time()))

print('Method 7: Using IN (dynamic)     ', end=' ', flush=True)
st = time.time()
for row in db.execute('select * from x where id in (' + ','.join(map(str, rowset)) + ')'):
    pass
print(elapsed(st, time.time()))

print('Method 8: Using IN (sorted)      ', end=' ', flush=True)
st = time.time()
for row in db.execute('select * from x where id in (' + ','.join(map(str, sorted(rowset))) + ')'):
    pass
print(elapsed(st, time.time()))

print('Method 9: Using IN CArray        ', end=' ', flush=True)
st = time.time()
for row in db.execute('select * from x where id in carray(:l_address, :l_length, :l_type)', apsw.carray('l', rowset)):
    pass
print(elapsed(st, time.time()))

print('Method A: Using IN CArray sorted ', end=' ', flush=True)
st = time.time()
for row in db.execute('select * from x where id in carray(:l_address, :l_length, :l_type)', apsw.carray('l', sorted(rowset))):
    pass
print(elapsed(st, time.time()))

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.




_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Fastest way to SELECT on a set of keys?

wmertens
In reply to this post by Jens Alfke-2
On Fri, Sep 13, 2019 at 6:38 PM Jens Alfke <[hidden email]> wrote:

> (b) Execute "SELECT key, … FROM table WHERE key IN (…)", including all of
> the key strings.
>
> If I do (b), SQLite has less setup work to do, and it could potentially
> optimize the b-tree lookup. On the downside, I have to prepare a statement
> every time since the RHS of an "IN" isn't substitutable.


I solved the substitutability with the JSON1 extension, which does require
JSON-encoding your values but is presumably easier to set up than carray:

    ...WHERE key IN (SELECT value FROM json_each(?)))

Works great.

Wout.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Fastest way to SELECT on a set of keys?

E.Pasma
In reply to this post by Keith Medcalf
>
> Op 17 sep. 2019, om 04:26 heeft Keith Medcalf <[hidden email]> het volgende geschreven:
>
>
> On Monday, 16 September, 2019 14:22, E.Pasma <[hidden email]> wrote:
>
>> Stop stop stop
>
> You are right.  What a difference a spelling error makes ... No wonder it took so long as it was doing table scans -- and the optimizer was doing a jolly job in the other cases in dealing with it.
>
> Note that the sqlite3 wrapper cannot do .executemany() with SELECT statements ... but it will do them with INSERT statements.  Nevertheless, the results are reasonably similar to these obtained with APSW ...
>
...

>> st 1000
> Creating db and sample keys: 1000000 rows; 1000 keys
> Method 1: Individual Row          00:00:00.019247
> Method 2: Individual Row (Sorted) 00:00:00.017748
> Method 3: Rows with ExecMany      00:00:00.016084
> Method 3: Rows with ExecMany Sort 00:00:00.015766
> Method 4: Using IN temp           00:00:00.007528
> Method 5: Using IN temp (sorted)  00:00:00.007821
> Method 6: Using IN temp no rowid  00:00:00.007600
> Method 7: Using IN (dynamic)      00:00:00.005317
> Method 8: Using IN (sorted)       00:00:00.004884
> Method 9: Using IN CArray         00:00:00.005081
> Method A: Using IN CArray sorted  00:00:00.005190
..

> Using this code:
>
> #! python3
>
> import apsw
> import datetime
> import random
> import sqlite3
> import sys
> import time
>
> datasize = 1000000
> rows = int(sys.argv[1])
>
> elapsed = lambda st, et: datetime.datetime.utcfromtimestamp((et - st)).time()
> tuplize = lambda x: (x,)
>
> db = apsw.Connection(':memory:')
> #db = sqlite3.connect(':memory:', isolation_level=None)
>
> print('Creating db and sample keys:', end=' ', flush=True)
> db.executescript('''
> create table x
> (
>    id      integer primary key,
>    data    blob
> );
> insert into x
> with a(x) as (
>        select 1
>     union all
>        select x + 1
>          from a
>         where x < %d
>             )
> select x, randomblob(30)
>  from a;
> analyze;
> ''' % (datasize,))
> print(db.execute('select count(*) from x').fetchone()[0], 'rows;', end=' ')
>
> rowset = [i for i in range(datasize)]
> random.shuffle(rowset)
> rowset = rowset[:rows]
> print(len(rowset), 'keys')
>
> print('Method 1: Individual Row         ', end=' ', flush=True)
> st = time.time()
> db.executescript('BEGIN')
> for key in rowset:
>    row = db.execute('select * from x where id=?', (key,)).fetchone()
> db.commit()
> print(elapsed(st, time.time()))
>
> print('Method 2: Individual Row (Sorted)', end=' ', flush=True)
> st = time.time()
> db.executescript('BEGIN')
> for key in sorted(rowset):
>    row = db.execute('select * from x where id=?', (key,)).fetchone()
> db.commit()
> print(elapsed(st, time.time()))
>
> print('Method 3: Rows with ExecMany     ', end=' ', flush=True)
> st = time.time()
> db.executescript('BEGIN')
> for row in db.executemany('select * from x where id=?', list(map(tuplize, rowset))):
>    pass
> db.commit()
> print(elapsed(st, time.time()))
>
> print('Method 3: Rows with ExecMany Sort', end=' ', flush=True)
> st = time.time()
> db.executescript('BEGIN')
> for row in db.executemany('select * from x where id=?', list(map(tuplize, sorted(rowset)))):
>    pass
> db.commit()
> print(elapsed(st, time.time()))
>
> print('Method 4: Using IN temp          ', end=' ', flush=True)
> st = time.time()
> db.executescript('BEGIN')
> db.executescript('create temporary table keys (key)')
> db.executemany('insert into keys values (?)', list(map(tuplize, sorted(rowset))))
> for row in db.execute('select * from x where id in temp.keys'):
>    pass
> db.executescript('drop table temp.keys')
> db.commit()
> print(elapsed(st, time.time()))
>
> print('Method 5: Using IN temp (sorted) ', end=' ', flush=True)
> st = time.time()
> db.executescript('BEGIN')
> db.executescript('create temporary table keys (key)')
> db.executemany('insert into keys values (?)', list(map(tuplize, sorted(rowset))))
> for row in db.execute('select * from x where id in temp.keys'):
>    pass
> db.executescript('drop table temp.keys')
> db.commit()
> print(elapsed(st, time.time()))
>
> print('Method 6: Using IN temp no rowid ', end=' ', flush=True)
> st = time.time()
> db.executescript('BEGIN')
> db.executescript('create temporary table keys (key primary key) without rowid')
> db.executemany('insert or ignore into keys values (?)', list(map(tuplize, sorted(rowset))))
> for row in db.execute('select * from x where id in temp.keys'):
>    pass
> db.executescript('drop table temp.keys')
> db.commit()
> print(elapsed(st, time.time()))
>
> print('Method 7: Using IN (dynamic)     ', end=' ', flush=True)
> st = time.time()
> for row in db.execute('select * from x where id in (' + ','.join(map(str, rowset)) + ')'):
>    pass
> print(elapsed(st, time.time()))
>
> print('Method 8: Using IN (sorted)      ', end=' ', flush=True)
> st = time.time()
> for row in db.execute('select * from x where id in (' + ','.join(map(str, sorted(rowset))) + ')'):
>    pass
> print(elapsed(st, time.time()))
>
> print('Method 9: Using IN CArray        ', end=' ', flush=True)
> st = time.time()
> for row in db.execute('select * from x where id in carray(:l_address, :l_length, :l_type)', apsw.carray('l', rowset)):
>    pass
> print(elapsed(st, time.time()))
>
> print('Method A: Using IN CArray sorted ', end=' ', flush=True)
> st = time.time()
> for row in db.execute('select * from x where id in carray(:l_address, :l_length, :l_type)', apsw.carray('l', sorted(rowset))):
>    pass
> print(elapsed(st, time.time()))
>
> --
> The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.
>
I edited this second script to use plain apsw.
In the vi editor:
%s/db.execute/db.cursor().execute/
%s/executescript/execute/
%s/db.commit()/db.cursor().execute("COMMIT")/
/Method 9
.,$d

(the carray tests are left out)
My test output for 1000 keys is:
$ python3 keith2b.py 1000
Creating db and sample keys: 1000000 rows; 1000 keys
Method 1: Individual Row          00:00:00.003748
Method 2: Individual Row (Sorted) 00:00:00.003545
Method 3: Rows with ExecMany      00:00:00.003300
Method 3: Rows with ExecMany Sort 00:00:00.003088
Method 4: Using IN temp           00:00:00.003838
Method 5: Using IN temp (sorted)  00:00:00.003850
Method 6: Using IN temp no rowid  00:00:00.003941
Method 7: Using IN (dynamic)      00:00:00.003276
Method 8: Using IN (sorted)       00:00:00.003223

This is much different as in your output.
- the test is two times faster here (on a moderate system)
- there is no substantial difference any longer between individual tests (excluded carray)

Any idea?

Thanks for the inviting tests.
E.Pasma
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Fastest way to SELECT on a set of keys?

Keith Medcalf
>I edited this second script to use plain apsw.
>In the vi editor:
>%s/db.execute/db.cursor().execute/
>%s/executescript/execute/
>%s/db.commit()/db.cursor().execute("COMMIT")/
>/Method 9
>.,$d
>
>(the carray tests are left out)
>My test output for 1000 keys is:
>$ python3 keith2b.py 1000
>Creating db and sample keys: 1000000 rows; 1000 keys
>Method 1: Individual Row          00:00:00.003748
>Method 2: Individual Row (Sorted) 00:00:00.003545
>Method 3: Rows with ExecMany      00:00:00.003300
>Method 3: Rows with ExecMany Sort 00:00:00.003088
>Method 4: Using IN temp           00:00:00.003838
>Method 5: Using IN temp (sorted)  00:00:00.003850
>Method 6: Using IN temp no rowid  00:00:00.003941
>Method 7: Using IN (dynamic)      00:00:00.003276
>Method 8: Using IN (sorted)       00:00:00.003223
>
>This is much different as in your output.
>- the test is two times faster here (on a moderate system)
>- there is no substantial difference any longer between individual tests
>(excluded carray)
>
>Any idea?
>
>Thanks for the inviting tests.

I can actually.  

My apsw.Connection class is actually a python class that inherits from the real apsw.Connection class so that I can add a bunch of extra's to it.  For example, the apsw.Connection.execute method(s) are not direct delegates to apsw.Connection.Cursor().execute method -- it also does some scanning of the bind parameters so that it can handle datetime objects conversion to text.  Similarly I have an active exec tracer and row tracer so that output tuples are converted to objects and datetime data can be converted to real datetime objects.

For the most part these overheads are static per execute call and per row returned so mostly cancel themselves out (but make everything somewhat slower).  It is also why the execution of many select's degrades so quickly -- the setup time is per select -- and the executemany degrades slower (it is only calling back on the exe tracer for each statement rather than doing the whole overhead for each statement).  The row tracer overhead is constant since the actual number of rows returned is constant on each set.

Some other discrepancies are apparently native to the differences between running on Linux vs Windows 10 -- Windows 10 seems to be somewhat less deterministic in its scheduling and even CPython itself seems to be somewhat different between.

So, I modified my delegate slightly so that if the exec/row tracer is unhooked then the bind parameters are not scanned either (but there is still a bit of overhead per statement executed due to the delegation).  Here are the results:

Full exec and row hook processing with bind parameter scanning:

1000000 rows; 1000 keys; 0.100000%
Method 1: Individual Row          00:00:00.018191   54971 rps external order
Method 2: Individual Row (Sorted) 00:00:00.018101   55244 rps external order (sorted)
Method 3: Rows ExecuteMany        00:00:00.016854   59332 rps external order
Method 3: Rows ExecuteMany Sorted 00:00:00.016159   61885 rps external order (sorted)
Method 4: Using IN temp           00:00:00.007869  127088 rps order by id
Method 5: Using IN temp (sorted)  00:00:00.008125  123083 rps order by id
Method 6: Using IN keyset         00:00:00.009833  101697 rps order by id
Method 7: Using IN keyset sorted  00:00:00.008166  122461 rps order by id
Method 8: Using IN (dynamic)      00:00:00.004820  207474 rps order by id
Method 9: Using IN (sorted)       00:00:00.005196  192452 rps order by id
Method A: Using IN CArray         00:00:00.005440  183815 rps order by id
Method B: Using IN CArray sorted  00:00:00.005891  169741 rps order by id

No row or exec tracers, bind parameter scanning bypassed:

1000000 rows; 1000 keys; 0.100000%
Method 1: Individual Row          00:00:00.003435  291089 rps external order
Method 2: Individual Row (Sorted) 00:00:00.003366  297047 rps external order (sorted)
Method 3: Rows ExecuteMany        00:00:00.002942  339950 rps external order
Method 3: Rows ExecuteMany Sorted 00:00:00.002892  345807 rps external order (sorted)
Method 4: Using IN temp           00:00:00.003435  291129 rps order by id
Method 5: Using IN temp (sorted)  00:00:00.003419  292449 rps order by id
Method 6: Using IN keyset         00:00:00.003649  274083 rps order by id
Method 7: Using IN keyset sorted  00:00:00.003626  275814 rps order by id
Method 8: Using IN (dynamic)      00:00:00.002526  395950 rps order by id
Method 9: Using IN (sorted)       00:00:00.002706  369574 rps order by id
Method A: Using IN CArray         00:00:00.002902  344557 rps order by id
Method B: Using IN CArray sorted  00:00:00.002656  376508 rps order by id

No row or exec tracers, calling .cursor() methods directly (so minimize delegation processing):

1000000 rows; 1000 keys; 0.100000%
Method 1: Individual Row          00:00:00.003267  306108 rps external order
Method 2: Individual Row (Sorted) 00:00:00.003083  324310 rps external order (sorted)
Method 3: Rows ExecuteMany        00:00:00.002665  375262 rps external order
Method 3: Rows ExecuteMany Sorted 00:00:00.002772  360738 rps external order (sorted)
Method 4: Using IN temp           00:00:00.003426  291858 rps order by id
Method 5: Using IN temp (sorted)  00:00:00.003491  286457 rps order by id
Method 6: Using IN keyset         00:00:00.003555  281326 rps order by id
Method 7: Using IN keyset sorted  00:00:00.003415  292857 rps order by id
Method 8: Using IN (dynamic)      00:00:00.002914  343204 rps order by id
Method 9: Using IN (sorted)       00:00:00.002677  373557 rps order by id
Method A: Using IN CArray         00:00:00.002718  367856 rps order by id
Method B: Using IN CArray sorted  00:00:00.002936  340612 rps order by id

Using --sqlite3 (does not do executemany for select statements):

1000000 rows; 1000 keys; 0.100000%
Method 1: Individual Row          00:00:00.004060  246303 rps external order
Method 2: Individual Row (Sorted) 00:00:00.003985  250945 rps external order (sorted)
Method 3: Rows ExecuteMany
Method 3: Rows ExecuteMany Sorted
Method 4: Using IN temp           00:00:00.003985  250930 rps order by id
Method 5: Using IN temp (sorted)  00:00:00.003968  252031 rps order by id
Method 6: Using IN keyset         00:00:00.004013  249201 rps order by id
Method 7: Using IN keyset sorted  00:00:00.004247  235463 rps order by id
Method 8: Using IN (dynamic)      00:00:00.002576  388217 rps order by id
Method 9: Using IN (sorted)       00:00:00.002456  407174 rps order by id

Note that I added rps (rows per second) which is merely int(rows / et) and added "order by id" to each query that returns multiple rows so that they will all get the same results in the same order for each test (except the first 4 which will of course retrieve the rows in the order asked for since you get one row per select).  So yes, I would conclude that they are all about the same speed so it is really a question of which is the least work to implement, though overall I would probably favour the dynamic in clause format.

Note that the parameter is now the exponent of the set, so 1 is 10 records, 2 is 100 records, ... 6 is 1 million records ... and the test runs for each power of ten up to that specified ...

#! python3

import datetime
import random
import sys
import time

exponent = int(sys.argv[1])

elapsed = lambda et: datetime.datetime.utcfromtimestamp(et).time()
tuplize = lambda x: (x,)
carrray = None

if '--sqlite3' in sys.argv:
    import sqlite3
    db = sqlite3.connect(':memory:', isolation_level=None)
else:
    import apsw
    if '--unhook' in sys.argv and hasattr(apsw, 'rowunhook'):
        apsw.rowunhook()
    db = apsw.Connection(':memory:')
    if hasattr(apsw, 'carray'):
        carray = apsw.carray

datasize = 10 ** exponent

print('Creating db and sample keys:', end=' ', flush=True)
db.executescript('''
create table x
(
    id      integer primary key,
    data    blob
);
insert into x
with a(x) as (
        select 1
     union all
        select x + 1
          from a
         where x < %d
             )
select x, randomblob(30)
  from a;
analyze;
''' % (datasize,))
print(db.execute('select count(*) from x').fetchone()[0], 'rows')

masterkeys = [i for i in range(datasize)]
random.shuffle(masterkeys)

for rowexp in range(exponent + 1):
    rows = 10 ** rowexp

    rowset = masterkeys[:rows]

    print()
    print(datasize, 'rows;', rows, 'keys;', '%f%%' % (rows*100/datasize,))
    # -------------------------------------------------------------------------
    try:
        print('Method 1: Individual Row         ', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        for key in rowset:
            row = db.execute('select * from x where id=?', (key,)).fetchone()
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps external order')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 2: Individual Row (Sorted)', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        for key in sorted(rowset):
            row = db.execute('select * from x where id=?', (key,)).fetchone()
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps external order (sorted)')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 3: Rows ExecuteMany       ', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        for row in db.executemany('select * from x where id=?', list(map(tuplize, rowset))):
            pass
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps external order')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 3: Rows ExecuteMany Sorted', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        for row in db.executemany('select * from x where id=?',
                                  list(map(tuplize, sorted(rowset)))):
            pass
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps external order (sorted)')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 4: Using IN temp          ', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        db.executescript('create temporary table keys (key)')
        db.executemany('insert into keys values (?)', list(map(tuplize, rowset)))
        for row in db.execute('select * from x where id in temp.keys order by id'):
            pass
        db.executescript('drop table temp.keys')
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 5: Using IN temp (sorted) ', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        db.executescript('create temporary table keys (key)')
        db.executemany('insert into keys values (?)', list(map(tuplize, sorted(rowset))))
        for row in db.execute('select * from x where id in temp.keys order by id'):
            pass
        db.executescript('drop table temp.keys')
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 6: Using IN keyset        ', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        db.executescript('create temporary table keys (key primary key) without rowid')
        db.executemany('insert or ignore into keys values (?)', list(map(tuplize, rowset)))
        for row in db.execute('select * from x where id in temp.keys order by id'):
            pass
        db.executescript('drop table temp.keys')
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 7: Using IN keyset sorted ', end=' ', flush=True)
        st = time.time()
        db.executescript('BEGIN')
        db.executescript('create temporary table keys (key primary key) without rowid')
        db.executemany('insert or ignore into keys values (?)', list(map(tuplize, rowset)))
        for row in db.execute('select * from x where id in temp.keys order by id'):
            pass
        db.executescript('drop table temp.keys')
        db.commit()
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 8: Using IN (dynamic)     ', end=' ', flush=True)
        st = time.time()
        for row in db.execute('select * from x where id in (' + ','.join(map(str, rowset)) + ') order by id'):
            pass
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
    except:
        print()
    # -------------------------------------------------------------------------
    try:
        print('Method 9: Using IN (sorted)      ', end=' ', flush=True)
        st = time.time()
        for row in db.execute('select * from x where id in (' + ','.join(map(str, sorted(rowset))) + ') order by id'):
            pass
        et = time.time() - st
        print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
    except:
        print()

    if carray:
        # -------------------------------------------------------------------------
        try:
            print('Method A: Using IN CArray        ', end=' ', flush=True)
            st = time.time()
            for row in db.execute('select * from x where id in carray(:l_address, :l_length, :l_type) order by id', carray('l', rowset)):
                pass
            et = time.time() - st
            print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
        except:
            print()
        # -------------------------------------------------------------------------
        try:
            print('Method B: Using IN CArray sorted ', end=' ', flush=True)
            st = time.time()
            for row in db.execute('select * from x where id in carray(:l_address, :l_length, :l_type) order by id', carray('l', sorted(rowset))):
                pass
            et = time.time() - st
            print(elapsed(et), '%7d' % (rows / et,), 'rps order by id')
        except:
            print()
        # -------------------------------------------------------------------------

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Fastest way to SELECT on a set of keys?

E.Pasma
In reply to this post by Keith Medcalf
Keith,

The final script produces corresponding results here, only a constant factor slower (minimal CPU). The rows per second is useful to summarize the tests for various keyset sizes. Below is the average per method with input parameter 5.

meth|rps|note
1|149431|Individual Row        
2|195447|Individual Row (Sorted)
3|167740|Rows ExecuteMany      
3|167740|Rows ExecuteMany Sorted
4|146503|Using IN temp          
5|149261|Using IN temp (sorted)
6|137831|Using IN keyset
7|136984|Using IN keyset sorted
8|170922|Using IN (dynamic)
9|188759|Using IN (sorted)      
A|242761|Using IN CArray        
B|274883|Using IN CArray sorted
C|308547|Using Array JOIN sorted

Hope this is useful to the original poster.

To me SQLite-Python is almost addicting. I learned to use carray now. It appears to interface brillantly with Python's array module. Only I have a custom carray instead of a custom execute method as you APSW (replaced sqlite3_bind_pointer by sqlite3_value_int64, for home use only).

Method C is a JOIN to carray, where the keys are sorted. Order by is not needed then.
       select x.* from carray(?,?,'int64') cross join x on id=value
             

Thanks, E. Pasma
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users