Slow joining of tables with indexes

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Slow joining of tables with indexes

Jonathan Moules-3
Hi List,
I have a relational table setup where I've built indexes but I'm still
seeing very slow join times on middling amounts of data. I'm guessing
I'm doing something wrong but I can't see what. (SQLite: 3.24.0)

Simplified schema as below.
The ids are 16 character hex strings. I've included the ignore_me table
only because it's relevant to the indexes.
Note: I can *guarantee* that the data inserted into `data_table` and
`ignore_me` is ordered by their respective primary keys ASC. Entries in
joining_table are ordered by one of either data_id ASC or ignored_id ASC
depending on creation method.

--==================================

-- 1.7 million items
CREATE TABLE data_table (
     data_id                    TEXT     PRIMARY KEY
                                         NOT NULL
                                         COLLATE NOCASE,
     data_1                        TEXT,
     data_2                         TEXT
);

-- 1.9 million items
CREATE TABLE joining_table (
     data_id         TEXT     REFERENCES data_table (data_id)
                             NOT NULL
                             COLLATE NOCASE,
     ignored_id         TEXT    REFERENCES ignore_me (ignored_id)
                             NOT NULL
                             COLLATE NOCASE,
     misc_col_1        TEXT,
     misc_col_2        TEXT
);

-- ~200,000 items
CREATE TABLE ignore_me (
     ignored_id                    TEXT     PRIMARY KEY
                                         NOT NULL
                                         COLLATE NOCASE
);

CREATE INDEX IF NOT EXISTS data_table__data_id__pk_idx ON data_table (
     data_id
);
CREATE INDEX IF NOT EXISTS ignore_me__ignored_id__pk_idx ON ignore_me (
     ignored_id
);

-- Allow quick joining from data_table to ignore_me
CREATE INDEX IF NOT EXISTS joining_table__data_ignored_id__fk_idx ON
joining_table (
     data_id ASC,
     ignored_id ASC
);
-- Allow quick joining from ignore_me to data_table
CREATE INDEX IF NOT EXISTS joining_table__ignored_data_id__fk_idx ON
joining_table (
     ignored_id ASC,
     data_id ASC
);

-- Example data:

INSERT INTO data_table (data_id) VALUES ('00196a21e8c0f9f6');
INSERT INTO data_table (data_id) VALUES ('579c57f1268c0f5c');

INSERT INTO ignore_me VALUES ('c402eb3f05d433f2');
INSERT INTO ignore_me VALUES ('d827e58953265f63');
INSERT INTO ignore_me VALUES ('ec1d2e817f55b249');

INSERT INTO joining_table (data_id, ignored_id) VALUES
('00196a21e8c0f9f6', 'c402eb3f05d433f2');
INSERT INTO joining_table (data_id, ignored_id) VALUES
('00196a21e8c0f9f6', 'd827e58953265f63');
INSERT INTO joining_table (data_id, ignored_id) VALUES
('579c57f1268c0f5c', 'ec1d2e817f55b249');


--------------------

-- Then to test the speed I'm simply doing:
     SELECT
         count(1)
     FROM
         data_table
         JOIN joining_table USING (data_id);

--==================================

The query plan says it's using the indexes:
     SCAN TABLE joining_table USING COVERING INDEX
joining_table__ignored_data_id__fk_idx
     SEARCH TABLE data_table USING COVERING INDEX
data_table__data_id__pk_idx (data_id=?)

But it takes about 20 seconds to do that count on the full dataset.

The full EXPLAIN from the full dataset:

0    Init    0    16    0        00
1    Null    0    1    1        00
2    OpenRead    2    771875    0    k(3,NOCASE,NOCASE,)    00
3    OpenRead    3    737715    0    k(2,NOCASE,)    02
4    Rewind    2    12    2    0    00
5    Column    2    1    2        00
6    SeekGE    3    11    2    1    00
7    IdxGT    3    11    2    1    00
8    Integer    1    3    0        00
9    AggStep0    0    3    1    count(1)    01
10    Next    3    7    1        00
11    Next    2    5    0        01
12    AggFinal    1    1    0    count(1)    00
13    Copy    1    4    0        00
14    ResultRow    4    1    0        00
15    Halt    0    0    0        00
16    Transaction    0    0    77    0    01
17    Goto    0    1    0        00

Thoughts? What (probably obvious) thing am I missing?

Thanks,
Jonathan



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Slow joining of tables with indexes

Hick Gunter
You are using text columns as primary keys and referencing them directly in foreign keys. This is probably not what you want, because it duplicates the text key. Also, with foreign keys enabled, your join is not accomplishing anything more than a direct select from joining_table, just with more effort (and circumventing the count() optimization).

SQLite uses an 64bit Integer as a rowid that uniquely identifies the row in the table. This is what you should be using as a foreign key, because it is twice as fast as using an index.

OTOH, SQLite supports WITHOUT ROWID tables, you might like to read up on those too

-----Ursprüngliche Nachricht-----
Von: sqlite-users [mailto:[hidden email]] Im Auftrag von Jonathan Moules
Gesendet: Dienstag, 26. November 2019 10:25
An: SQLite mailing list <[hidden email]>
Betreff: [EXTERNAL] [sqlite] Slow joining of tables with indexes

Hi List,
I have a relational table setup where I've built indexes but I'm still seeing very slow join times on middling amounts of data. I'm guessing I'm doing something wrong but I can't see what. (SQLite: 3.24.0)

Simplified schema as below.
The ids are 16 character hex strings. I've included the ignore_me table only because it's relevant to the indexes.
Note: I can *guarantee* that the data inserted into `data_table` and `ignore_me` is ordered by their respective primary keys ASC. Entries in joining_table are ordered by one of either data_id ASC or ignored_id ASC depending on creation method.

--==================================

-- 1.7 million items
CREATE TABLE data_table (
     data_id                    TEXT     PRIMARY KEY
                                         NOT NULL
                                         COLLATE NOCASE,
     data_1                        TEXT,
     data_2                         TEXT );

-- 1.9 million items
CREATE TABLE joining_table (
     data_id         TEXT     REFERENCES data_table (data_id)
                             NOT NULL
                             COLLATE NOCASE,
     ignored_id         TEXT    REFERENCES ignore_me (ignored_id)
                             NOT NULL
                             COLLATE NOCASE,
     misc_col_1        TEXT,
     misc_col_2        TEXT
);

-- ~200,000 items
CREATE TABLE ignore_me (
     ignored_id                    TEXT     PRIMARY KEY
                                         NOT NULL
                                         COLLATE NOCASE );

CREATE INDEX IF NOT EXISTS data_table__data_id__pk_idx ON data_table (
     data_id
);
CREATE INDEX IF NOT EXISTS ignore_me__ignored_id__pk_idx ON ignore_me (
     ignored_id
);

-- Allow quick joining from data_table to ignore_me CREATE INDEX IF NOT EXISTS joining_table__data_ignored_id__fk_idx ON joining_table (
     data_id ASC,
     ignored_id ASC
);
-- Allow quick joining from ignore_me to data_table CREATE INDEX IF NOT EXISTS joining_table__ignored_data_id__fk_idx ON joining_table (
     ignored_id ASC,
     data_id ASC
);

-- Example data:

INSERT INTO data_table (data_id) VALUES ('00196a21e8c0f9f6'); INSERT INTO data_table (data_id) VALUES ('579c57f1268c0f5c');

INSERT INTO ignore_me VALUES ('c402eb3f05d433f2'); INSERT INTO ignore_me VALUES ('d827e58953265f63'); INSERT INTO ignore_me VALUES ('ec1d2e817f55b249');

INSERT INTO joining_table (data_id, ignored_id) VALUES ('00196a21e8c0f9f6', 'c402eb3f05d433f2'); INSERT INTO joining_table (data_id, ignored_id) VALUES ('00196a21e8c0f9f6', 'd827e58953265f63'); INSERT INTO joining_table (data_id, ignored_id) VALUES ('579c57f1268c0f5c', 'ec1d2e817f55b249');


--------------------

-- Then to test the speed I'm simply doing:
     SELECT
         count(1)
     FROM
         data_table
         JOIN joining_table USING (data_id);

--==================================

The query plan says it's using the indexes:
     SCAN TABLE joining_table USING COVERING INDEX joining_table__ignored_data_id__fk_idx
     SEARCH TABLE data_table USING COVERING INDEX data_table__data_id__pk_idx (data_id=?)

But it takes about 20 seconds to do that count on the full dataset.

The full EXPLAIN from the full dataset:

0    Init    0    16    0        00
1    Null    0    1    1        00
2    OpenRead    2    771875    0    k(3,NOCASE,NOCASE,)    00
3    OpenRead    3    737715    0    k(2,NOCASE,)    02
4    Rewind    2    12    2    0    00
5    Column    2    1    2        00
6    SeekGE    3    11    2    1    00
7    IdxGT    3    11    2    1    00
8    Integer    1    3    0        00
9    AggStep0    0    3    1    count(1)    01
10    Next    3    7    1        00
11    Next    2    5    0        01
12    AggFinal    1    1    0    count(1)    00
13    Copy    1    4    0        00
14    ResultRow    4    1    0        00
15    Halt    0    0    0        00
16    Transaction    0    0    77    0    01
17    Goto    0    1    0        00

Thoughts? What (probably obvious) thing am I missing?

Thanks,
Jonathan



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___________________________________________
 Gunter Hick | Software Engineer | Scientific Games International GmbH | Klitschgasse 2-4, A-1130 Vienna | FN 157284 a, HG Wien, DVR: 0430013 | (O) +43 1 80100 - 0

May be privileged. May be confidential. Please delete if not the addressee.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Slow joining of tables with indexes

David Raymond
Not the reason for the slowdown, but note that both of these are redundant:

CREATE INDEX IF NOT EXISTS data_table__data_id__pk_idx ON data_table (
     data_id
);
CREATE INDEX IF NOT EXISTS ignore_me__ignored_id__pk_idx ON ignore_me (
     ignored_id
);

...because you declared them as the primary keys in the table creation. So you now have 2 different indexes on the exact same data for each of those.


The rest of it looks fine to me anyway, and I'm not sure why you'd be seeing such slow times. Old slow hard disk?

If you analyze and vacuum it does it get any better?

I think the CLI has something like ".scanstats on" to get a little more info, but I'm not sure how much more info it'll provide.


-----Original Message-----
From: sqlite-users <[hidden email]> On Behalf Of Hick Gunter
Sent: Tuesday, November 26, 2019 4:57 AM
To: 'SQLite mailing list' <[hidden email]>
Subject: Re: [sqlite] [EXTERNAL] Slow joining of tables with indexes

You are using text columns as primary keys and referencing them directly in foreign keys. This is probably not what you want, because it duplicates the text key. Also, with foreign keys enabled, your join is not accomplishing anything more than a direct select from joining_table, just with more effort (and circumventing the count() optimization).

SQLite uses an 64bit Integer as a rowid that uniquely identifies the row in the table. This is what you should be using as a foreign key, because it is twice as fast as using an index.

OTOH, SQLite supports WITHOUT ROWID tables, you might like to read up on those too

-----Ursprüngliche Nachricht-----
Von: sqlite-users [mailto:[hidden email]] Im Auftrag von Jonathan Moules
Gesendet: Dienstag, 26. November 2019 10:25
An: SQLite mailing list <[hidden email]>
Betreff: [EXTERNAL] [sqlite] Slow joining of tables with indexes

Hi List,
I have a relational table setup where I've built indexes but I'm still seeing very slow join times on middling amounts of data. I'm guessing I'm doing something wrong but I can't see what. (SQLite: 3.24.0)

Simplified schema as below.
The ids are 16 character hex strings. I've included the ignore_me table only because it's relevant to the indexes.
Note: I can *guarantee* that the data inserted into `data_table` and `ignore_me` is ordered by their respective primary keys ASC. Entries in joining_table are ordered by one of either data_id ASC or ignored_id ASC depending on creation method.

--==================================

-- 1.7 million items
CREATE TABLE data_table (
     data_id                    TEXT     PRIMARY KEY
                                         NOT NULL
                                         COLLATE NOCASE,
     data_1                        TEXT,
     data_2                         TEXT );

-- 1.9 million items
CREATE TABLE joining_table (
     data_id         TEXT     REFERENCES data_table (data_id)
                             NOT NULL
                             COLLATE NOCASE,
     ignored_id         TEXT    REFERENCES ignore_me (ignored_id)
                             NOT NULL
                             COLLATE NOCASE,
     misc_col_1        TEXT,
     misc_col_2        TEXT
);

-- ~200,000 items
CREATE TABLE ignore_me (
     ignored_id                    TEXT     PRIMARY KEY
                                         NOT NULL
                                         COLLATE NOCASE );

CREATE INDEX IF NOT EXISTS data_table__data_id__pk_idx ON data_table (
     data_id
);
CREATE INDEX IF NOT EXISTS ignore_me__ignored_id__pk_idx ON ignore_me (
     ignored_id
);

-- Allow quick joining from data_table to ignore_me CREATE INDEX IF NOT EXISTS joining_table__data_ignored_id__fk_idx ON joining_table (
     data_id ASC,
     ignored_id ASC
);
-- Allow quick joining from ignore_me to data_table CREATE INDEX IF NOT EXISTS joining_table__ignored_data_id__fk_idx ON joining_table (
     ignored_id ASC,
     data_id ASC
);

-- Example data:

INSERT INTO data_table (data_id) VALUES ('00196a21e8c0f9f6'); INSERT INTO data_table (data_id) VALUES ('579c57f1268c0f5c');

INSERT INTO ignore_me VALUES ('c402eb3f05d433f2'); INSERT INTO ignore_me VALUES ('d827e58953265f63'); INSERT INTO ignore_me VALUES ('ec1d2e817f55b249');

INSERT INTO joining_table (data_id, ignored_id) VALUES ('00196a21e8c0f9f6', 'c402eb3f05d433f2'); INSERT INTO joining_table (data_id, ignored_id) VALUES ('00196a21e8c0f9f6', 'd827e58953265f63'); INSERT INTO joining_table (data_id, ignored_id) VALUES ('579c57f1268c0f5c', 'ec1d2e817f55b249');


--------------------

-- Then to test the speed I'm simply doing:
     SELECT
         count(1)
     FROM
         data_table
         JOIN joining_table USING (data_id);

--==================================

The query plan says it's using the indexes:
     SCAN TABLE joining_table USING COVERING INDEX joining_table__ignored_data_id__fk_idx
     SEARCH TABLE data_table USING COVERING INDEX data_table__data_id__pk_idx (data_id=?)

But it takes about 20 seconds to do that count on the full dataset.

The full EXPLAIN from the full dataset:

0    Init    0    16    0        00
1    Null    0    1    1        00
2    OpenRead    2    771875    0    k(3,NOCASE,NOCASE,)    00
3    OpenRead    3    737715    0    k(2,NOCASE,)    02
4    Rewind    2    12    2    0    00
5    Column    2    1    2        00
6    SeekGE    3    11    2    1    00
7    IdxGT    3    11    2    1    00
8    Integer    1    3    0        00
9    AggStep0    0    3    1    count(1)    01
10    Next    3    7    1        00
11    Next    2    5    0        01
12    AggFinal    1    1    0    count(1)    00
13    Copy    1    4    0        00
14    ResultRow    4    1    0        00
15    Halt    0    0    0        00
16    Transaction    0    0    77    0    01
17    Goto    0    1    0        00

Thoughts? What (probably obvious) thing am I missing?

Thanks,
Jonathan



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___________________________________________
 Gunter Hick | Software Engineer | Scientific Games International GmbH | Klitschgasse 2-4, A-1130 Vienna | FN 157284 a, HG Wien, DVR: 0430013 | (O) +43 1 80100 - 0

May be privileged. May be confidential. Please delete if not the addressee.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Slow joining of tables with indexes

Jonathan Moules-3
Thanks for the comments. I've done some testing. Results below for those
interested.

* Unnecessary manual indexes on the Primary Key - good spot, I'd
forgotten SQLite does that!

* I was indeed using a Hard Disk but that was intentional - this is for
customers and I can't know their hardware.

* INTEGERs vs WITHOUT ROW_ID vs what I have now vs full-on 64 bit
INTEGERs - Tested below

Non-scientific Timings are below (in seconds). "HDD" = Hard drive,
otherwise it's a SSD. "Indexes" means an index built on the FK.

==========

Original structure

-- original (16 character string PK/FK) - indexes
-- 4.04
-- 4.6 (hdd)
-- 4.1
-- 4.7 (hdd)
-- 4.14
-- 5.03 (hdd)

-- original (16 character string PK/FK) - no indexes
-- 4.03
-- 5.9 (hdd)
-- 5.1
-- 11.4 (hdd)
-- 4.18
-- 9.766 (hdd)

So not much speed difference with indexes between SSD and HDD.

===

Original structure but changing to WITHOUT ROW_ID

-- original (16 character string PK/FK) - WITHOUT ROW_ID - indexes
-- 3.69
-- 2.9 (hdd)
-- 3.8
-- 5.2 (hdd)
-- 3.74
-- 5.8 (hdd)

-- original (16 character string PK/FK) - WITHOUT ROW_ID - no indexes
-- 3.45
-- 3.4 (hdd)
-- 3.4
-- 3.4 (hdd)
-- 8.47
-- 18.4 (hdd)

Curiously with the with-indexes seems to on average be slower than
without indexes for this on the HDD.

======

Auto-incrementing INTEGER as the ID and FK

-- integer_id (autoincrement INTEGER PK/FK) - indexes
-- 1.3
-- 1.21
-- 6.9 (hdd)
-- 1.2
-- 4.4 (hdd)
-- 2.45
-- 5.2 (hdd)

-- integer_id (autoincrement INTEGER PK/FK) - no indexes
-- 1.3
-- 19.3 (hdd)
-- 4.7
-- 9.1 (hdd)
-- 5.229
-- 18.98 (hdd)

no-index speeds seem to be very random.

====

The last test I did was to convert the hex strings to their 64bit
INTEGER equivalents and use those as the keys. So still using a 64bit
INTEGER as the keys, they could be anything rather than low value

So my keys are like:
-9223326038759585676
-5012230838021194131
-3961911462337065450
3423089283580538480
9221679147258515042
...

my integer (Hex to INTEGER PK/FK - negative PKs) - index
-- 2.02
-- 2.03
-- 1.9 (hdd)
-- 1.9 (hdd)
-- 6.1
-- 1.9 (hdd)

my integer (Hex to INTEGER PK/FK - negative PKs) - no indexes
-- 2.48s
-- 2.42s
-- 2.4 (hdd)
-- 2.4 (hdd)
-- 7.5
-- 20.1 (hdd)

The HDD was consistently good with these full-size 64bit keys which
surprised me. I've seen that there are some optimisations assuming
positive integers -
http://peterhansen.ca/blog/sqlite-negative-integer-primary-keys.html -
but it's odd that the HDD was better than the SSD for the most part with
these.

Also the full-size 64bit integers were a fair percentage slower than the
regular integers even though there were the exact same number of them.


Thanks again,

Jonathan


On 2019-11-26 14:40, David Raymond wrote:

> Not the reason for the slowdown, but note that both of these are redundant:
>
> CREATE INDEX IF NOT EXISTS data_table__data_id__pk_idx ON data_table (
>       data_id
> );
> CREATE INDEX IF NOT EXISTS ignore_me__ignored_id__pk_idx ON ignore_me (
>       ignored_id
> );
>
> ...because you declared them as the primary keys in the table creation. So you now have 2 different indexes on the exact same data for each of those.
>
>
> The rest of it looks fine to me anyway, and I'm not sure why you'd be seeing such slow times. Old slow hard disk?
>
> If you analyze and vacuum it does it get any better?
>
> I think the CLI has something like ".scanstats on" to get a little more info, but I'm not sure how much more info it'll provide.
>
>
> -----Original Message-----
> From: sqlite-users <[hidden email]> On Behalf Of Hick Gunter
> Sent: Tuesday, November 26, 2019 4:57 AM
> To: 'SQLite mailing list' <[hidden email]>
> Subject: Re: [sqlite] [EXTERNAL] Slow joining of tables with indexes
>
> You are using text columns as primary keys and referencing them directly in foreign keys. This is probably not what you want, because it duplicates the text key. Also, with foreign keys enabled, your join is not accomplishing anything more than a direct select from joining_table, just with more effort (and circumventing the count() optimization).
>
> SQLite uses an 64bit Integer as a rowid that uniquely identifies the row in the table. This is what you should be using as a foreign key, because it is twice as fast as using an index.
>
> OTOH, SQLite supports WITHOUT ROWID tables, you might like to read up on those too
>
> -----Ursprüngliche Nachricht-----
> Von: sqlite-users [mailto:[hidden email]] Im Auftrag von Jonathan Moules
> Gesendet: Dienstag, 26. November 2019 10:25
> An: SQLite mailing list <[hidden email]>
> Betreff: [EXTERNAL] [sqlite] Slow joining of tables with indexes
>
> Hi List,
> I have a relational table setup where I've built indexes but I'm still seeing very slow join times on middling amounts of data. I'm guessing I'm doing something wrong but I can't see what. (SQLite: 3.24.0)
>
> Simplified schema as below.
> The ids are 16 character hex strings. I've included the ignore_me table only because it's relevant to the indexes.
> Note: I can *guarantee* that the data inserted into `data_table` and `ignore_me` is ordered by their respective primary keys ASC. Entries in joining_table are ordered by one of either data_id ASC or ignored_id ASC depending on creation method.
>
> --==================================
>
> -- 1.7 million items
> CREATE TABLE data_table (
>       data_id                    TEXT     PRIMARY KEY
>                                           NOT NULL
>                                           COLLATE NOCASE,
>       data_1                        TEXT,
>       data_2                         TEXT );
>
> -- 1.9 million items
> CREATE TABLE joining_table (
>       data_id         TEXT     REFERENCES data_table (data_id)
>                               NOT NULL
>                               COLLATE NOCASE,
>       ignored_id         TEXT    REFERENCES ignore_me (ignored_id)
>                               NOT NULL
>                               COLLATE NOCASE,
>       misc_col_1        TEXT,
>       misc_col_2        TEXT
> );
>
> -- ~200,000 items
> CREATE TABLE ignore_me (
>       ignored_id                    TEXT     PRIMARY KEY
>                                           NOT NULL
>                                           COLLATE NOCASE );
>
> CREATE INDEX IF NOT EXISTS data_table__data_id__pk_idx ON data_table (
>       data_id
> );
> CREATE INDEX IF NOT EXISTS ignore_me__ignored_id__pk_idx ON ignore_me (
>       ignored_id
> );
>
> -- Allow quick joining from data_table to ignore_me CREATE INDEX IF NOT EXISTS joining_table__data_ignored_id__fk_idx ON joining_table (
>       data_id ASC,
>       ignored_id ASC
> );
> -- Allow quick joining from ignore_me to data_table CREATE INDEX IF NOT EXISTS joining_table__ignored_data_id__fk_idx ON joining_table (
>       ignored_id ASC,
>       data_id ASC
> );
>
> -- Example data:
>
> INSERT INTO data_table (data_id) VALUES ('00196a21e8c0f9f6'); INSERT INTO data_table (data_id) VALUES ('579c57f1268c0f5c');
>
> INSERT INTO ignore_me VALUES ('c402eb3f05d433f2'); INSERT INTO ignore_me VALUES ('d827e58953265f63'); INSERT INTO ignore_me VALUES ('ec1d2e817f55b249');
>
> INSERT INTO joining_table (data_id, ignored_id) VALUES ('00196a21e8c0f9f6', 'c402eb3f05d433f2'); INSERT INTO joining_table (data_id, ignored_id) VALUES ('00196a21e8c0f9f6', 'd827e58953265f63'); INSERT INTO joining_table (data_id, ignored_id) VALUES ('579c57f1268c0f5c', 'ec1d2e817f55b249');
>
>
> --------------------
>
> -- Then to test the speed I'm simply doing:
>       SELECT
>           count(1)
>       FROM
>           data_table
>           JOIN joining_table USING (data_id);
>
> --==================================
>
> The query plan says it's using the indexes:
>       SCAN TABLE joining_table USING COVERING INDEX joining_table__ignored_data_id__fk_idx
>       SEARCH TABLE data_table USING COVERING INDEX data_table__data_id__pk_idx (data_id=?)
>
> But it takes about 20 seconds to do that count on the full dataset.
>
> The full EXPLAIN from the full dataset:
>
> 0    Init    0    16    0        00
> 1    Null    0    1    1        00
> 2    OpenRead    2    771875    0    k(3,NOCASE,NOCASE,)    00
> 3    OpenRead    3    737715    0    k(2,NOCASE,)    02
> 4    Rewind    2    12    2    0    00
> 5    Column    2    1    2        00
> 6    SeekGE    3    11    2    1    00
> 7    IdxGT    3    11    2    1    00
> 8    Integer    1    3    0        00
> 9    AggStep0    0    3    1    count(1)    01
> 10    Next    3    7    1        00
> 11    Next    2    5    0        01
> 12    AggFinal    1    1    0    count(1)    00
> 13    Copy    1    4    0        00
> 14    ResultRow    4    1    0        00
> 15    Halt    0    0    0        00
> 16    Transaction    0    0    77    0    01
> 17    Goto    0    1    0        00
>
> Thoughts? What (probably obvious) thing am I missing?
>
> Thanks,
> Jonathan
>
>
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
>
> ___________________________________________
>   Gunter Hick | Software Engineer | Scientific Games International GmbH | Klitschgasse 2-4, A-1130 Vienna | FN 157284 a, HG Wien, DVR: 0430013 | (O) +43 1 80100 - 0
>
> May be privileged. May be confidential. Please delete if not the addressee.
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Slow joining of tables with indexes

Jonathan Moules-3
Hi List,

So, I've altered my structure to be INTEGER primary keys, but I'm still
seeing very slow query times when joining. The original query is faster:

       SELECT
           count(1)
       FROM
           data_table
           JOIN joining_table USING (data_id);

It takes ~2s, but if I then join on to the next table (ignore_me - only
~200,000 records), it goes up to a whopping 27s - and this is on the SSD!

SELECT

count(1)

FROM

data_table

     JOIN joining_table USING (data_id)

     JOIN ignore_me USING (ignored_id)

;

I can see it's using the indexes, but strangely the index it's using on
the ignore_me table isn't the PK index but a FK index (INTEGER) to the
next table in the sequence (not included in this schema):

5    0    0    SCAN TABLE ignore_me USING COVERING INDEX
ignored__e_id__fk_idx
7    0    0    SEARCH TABLE data_to_ignored USING COVERING INDEX
joining_table__ignored_data_id__fk_idx (s_id=?)
11    0    0    SEARCH TABLE data USING INTEGER PRIMARY KEY (rowid=?)

Any thoughts? This seems like relational-database bread-and-butter so
I'm sure I'm doing something wrong to be getting these slow speeds but I
can't see what.

Thanks,

Jonathan

=============

Schema now:

-- 1.7 million items

CREATE TABLE data_table (

data_id INTEGER PRIMARY KEY,

data_1 TEXT,

data_2 TEXT );


-- 1.9 million items

CREATE TABLE joining_table (

data_id INTEGER REFERENCES data_table (data_id),

ignored_id INTEGER REFERENCES ignore_me (ignored_id),

misc_col_1 TEXT,

misc_col_2 TEXT

);


-- ~200,000 items

CREATE TABLE ignore_me (

ignored_id INTEGER PRIMARY KEY,

ignored_col TEXT

);


-- Allow quick joining from data_table to ignore_me

CREATE INDEX IF NOT EXISTS joining_table__data_ignored_id__fk_idx ON
joining_table (

data_id ASC,

ignored_id ASC

);

-- Allow quick joining from ignore_me to data_table

CREATE INDEX IF NOT EXISTS joining_table__ignored_data_id__fk_idx ON
joining_table (

ignored_id ASC,

data_id ASC

);


-- Example data:


INSERT INTO data_table (data_id) VALUES (1); INSERT INTO data_table
(data_id) VALUES (2);


INSERT INTO ignore_me VALUES (1, 'words'); INSERT INTO ignore_me VALUES
(2, 'more words'); INSERT INTO ignore_me VALUES (3, 'yet more words');


INSERT INTO joining_table (data_id, ignored_id) VALUES (1, 1); INSERT
INTO joining_table (data_id, ignored_id) VALUES (1, 2); INSERT INTO
joining_table (data_id, ignored_id) VALUES (2, 3);



SELECT

count(1)

FROM

data_table

     JOIN joining_table USING (data_id)

     JOIN ignore_me USING (ignored_id)

;






On 2019-12-02 13:42, Jonathan Moules wrote:

> Thanks for the comments. I've done some testing. Results below for
> those interested.
>
> * Unnecessary manual indexes on the Primary Key - good spot, I'd
> forgotten SQLite does that!
>
> * I was indeed using a Hard Disk but that was intentional - this is
> for customers and I can't know their hardware.
>
> * INTEGERs vs WITHOUT ROW_ID vs what I have now vs full-on 64 bit
> INTEGERs - Tested below
>
> Non-scientific Timings are below (in seconds). "HDD" = Hard drive,
> otherwise it's a SSD. "Indexes" means an index built on the FK.
>
> ==========
>
> Original structure
>
> -- original (16 character string PK/FK) - indexes
> -- 4.04
> -- 4.6 (hdd)
> -- 4.1
> -- 4.7 (hdd)
> -- 4.14
> -- 5.03 (hdd)
>
> -- original (16 character string PK/FK) - no indexes
> -- 4.03
> -- 5.9 (hdd)
> -- 5.1
> -- 11.4 (hdd)
> -- 4.18
> -- 9.766 (hdd)
>
> So not much speed difference with indexes between SSD and HDD.
>
> ===
>
> Original structure but changing to WITHOUT ROW_ID
>
> -- original (16 character string PK/FK) - WITHOUT ROW_ID - indexes
> -- 3.69
> -- 2.9 (hdd)
> -- 3.8
> -- 5.2 (hdd)
> -- 3.74
> -- 5.8 (hdd)
>
> -- original (16 character string PK/FK) - WITHOUT ROW_ID - no indexes
> -- 3.45
> -- 3.4 (hdd)
> -- 3.4
> -- 3.4 (hdd)
> -- 8.47
> -- 18.4 (hdd)
>
> Curiously with the with-indexes seems to on average be slower than
> without indexes for this on the HDD.
>
> ======
>
> Auto-incrementing INTEGER as the ID and FK
>
> -- integer_id (autoincrement INTEGER PK/FK) - indexes
> -- 1.3
> -- 1.21
> -- 6.9 (hdd)
> -- 1.2
> -- 4.4 (hdd)
> -- 2.45
> -- 5.2 (hdd)
>
> -- integer_id (autoincrement INTEGER PK/FK) - no indexes
> -- 1.3
> -- 19.3 (hdd)
> -- 4.7
> -- 9.1 (hdd)
> -- 5.229
> -- 18.98 (hdd)
>
> no-index speeds seem to be very random.
>
> ====
>
> The last test I did was to convert the hex strings to their 64bit
> INTEGER equivalents and use those as the keys. So still using a 64bit
> INTEGER as the keys, they could be anything rather than low value
>
> So my keys are like:
> -9223326038759585676
> -5012230838021194131
> -3961911462337065450
> 3423089283580538480
> 9221679147258515042
> ...
>
> my integer (Hex to INTEGER PK/FK - negative PKs) - index
> -- 2.02
> -- 2.03
> -- 1.9 (hdd)
> -- 1.9 (hdd)
> -- 6.1
> -- 1.9 (hdd)
>
> my integer (Hex to INTEGER PK/FK - negative PKs) - no indexes
> -- 2.48s
> -- 2.42s
> -- 2.4 (hdd)
> -- 2.4 (hdd)
> -- 7.5
> -- 20.1 (hdd)
>
> The HDD was consistently good with these full-size 64bit keys which
> surprised me. I've seen that there are some optimisations assuming
> positive integers -
> http://peterhansen.ca/blog/sqlite-negative-integer-primary-keys.html -
> but it's odd that the HDD was better than the SSD for the most part
> with these.
>
> Also the full-size 64bit integers were a fair percentage slower than
> the regular integers even though there were the exact same number of
> them.
>
>
> Thanks again,
>
> Jonathan
>
>
> On 2019-11-26 14:40, David Raymond wrote:
>> Not the reason for the slowdown, but note that both of these are
>> redundant:
>>
>> CREATE INDEX IF NOT EXISTS data_table__data_id__pk_idx ON data_table (
>>       data_id
>> );
>> CREATE INDEX IF NOT EXISTS ignore_me__ignored_id__pk_idx ON ignore_me (
>>       ignored_id
>> );
>>
>> ...because you declared them as the primary keys in the table
>> creation. So you now have 2 different indexes on the exact same data
>> for each of those.
>>
>>
>> The rest of it looks fine to me anyway, and I'm not sure why you'd be
>> seeing such slow times. Old slow hard disk?
>>
>> If you analyze and vacuum it does it get any better?
>>
>> I think the CLI has something like ".scanstats on" to get a little
>> more info, but I'm not sure how much more info it'll provide.
>>
>>
>> -----Original Message-----
>> From: sqlite-users <[hidden email]> On
>> Behalf Of Hick Gunter
>> Sent: Tuesday, November 26, 2019 4:57 AM
>> To: 'SQLite mailing list' <[hidden email]>
>> Subject: Re: [sqlite] [EXTERNAL] Slow joining of tables with indexes
>>
>> You are using text columns as primary keys and referencing them
>> directly in foreign keys. This is probably not what you want, because
>> it duplicates the text key. Also, with foreign keys enabled, your
>> join is not accomplishing anything more than a direct select from
>> joining_table, just with more effort (and circumventing the count()
>> optimization).
>>
>> SQLite uses an 64bit Integer as a rowid that uniquely identifies the
>> row in the table. This is what you should be using as a foreign key,
>> because it is twice as fast as using an index.
>>
>> OTOH, SQLite supports WITHOUT ROWID tables, you might like to read up
>> on those too
>>
>> -----Ursprüngliche Nachricht-----
>> Von: sqlite-users
>> [mailto:[hidden email]] Im Auftrag von
>> Jonathan Moules
>> Gesendet: Dienstag, 26. November 2019 10:25
>> An: SQLite mailing list <[hidden email]>
>> Betreff: [EXTERNAL] [sqlite] Slow joining of tables with indexes
>>
>> Hi List,
>> I have a relational table setup where I've built indexes but I'm
>> still seeing very slow join times on middling amounts of data. I'm
>> guessing I'm doing something wrong but I can't see what. (SQLite:
>> 3.24.0)
>>
>> Simplified schema as below.
>> The ids are 16 character hex strings. I've included the ignore_me
>> table only because it's relevant to the indexes.
>> Note: I can *guarantee* that the data inserted into `data_table` and
>> `ignore_me` is ordered by their respective primary keys ASC. Entries
>> in joining_table are ordered by one of either data_id ASC or
>> ignored_id ASC depending on creation method.
>>
>> --==================================
>>
>> -- 1.7 million items
>> CREATE TABLE data_table (
>>       data_id                    TEXT     PRIMARY KEY
>>                                           NOT NULL
>>                                           COLLATE NOCASE,
>>       data_1                        TEXT,
>>       data_2                         TEXT );
>>
>> -- 1.9 million items
>> CREATE TABLE joining_table (
>>       data_id         TEXT     REFERENCES data_table (data_id)
>>                               NOT NULL
>>                               COLLATE NOCASE,
>>       ignored_id         TEXT    REFERENCES ignore_me (ignored_id)
>>                               NOT NULL
>>                               COLLATE NOCASE,
>>       misc_col_1        TEXT,
>>       misc_col_2        TEXT
>> );
>>
>> -- ~200,000 items
>> CREATE TABLE ignore_me (
>>       ignored_id                    TEXT     PRIMARY KEY
>>                                           NOT NULL
>>                                           COLLATE NOCASE );
>>
>> CREATE INDEX IF NOT EXISTS data_table__data_id__pk_idx ON data_table (
>>       data_id
>> );
>> CREATE INDEX IF NOT EXISTS ignore_me__ignored_id__pk_idx ON ignore_me (
>>       ignored_id
>> );
>>
>> -- Allow quick joining from data_table to ignore_me CREATE INDEX IF
>> NOT EXISTS joining_table__data_ignored_id__fk_idx ON joining_table (
>>       data_id ASC,
>>       ignored_id ASC
>> );
>> -- Allow quick joining from ignore_me to data_table CREATE INDEX IF
>> NOT EXISTS joining_table__ignored_data_id__fk_idx ON joining_table (
>>       ignored_id ASC,
>>       data_id ASC
>> );
>>
>> -- Example data:
>>
>> INSERT INTO data_table (data_id) VALUES ('00196a21e8c0f9f6'); INSERT
>> INTO data_table (data_id) VALUES ('579c57f1268c0f5c');
>>
>> INSERT INTO ignore_me VALUES ('c402eb3f05d433f2'); INSERT INTO
>> ignore_me VALUES ('d827e58953265f63'); INSERT INTO ignore_me VALUES
>> ('ec1d2e817f55b249');
>>
>> INSERT INTO joining_table (data_id, ignored_id) VALUES
>> ('00196a21e8c0f9f6', 'c402eb3f05d433f2'); INSERT INTO joining_table
>> (data_id, ignored_id) VALUES ('00196a21e8c0f9f6',
>> 'd827e58953265f63'); INSERT INTO joining_table (data_id, ignored_id)
>> VALUES ('579c57f1268c0f5c', 'ec1d2e817f55b249');
>>
>>
>> --------------------
>>
>> -- Then to test the speed I'm simply doing:
>>       SELECT
>>           count(1)
>>       FROM
>>           data_table
>>           JOIN joining_table USING (data_id);
>>
>> --==================================
>>
>> The query plan says it's using the indexes:
>>       SCAN TABLE joining_table USING COVERING INDEX
>> joining_table__ignored_data_id__fk_idx
>>       SEARCH TABLE data_table USING COVERING INDEX
>> data_table__data_id__pk_idx (data_id=?)
>>
>> But it takes about 20 seconds to do that count on the full dataset.
>>
>> The full EXPLAIN from the full dataset:
>>
>> 0    Init    0    16    0        00
>> 1    Null    0    1    1        00
>> 2    OpenRead    2    771875    0    k(3,NOCASE,NOCASE,)    00
>> 3    OpenRead    3    737715    0    k(2,NOCASE,)    02
>> 4    Rewind    2    12    2    0    00
>> 5    Column    2    1    2        00
>> 6    SeekGE    3    11    2    1    00
>> 7    IdxGT    3    11    2    1    00
>> 8    Integer    1    3    0        00
>> 9    AggStep0    0    3    1    count(1)    01
>> 10    Next    3    7    1        00
>> 11    Next    2    5    0        01
>> 12    AggFinal    1    1    0    count(1)    00
>> 13    Copy    1    4    0        00
>> 14    ResultRow    4    1    0        00
>> 15    Halt    0    0    0        00
>> 16    Transaction    0    0    77    0    01
>> 17    Goto    0    1    0        00
>>
>> Thoughts? What (probably obvious) thing am I missing?
>>
>> Thanks,
>> Jonathan
>>
>>
>>
>> _______________________________________________
>> sqlite-users mailing list
>> [hidden email]
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>
>>
>> ___________________________________________
>>   Gunter Hick | Software Engineer | Scientific Games International
>> GmbH | Klitschgasse 2-4, A-1130 Vienna | FN 157284 a, HG Wien, DVR:
>> 0430013 | (O) +43 1 80100 - 0
>>
>> May be privileged. May be confidential. Please delete if not the
>> addressee.
>> _______________________________________________
>> sqlite-users mailing list
>> [hidden email]
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>> _______________________________________________
>> sqlite-users mailing list
>> [hidden email]
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Slow joining of tables with indexes

Simon Slavin-3
On 3 Dec 2019, at 8:48am, Jonathan Moules <[hidden email]> wrote:

>       SELECT
>           count(1)
>       FROM
>           data_table
>           JOIN joining_table USING (data_id);

      SELECT
          count(1)
      FROM data_table
          JOIN joining_table
              ON joining_table.data_id = data_table.data_id;

Given the rest of the structure you gave, including the indexes, compare the speeds of these two.

Simon
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Slow joining of tables with indexes

Jonathan Moules-3
Thanks for the suggestion.
Changing the USING to ON makes absolutely no difference. The speed is
the same and the query plans (both EXPLAIN and EXPLAIN QUERY PLAN) are
absolutely identical. Same for if I convert it to WHERE:

WHERE joining_table.data_id = data_table.data_id;

On 2019-12-03 14:46, Simon Slavin wrote:

> On 3 Dec 2019, at 8:48am, Jonathan Moules <[hidden email]> wrote:
>
>>        SELECT
>>            count(1)
>>        FROM
>>            data_table
>>            JOIN joining_table USING (data_id);
>        SELECT
>            count(1)
>        FROM data_table
>            JOIN joining_table
>                ON joining_table.data_id = data_table.data_id;
>
> Given the rest of the structure you gave, including the indexes, compare the speeds of these two.
>
> Simon
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users