Join, Union, Subquery or what?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
27 messages Options
12
Joe
Reply | Threaded
Open this post in threaded view
|

Join, Union, Subquery or what?

Joe
Hi, all,

my SQLite database has two tables Katalog and ZKatalog with the same
structure. One of the columns  is called DDatum. What's the most
efficient way to

(1) Select records, which are only in Katalog, but not in ZKatalog?
(2) Select records, which are in Katalog and in ZKatalog?
(3) Select records, which are with same column values except DDatum in
Katalog and in ZKatalog?
(4) Select records from Katalog and ZKatalog with same DDatum content?

Thanx --  Joe
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Join, Union, Subquery or what?

Clemens Ladisch
Joe wrote:
> my SQLite database has two tables Katalog and ZKatalog with the same structure. One of the columns  is called DDatum. What's the most efficient way to
>
> (1) Select records, which are only in Katalog, but not in ZKatalog?

SELECT * FROM Katalog EXCEPT SELECT * FROM ZKatalog;

> (2) Select records, which are in Katalog and in ZKatalog?

SELECT * FROM Katalog INTERSECT SELECT * FROM ZKatalog;
SELECT * FROM Katalog UNION SELECT * FROM ZKatalog;
(depending on the exact meaning of the word "and" in your question)

> (3) Select records, which are with same column values except DDatum in Katalog and in ZKatalog?

SELECT *
FROM Katalog
WHERE the_primary_key_column IN (
  SELECT the_primary_key_column FROM (
    SELECT all,columns,except,ddatum FROM Katalog
    INTERSECT
    SELECT all,columns,except,ddatum FROM ZKatalog
  )
);

(If you do not need the DDatum values, use only the inner subquery.)

> (4) Select records from Katalog and ZKatalog with same DDatum content?

SELECT * FROM Katalog WHERE DDatum IN (SELECT DDatum FROM ZKatalog);


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Join, Union, Subquery or what?

R Smith
These suggestions from Clemens will work exactly as you want, but I need
to add that it assumes the records all perfectly match between the
tables, even flags, ID column etc.

This means, if it doesn't work as you expect, you can still use the
exact same methods but you can of course include the relevant columns
(that should or shouldn't match between tables) in stead of the * wildcard.

As an example based on Clemens' first example, the query:

SELECT DDatum FROM Katalog EXCEPT SELECT DDatum FROM ZKatalog;

will produce a list of all dates in Katalog that doesn't have a
corresponding date in ZKatalog.

Cheers,
Ryan


On 2017/09/06 1:09 PM, Clemens Ladisch wrote:

> Joe wrote:
>> my SQLite database has two tables Katalog and ZKatalog with the same structure. One of the columns  is called DDatum. What's the most efficient way to
>>
>> (1) Select records, which are only in Katalog, but not in ZKatalog?
> SELECT * FROM Katalog EXCEPT SELECT * FROM ZKatalog;
>
>> (2) Select records, which are in Katalog and in ZKatalog?
> SELECT * FROM Katalog INTERSECT SELECT * FROM ZKatalog;
> SELECT * FROM Katalog UNION SELECT * FROM ZKatalog;
> (depending on the exact meaning of the word "and" in your question)
>
>> (3) Select records, which are with same column values except DDatum in Katalog and in ZKatalog?
> SELECT *
> FROM Katalog
> WHERE the_primary_key_column IN (
>    SELECT the_primary_key_column FROM (
>      SELECT all,columns,except,ddatum FROM Katalog
>      INTERSECT
>      SELECT all,columns,except,ddatum FROM ZKatalog
>    )
> );
>
> (If you do not need the DDatum values, use only the inner subquery.)
>
>> (4) Select records from Katalog and ZKatalog with same DDatum content?
> SELECT * FROM Katalog WHERE DDatum IN (SELECT DDatum FROM ZKatalog);
>
>
> Regards,
> Clemens
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Joe
Reply | Threaded
Open this post in threaded view
|

Re: Join, Union, Subquery or what?

Joe
In reply to this post by Joe
Am 06.09.2017 um 14:32 schrieb R Smith:

> These suggestions from Clemens will work exactly as you want, but I
> need to add that it assumes the records all perfectly match between
> the tables, even flags, ID column etc.
>
> This means, if it doesn't work as you expect, you can still use the
> exact same methods but you can of course include the relevant columns
> (that should or shouldn't match between tables) in stead of the *
> wildcard.
>
> As an example based on Clemens' first example, the query:
>
> SELECT DDatum FROM Katalog EXCEPT SELECT DDatum FROM ZKatalog;
>
> will produce a list of all dates in Katalog that doesn't have a
> corresponding date in ZKatalog.
>
> Cheers,
> Ryan
>
>
> On 2017/09/06 1:09 PM, Clemens Ladisch wrote:
>> Joe wrote:
>>> my SQLite database has two tables Katalog and ZKatalog with the same
>>> structure. One of the columns  is called DDatum. What's the most
>>> efficient way to
>>>
>>> (1) Select records, which are only in Katalog, but not in ZKatalog?
>> SELECT * FROM Katalog EXCEPT SELECT * FROM ZKatalog;
>>
>>> (2) Select records, which are in Katalog and in ZKatalog?
>> SELECT * FROM Katalog INTERSECT SELECT * FROM ZKatalog;
>> SELECT * FROM Katalog UNION SELECT * FROM ZKatalog;
>> (depending on the exact meaning of the word "and" in your question)
>>
>>> (3) Select records, which are with same column values except DDatum
>>> in Katalog and in ZKatalog?
>> SELECT *
>> FROM Katalog
>> WHERE the_primary_key_column IN (
>>    SELECT the_primary_key_column FROM (
>>      SELECT all,columns,except,ddatum FROM Katalog
>>      INTERSECT
>>      SELECT all,columns,except,ddatum FROM ZKatalog
>>    )
>> );
>>
>> (If you do not need the DDatum values, use only the inner subquery.)
>>
>>> (4) Select records from Katalog and ZKatalog with same DDatum content?
>> SELECT * FROM Katalog WHERE DDatum IN (SELECT DDatum FROM ZKatalog);
>>
>>
>> Regards,
>> Clemens
>> _______________________________________________
>> sqlite-users mailing list
>> [hidden email]
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
.
Clemens and Ryan, this helps a lot.
Thank you --  Joe
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

JOIN vs. INTERSECT speed and efficiency

R Smith
In reply to this post by R Smith
Hi all,

For those interested, after a recent thread from a poster called Joe
asking about the most efficient way to find values that coincide from
two separate tables, a response from Clemens Ladisch and a further
elaboration from myself suggested the following:

SELECT v FROM t1 INTERSECT SELECT v FROM t2

I was going to mention that that INTERSECT is simply equivalent to a
JOIN on v (if there are no duplicate entries) of the form:

SELECT v FROM t1 JOIN t2 ON t2.v = t1.v

But then the question arose: Which is more efficient? - So at the time
of posting I refrained from mentioning it, but it kept in my mind, until
I decided to try and answer the question.

Next step was to set up an experiment to mimic the above and test it.


*Experiment* - If the reader is interested, or would like to compare and
contrast with possibly different DB engines, versions, platforms or
settings, read on, else just skip ahead to the Conclusion bit.

Setup:
- Using SQLite.
- Ran on my slowest (to magnify efficiency difference) notebook (Win10
64bit)
- using sqlitespeed (http://www.sqlc.rifin.co.za/) but you can use any
system since the objective measurement here is qualitative difference
ratio, not quantitative direct units.
- with the standard pre-compiled 32-bit SQLite DLL available from the
SQLite download page (https://sqlite.org/download.html)
- test DLL version: 3.17.0


Tests:

I will post the first test script in full with added annotation to
explain the rationale and show the full working, but the subsequent
scripts I will only post the money-shots and avoid the repetitive parts.
Note that the test script is run in a transaction which is not shown
explicitly and at the end rolled back to avoid differences in tests on
account of the file content changing during the tests.

Test 1:

   -- SQLite version 3.17.0  [ Release: 2017-02-13 ]  on SQLitespeed
version 2.0.2.4.
   --
================================================================================================

-- This first table holds Parameters for testing.
-- idxCount specifies how many entries to make in the test tables. (i.e.
Number of added rows).
-- rndDiv is the modality chosen for the random values to control the
magnitude of random values
-- so that the ratio of expected matching values to value-count can be
controlled.

CREATE TABLE p (
   idxCount INT,
   rndDiv INT
);

-- Here we add the Test parameter values. This is the only section that
changes between
-- the first 4 tests.
-- This test has a row-count of 100000 (One-Hundred-Thousand) and a
modality of 1000 times the
-- idxCount so that in all the rows, we expect roughly only about 0.1%
of matches.
-- Note that if every value in table 1 has a 1-in-1000 chance of having
a matching valuein
-- table 2, then for 100,000 rows there should be ~100 matches which is
~0.1% of the row-count.

-- This test was picked as the full example since it produces the least
amount of rows.

INSERT INTO p(idxCount, rndDiv) VALUES (100000, 1000 * 100000);

-- Confirm the parameters.
SELECT * FROM p;

   --   idxCount   |    rndDiv
   -- ------------ | ------------
   --    100000    |   100000000


-- Following are the two test tables t1 and t2 each containing one
column v to hold the random values.
CREATE TABLE t1 (
   v INT
);

CREATE TABLE t2 (
   v INT
);

-- Here we fill the two test tables according to our parameters above.
WITH R(idx,Val) AS (
     SELECT 0, 1000
     UNION ALL
     SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT INTO t1(v) SELECT Val FROM R
;

WITH R(idx,Val) AS (
     SELECT 0, 1000
     UNION ALL
     SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT INTO t2(v) SELECT Val FROM R
;

-- Test 1 - using JOIN.
-- Note that this is the second iteration of tests. In the first
iteration I had the INTERSECT
-- query before the JOIN query, but I swapped it around for the second
iteration because
-- the JOIN was generally faster for low match counts so I wanted to
ensure neither test
-- gained an advantage from being second due to some or other caching by
the Query planner.

SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

   --     v
   -- --------
   --     1000
   --  2312703
   -- 87720925
   -- 29736409
   -- 62527166
   -- 25171143
   -- 24168552
   -- 86449735
   -- 83544235
   -- 45671286
   -- 69343788
   -- 42394827
   -- 92142603
   -- 87106564
   --  4593574
   --  6914348
   -- 16358938
   -- 12568863
   -- 20105830
   -- 91354724
   -- 87992157
   -- 17605134
   -- 28584588
   -- 78633251
   -- 98955905
   -- 19979768
   -- 20956231
   -- 30819730
   -- 93942875
   -- 45346494
   -- 96346064
   -- 32224203
   -- 89622511
   -- 39267531
   --  3116133
   -- 31172079
   -- 87828771
   -- 82931503
   -- 89108957
   -- 80067973
   -- 89366000
   -- 68319117
   -- 37802556
   -- 64391927
   -- 84515054
   -- 11071461
   -- 40682706
   -- 78441313
   -- 17977211
   --   659811
   -- 14504321
   -- 57479870
   -- 44134958
   -- 94642155
   -- 37520503
   -- 48456547
   -- 99084161
   -- 84938198
   -- 14002200
   -- 78221942
   -- 45576636
   -- 42790976
   --  6533108
   -- 92535066
   -- 89299140
   -- 22666602
   -- 63098262
   -- 69021687
   -- 88420428
   -- 20451211
   -- 30003757
   -- 90711051
   -- 43222100
   -- 30233728
   -- 61634416
   -- 86830002
   -- 40248177
   --  1237709
   -- 58707473
   -- 39900130
   -- 92531343
   -- 78010626
   -- 67793752
   -- 29390573
   -- 23079807
   -- 80883211
   -- 15167730
   -- 15532946
   -- 84444860
   -- 89113099
   -- 70441636
   -- 47253234
   -- 49325743
   -- 55033258
   -- 90556390
   -- 86704459
   -- 45170002
   -- 47126034
   --  9496434
   -- 42163489
   -- 92938539
   -- 55105343
   -- 31110805
   -- 80062347
   -- 65114976
   --  8309007
   -- 78366148
   -- 44587619
   -- 27312379
   -- 56915827
   -- 57706349
   -- 84004545
   -- 52791919
   --  8606523
   -- 27763198
   -- 21761548
   -- 69642690

   --    Item Stats:  Item No:           8 Query Size (Chars):  45
   --                 Result Columns:    1 Result Rows:         117
   --                 VM Work Steps:     1200488 Rows Modified:       0
   --                 Full Query Time:   0d 00h 00m and 00.177s
   --                 Query Result:      Success.
   --
------------------------------------------------------------------------------------------------


-- The INTERSECT test:

SELECT v FROM t1 INTERSECT SELECT v FROM t2;

   --       v
   -- ------------
   --       1000
   --     659811
   --    1237709
   --    2312703
   --    3116133
   --    4593574
   --    6533108
   --    6914348
   --    8309007
   --    8606523
   --    9496434
   --   11071461
   --   12568863
   --   14002200
   --   14504321
   --   15167730
   --   15532946
   --   16358938
   --   17605134
   --   17977211
   --   19979768
   --   20105830
   --   20451211
   --   20956231
   --   21761548
   --   22666602
   --   23079807
   --   24168552
   --   25171143
   --   27312379
   --   27763198
   --   28584588
   --   29390573
   --   29736409
   --   30003757
   --   30233728
   --   30819730
   --   31110805
   --   31172079
   --   32224203
   --   37520503
   --   37802556
   --   39267531
   --   39900130
   --   40248177
   --   40682706
   --   42163489
   --   42394827
   --   42790976
   --   43222100
   --   44134958
   --   44587619
   --   45170002
   --   45346494
   --   45576636
   --   45671286
   --   47126034
   --   47253234
   --   48456547
   --   49325743
   --   52791919
   --   55033258
   --   55105343
   --   56915827
   --   57479870
   --   57706349
   --   58707473
   --   61634416
   --   62527166
   --   63098262
   --   64391927
   --   65114976
   --   67793752
   --   68319117
   --   69021687
   --   69343788
   --   69642690
   --   70441636
   --   78010626
   --   78221942
   --   78366148
   --   78441313
   --   78633251
   --   80062347
   --   80067973
   --   80883211
   --   82931503
   --   83544235
   --   84004545
   --   84444860
   --   84515054
   --   84938198
   --   86449735
   --   86704459
   --   86830002
   --   87106564
   --   87720925
   --   87828771
   --   87992157
   --   88420428
   --   89108957
   --   89113099
   --   89299140
   --   89366000
   --   89622511
   --   90556390
   --   90711051
   --   91354724
   --   92142603
   --   92531343
   --   92535066
   --   92938539
   --   93942875
   --   94642155
   --   96346064
   --   98955905
   --   99084161

   --    Item Stats:  Item No:           9 Query Size (Chars):  46
   --                 Result Columns:    1 Result Rows:         117
   --                 VM Work Steps:     1100093 Rows Modified:       0
   --                 Full Query Time:   0d 00h 00m and 00.228s
   --                 Query Result:      Success.
   --
------------------------------------------------------------------------------------------------

-- As you can see, the JOIN is roughly 30% faster than INTERSECT (0.177s
vs. 0.228s) for this test.
-- Also note that the Virtual-Machine work steps needed to accomplish
the result differed
-- between the two tests (1.2 million vs. 1.1 million) which is very
interesting because the
-- JOIN used ~9% more VM steps to complete (i.e. less efficient), but
did so in a faster time.

-- Another interesting thing to note: The INTERSECT test produces
ORDERED output, which
-- suggests that an ORDER-BY addition to the query would favour the
INTERSECT method.
-- (This led to Test 6).

-- This is the query plan for the INTERSECT query:

EXPLAIN QUERY PLAN SELECT v FROM t1 INTERSECT SELECT v FROM t2;

   -- selectid | order | from | detail
   -- -------- | ----- | ---- |
---------------------------------------------------------
   --     1    |   0   |   0  | SCAN TABLE t1
   --     2    |   0   |   0  | SCAN TABLE t2
   --     0    |   0   |   0  | COMPOUND SUBQUERIES 1 AND 2 USING TEMP
B-TREE (INTERSECT)


-- And the plan for the JOIN query:

EXPLAIN QUERY PLAN SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

   -- selectid | order | from | detail
   -- -------- | ----- | ---- |
----------------------------------------------------
   --     0    |   0   |   0  | SCAN TABLE t1
   --     0    |   1   |   1  | SEARCH TABLE t2 USING AUTOMATIC COVERING
INDEX (v=?)

-- Cleanup.
DROP TABLE t1;

DROP TABLE t2;

   --   Script Stats: Total Script Execution Time:     0d 00h 00m and
00.718s
   --                 Total Script Query Time:         0d 00h 00m and
00.683s
   --                 Total Database Rows Changed: 200003
   --                 Total Virtual-Machine Steps: 8501075
   --                 Last executed Item Index:        13
   --                 Last Script Error:
   --
------------------------------------------------------------------------------------------------


-- Note form the log below that both queries created an automatic index
(unsurprisingly as suggested by the Query Plans above).

   -- 2017-09-06 16:10:46.021  |  [Success]    Script Success.
   -- 2017-09-06 16:10:46.066  |  [Success]    Transaction Rolled back.
   -- -------  DB-Engine Logs (Contains logged information from all DB
connections during run)  ------
   -- [2017-09-06 16:10:45.286] APPLICATION : Script
E:\Documents\SQLiteAutoScript.sql started at 16:10:45.286 on 06 September.
   -- [2017-09-06 16:10:45.604] ERROR (284) : automatic index on t2(v)
   -- [2017-09-06 16:10:46.014] ERROR (284) : automatic index on t2(v)
   --
================================================================================================


Test 2: Using idxCount of 100,000 rows and ~1% possible matches (rndDiv
now 100 x idxCount)


INSERT INTO p(idxCount, rndDiv) VALUES (100000, 100*100000);

SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

   --    v
   -- -------
   --   1000
   -- 2251245
   -- 6690293
   -- 8754086
   -- 6871656
...
...
   -- 3230498
   -- 6398647
   -- 5083104

   --    Item Stats:  Item No:           8 Query Size (Chars):  45
   --                 Result Columns:    1 Result Rows:         997
   --                 VM Work Steps:     1204006 Rows Modified:       0
   --                 Full Query Time:   0d 00h 00m and 00.187s
   --                 Query Result:      Success.
   --
------------------------------------------------------------------------------------------------

SELECT v FROM t1 INTERSECT SELECT v FROM t2;

   --       v
   -- ------------
   --     1000
   --     2058
   --     12265
...
...
   --    9980854
   --    9985094

   --    Item Stats:  Item No:           9 Query Size (Chars):  46
   --                 Result Columns:    1 Result Rows:         986
   --                 VM Work Steps:     1100517 Rows Modified:       0
   --                 Full Query Time:   0d 00h 00m and 00.353s
   --                 Query Result:      Success.
   --
------------------------------------------------------------------------------------------------

Test 2 Notes: VM steps almost identical to test 1. JOIN is now near
twice as fast.
Note also that JOIN produced 11 rows more than INTERSECT which suggests
we had 11 duplicate values in the tables.



Test 3: Using idxCount of 100,000 rows and ~10% possible matches (rndDiv
now 10 x idxCount)

INSERT INTO p(idxCount,rndDiv) VALUES (100000, 10*100000);

SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

   --    v
   -- ------
   --   1000
   -- 894336
   -- 296578
...
...
   -- 795418
   --  66041
   -- 800682

   --    Item Stats:  Item No:           8 Query Size (Chars):  45
   --                 Result Columns:    1 Result Rows:         9993
   --                 VM Work Steps:     1239992 Rows Modified:       0
   --                 Full Query Time:   0d 00h 00m and 00.267s
   --                 Query Result:      Success.
   --
------------------------------------------------------------------------------------------------


SELECT v FROM t1 INTERSECT SELECT v FROM t2;

   --       v
   -- ------------
   --       165
   --       301
   --       353
...
...
   --    999724
   --    999864
   --    999999

   --    Item Stats:  Item No:           9 Query Size (Chars):  46
   --                 Result Columns:    1 Result Rows:         9042
   --                 VM Work Steps:     1103714 Rows Modified:       0
   --                 Full Query Time:   0d 00h 00m and 00.285s
   --                 Query Result:      Success.
   --
------------------------------------------------------------------------------------------------

Test 3 Notes: Again similar VM steps though it seems the join gained
most added VM steps. JOIN and INTERSECT now performing very near equal.

Test 4: Using idxCount of 100,000 rows and ~100% possible matches
(rndDiv now 1 x idxCount)

INSERT INTO p(idxCount,rndDiv) VALUES (100000, 1*100000);


SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

   --    v
   -- ------
   --   1000
   -- 297764
   -- 386572
...
...
   --  18950
   --  24377

   --    Item Stats:  Item No:           8 Query Size (Chars):  45
   --                 Result Columns:    1 Result Rows:         99479
   --                 VM Work Steps:     1597937 Rows Modified:       0
   --                 Full Query Time:   0d 00h 00m and 00.812s
   --                 Query Result:      Success.
   --
------------------------------------------------------------------------------------------------

SELECT v FROM t1 INTERSECT SELECT v FROM t2;

   --   v
   -- -----
   --     0
   --     2
   --     5
   --     7
   --     8
   --    14
...
...
   -- 99992
   -- 99996
   -- 99998

   --    Item Stats:  Item No:           9 Query Size (Chars):  46
   --                 Result Columns:    1 Result Rows:         40027
   --                 VM Work Steps:     1069969 Rows Modified:       0
   --                 Full Query Time:   0d 00h 00m and 00.567s
   --                 Query Result:      Success.
   --
------------------------------------------------------------------------------------------------

Test 4 Notes: JOIN is using now more VM steps and INTERSECT is using
even less, and INTERSECT now clearly the faster of the two, not to
mention it still has ORDERED output while JOIN has not.

However...

In these higher match-count tests, the possibility of duplicate entries
in both tables t1 and t2 also increased, to the point where there is a
large percentage of duplicates in test 4 which would have produced more
rows and slower execution from the JOIN query than the INTERSECT query,
especially in cases where the same value was duplicate in both tables.
This almost disqualifies Test 4 as an objective measure.

To better test this in the next tests, I changed the test script to only
allow rows in table t1 and t2 that have no duplicates (made v a PRIMARY
KEY and used INSERT OR IGNORE when populating the tables):

CREATE TABLE t1 (
   v INT PRIMARY KEY
);

CREATE TABLE t2 (
   v INT PRIMARY KEY
);

WITH R(idx,Val) AS (
     SELECT 0, 1000
     UNION ALL
     SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT OR IGNORE INTO t1(v) SELECT Val FROM R
;

WITH R(idx,Val) AS (
     SELECT 0, 1000
     UNION ALL
     SELECT idx+1, abs(random() % p.rndDiv) FROM R,p WHERE idx < p.idxCount
)
INSERT OR IGNORE INTO t2(v) SELECT Val FROM R
;

Test 5: Using idxCount of 100,000 rows and ~100% possible matches
(rndDiv now 1 x idxCount) - Same as test 4 but without duplicates.

SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v;

   --    Item Stats:  Item No:           8             Query Size
(Chars):  45
   --                 Result Columns:    1 Result Rows:         40099
   --                 VM Work Steps:     476303 Rows Modified:       0
   --                 Full Query Time:   0d 00h 00m and 00.411s
   --                 Query Result:      Success.
   --
------------------------------------------------------------------------------------------------

SELECT t1.v FROM t1 INTERSECT SELECT t2.v FROM t2;

   --    Item Stats:  Item No:           9             Query Size
(Chars):  52
   --                 Result Columns:    1 Result Rows:         40099
   --                 VM Work Steps:     775787 Rows Modified:       0
   --                 Full Query Time:   0d 00h 00m and 00.553s
   --                 Query Result:      Success.
   --
------------------------------------------------------------------------------------------------

Test 5 Notes: The no-duplicates test script favoured JOIN all the way to
100% matches (for both speed and efficiency, but with un-ordered results
for JOIN).


Test 6: Using idxCount of 100,000 rows and ~100% possible matches
(rndDiv now 1 x idxCount) - Same as Test 5 but now explicitly ordered.

SELECT t1.v FROM t1 JOIN t2 ON t2.v = t1.v ORDER BY t1.v;

   --    Item Stats:  Item No:           8             Query Size
(Chars):  59
   --                 Result Columns:    1 Result Rows:         39906
   --                 VM Work Steps:     475396 Rows Modified:       0
   --                 Full Query Time:   0d 00h 00m and 00.281s
   --                 Query Result:      Success.
   --
------------------------------------------------------------------------------------------------

SELECT t1.v FROM t1 INTERSECT SELECT t2.v FROM t2 ORDER BY t1.v;

   --    Item Stats:  Item No:           9             Query Size
(Chars):  66
   --                 Result Columns:    1 Result Rows:         39906
   --                 VM Work Steps:     1330447 Rows Modified:       0
   --                 Full Query Time:   0d 00h 00m and 00.263s
   --                 Query Result:      Success.
   --
------------------------------------------------------------------------------------------------

Test 6 Notes: Changing again to test for ordered results, INTERSECT was
indeed faster again, even for queries with no duplicates - but a much
bigger surprise here is that BOTH queries are significantly faster than
their un-ordered versions even though much less efficient in the
INTERSECT case (almost twice the VM steps but executing in half the time)!


*Deductions*

The following could be inferred from the tests:
-  I did not show the initial set-up tests, but it determined that the
table-size directly correlates with the measured differences in
efficiency - i.e. it is not a factor, so I picked a table-size (100K
rows) that was a good work-out for the Query engine but that ran
relatively quick in human terms.
-  JOIN wins convincingly for a low number of matches, but as the number
of matching values increase, INTERSECT gets more efficient and
eventually wins in situations where duplicate values occur.
-  JOIN's prowess is not linear, it does better (in ratio to INTERSECT)
at 1% matches than at 0.1% matches, and remain superior as long as no
duplicate values occur.
-  The output from INTERSECT is ordered.
-  Output for both JOIN and INTERSECT is faster if you use explicit
ordering on the Joined/Intesected column(s) - This surprised me the most.


*Conclusion*: To be most efficient: If you expect hitting more than 10%
matches out of the total rows in the tables and they contain possible
duplicate values or you need the output ordered - use INTERSECT. For all
other cases, use JOIN, and either way, use ORDER BY on the
Joined/Intersected column to make things faster still.


Cheers,
Ryan


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JOIN vs. INTERSECT speed and efficiency

Nico Williams
On Wed, Sep 06, 2017 at 08:05:57PM +0200, R Smith wrote:
> -- Another interesting thing to note: The INTERSECT test produces ORDERED
> -- output, which suggests that an ORDER-BY addition to the query would
> -- favour the INTERSECT method.

Nothing about INTERSECT requires it to produce ordered output.

Nothing about the JOIN case makes it not possible to produce ordered
output by accident.

You'll want to re-measure with an ORDER BY added.

In any case, this is quite interesting.  Many uses of JOIN are not
merely to filter results, but to construct joined result rows -- such
uses of JOIN cannot be optimized by using INTERSECT.  But for
filter-uses of JOIN... this might be a useful optimization for the
engine to learn.

Nico
--
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JOIN vs. INTERSECT speed and efficiency

R Smith
On 2017/09/06 8:26 PM, Nico Williams wrote:
> On Wed, Sep 06, 2017 at 08:05:57PM +0200, R Smith wrote:
>> -- Another interesting thing to note: The INTERSECT test produces ORDERED
>> -- output, which suggests that an ORDER-BY addition to the query would
>> -- favour the INTERSECT method.
> Nothing about INTERSECT requires it to produce ordered output.

No, and it was not suggested, it was just noted that it does, suggesting
that it could be the more performant choice when adding an ORDER BY
clause, which turned out to not only be true in terms of being the
better choice, but also that it itself sped up by simply adding the
ORDER BY clause as was demonstrated in Test 6.

> Nothing about the JOIN case makes it not possible to produce ordered
> output by accident.

Yet it doesn't seem to by accident, which would suggest that an ORDER BY
clause when added to the JOIN statements would incur an additional time
penalty for having to actually order the results - Yet, as again
demonstrated in Test 6, the ORDER BY actually sped up the JOIN query too
(perhaps via forcing the Index earlier or used in a different way) -
which was most interesting, and, as you noted, there is nothing about
the JOIN that precludes it from having ordered output, so this
optimization might be worthwhile.

> You'll want to re-measure with an ORDER BY added.

I did. It was done in Test 6. It showed significantly interesting
results. Was my explanation lacking in clarity or did it fall down the
TLDR; rabbit hole? :)


> In any case, this is quite interesting.  Many uses of JOIN are not
> merely to filter results, but to construct joined result rows -- such
> uses of JOIN cannot be optimized by using INTERSECT.  But for
> filter-uses of JOIN... this might be a useful optimization for the
> engine to learn.

I agree, and not only the INTERSECT optimization but the tests suggest
adding a silent ORDER BY would also be an optimization, though not sure
if the effort-to-pleasure ratio is low enough yet. Perhaps if re-doing
the tests with tables using several more non-Integer columns to see if
the optimization could be generalized across all kinds of data in some
way. I might pursue this later when I have some time.


Cheers,
Ryan
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JOIN vs. INTERSECT speed and efficiency

Nico Williams
On Wed, Sep 06, 2017 at 10:57:41PM +0200, R Smith wrote:

> On 2017/09/06 8:26 PM, Nico Williams wrote:
> >On Wed, Sep 06, 2017 at 08:05:57PM +0200, R Smith wrote:
> >>-- Another interesting thing to note: The INTERSECT test produces ORDERED
> >>-- output, which suggests that an ORDER-BY addition to the query would
> >>-- favour the INTERSECT method.
> >Nothing about INTERSECT requires it to produce ordered output.
>
> No, and it was not suggested, it was just noted that it does, suggesting
> that it could be the more performant choice when adding an ORDER BY clause,
> which turned out to not only be true in terms of being the better choice,
> but also that it itself sped up by simply adding the ORDER BY clause as was
> demonstrated in Test 6.

I point this out only because users should know not to assume result set
order without an ORDER BY.

> >Nothing about the JOIN case makes it not possible to produce ordered
> >output by accident.
>
> Yet it doesn't seem to by accident, which would suggest that an ORDER BY
> clause when added to the JOIN statements would incur an additional time

See my explanation below.

> penalty for having to actually order the results - Yet, as again
> demonstrated in Test 6, the ORDER BY actually sped up the JOIN query too
> (perhaps via forcing the Index earlier or used in a different way) - which
> was most interesting, and, as you noted, there is nothing about the JOIN
> that precludes it from having ordered output, so this optimization might be
> worthwhile.
>
> >You'll want to re-measure with an ORDER BY added.
>
> I did. It was done in Test 6. It showed significantly interesting results.
> Was my explanation lacking in clarity or did it fall down the TLDR; rabbit
> hole? :)

I saw it.

I think the ORDER BY helped the JOIN because it caused SQLite3 to scan a
covering index (the primary) key instead of scanning the table.  That
without it SQLite3 didn't use that index is rather inefficient, though
it may not be a win in real-world use-cases to fix that.

Of course, IF you had used WITHOUT ROWIDs you would have found (I'm
sure) that the JOIN also produced ordered results by default and was as
fast as in your 6th test.

In fact, INTERSECT does an implicit ordering step by building a b-tree
that the JOIN with the index scan optimization does not have to build at
all, so JOIN has a leg up on INTERSECT in that sense.

> I agree, and not only the INTERSECT optimization but the tests suggest
> adding a silent ORDER BY would also be an optimization, though not sure if

I think here the ORDER BY merely forced SQLite3 to pick the more
efficient query plan, and that it's probably a (rather minor) optimizer
bug that it didn't do so to begin with without the ORDER BY.

> the effort-to-pleasure ratio is low enough yet. Perhaps if re-doing the
> tests with tables using several more non-Integer columns to see if the
> optimization could be generalized across all kinds of data in some way. I
> might pursue this later when I have some time.

If you'll redo this I'd urge you to use WITHOUT ROWIDS.  First, that's
almost always the right thing to do anyways.  Second, it won't perform
worse but likely will perform better.  Third, write performance
definitely should improve with WITHOUT ROWIDS.  Fourth, I think users
are starting to use WITHOUT ROWIDS more, so testing that seems more
useful.

Nico
--
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JOIN vs. INTERSECT speed and efficiency

R Smith

On 2017/09/06 11:17 PM, Nico Williams wrote:
>
> If you'll redo this I'd urge you to use WITHOUT ROWIDS.  First, that's
> almost always the right thing to do anyways.  Second, it won't perform
> worse but likely will perform better.  Third, write performance
> definitely should improve with WITHOUT ROWIDS.  Fourth, I think users
> are starting to use WITHOUT ROWIDS more, so testing that seems more
> useful.
>
> Nico

All good points which I will definitely heed. I will of course test both
types of tables but ensure that when using row_id tables I do not
inadvertently have possible PK short-circuiting of query plans.

It's still remarkable that in both tests 5 and 6 I've used the very same
PK setup, yet Test 6 was significantly faster with the added ORDER BY
clause. In tests 1 through 4 I did not use a PK at all, just plain INT
data field, but then I did not test the ORDER BY in those tests.

It might turn out to be a wild goose chase, but that will be easily
evident when testing without the PK and with more realistic real-world
data. I'll do that this weekend.

Cheers,
Ryan

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JOIN vs. INTERSECT speed and efficiency

Nico Williams
On Wed, Sep 06, 2017 at 11:54:35PM +0200, R Smith wrote:
> It's still remarkable that in both tests 5 and 6 I've used the very same PK
> setup, yet Test 6 was significantly faster with the added ORDER BY clause.
> In tests 1 through 4 I did not use a PK at all, just plain INT data field,
> but then I did not test the ORDER BY in those tests.

It's an optimizer issue.  It didn't occur to the optimizer that scanning
a covering index was better than scanning the table because the covering
index has (in this case, and always, for rowid tables anyways) strictly
less contents to read and decode.

Scanning the covering index has the happy side-effect (if you wanted if) of
producing ordered results and making an equivalent ORDER BY free.

(Whereas scanning the table will produce results in rowid order, which is
almost certainly not useful unless you explicitly wanted an INTEGER
PRIMARY KEY.)

Note that INTERSECT could have used a hash table, thus producing
unordered results (most likely).  But SQLite3 only knows b-trees.

All of this explains the accidental ordering / non-ordering.  And also
why you shouldn't count on it: it's all implementaton details!

But just because you know to add an ORDER BY doesn't mean you shouldn't
think to make it match some suitable index...  The optimizer is nice,
but you still have to think a little bit like an optimizer yourself :(

> It might turn out to be a wild goose chase, but that will be easily evident
> when testing without the PK and with more realistic real-world data. I'll do
> that this weekend.

You can't have a PK-less table -- SQLite3 always want some PK, even if
it's a hidden rowid column.  WITHOUT ROWID tables make this clearer:

  sqlite> create table t(a text) without rowid;
  Error: PRIMARY KEY missing on table t

Adding an explicit PK implicitly added the covering index that sped up
the JOIN (once you forced the optimizer to use it).  But you should just
always have had an explicit PK and WITHOUT ROWID.

You still found something interesting about using JOINs to filter result
sets (as opposed to adding columns to the the result set).  Something to
keep in mind...  I do a lot of queries where some of the JOINed tables
are used only for filtering.  It's not always possible to convert such
queries to INTERSECT, but it might be possible for SQLite3 to learn how
to perform the equivalent optimization internally, and when to do it.

Nico
--
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JOIN vs. INTERSECT speed and efficiency

Keith Medcalf
In reply to this post by R Smith

Try the same test using 147 columns in each table.

1 column is rather trivial.  Even a kindergarten kid could do it in no time using crayons and the wall.

And of course the output of INTERSECT is ordered.  It uses a sorter to perform the intersection.  And of course the output is distinct, it uses a sorter to perform the intersection.

In other words,

select ... a bunch of columns ...
from table1
intersect
select ... an eual number of bunch of columns ...
from table2

is equivalent to

select ... the bunch of columns ...
  from table1
 where exists (select * from table2
                where (for each column position in table 2 equals that column position from table1, plus of course all the added stuff needed to handle nullable columns))
group by ... the bunch of columns ...;

In other words except in very trivial cases (like having only one column that is not nullable) it will be very difficult to write a "correct" JOIN or correlated subquery that emulates an INTERSECT.

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.


>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>[hidden email]] On Behalf Of R Smith
>Sent: Wednesday, 6 September, 2017 14:58
>To: [hidden email]
>Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
>
>On 2017/09/06 8:26 PM, Nico Williams wrote:
>> On Wed, Sep 06, 2017 at 08:05:57PM +0200, R Smith wrote:
>>> -- Another interesting thing to note: The INTERSECT test produces
>ORDERED
>>> -- output, which suggests that an ORDER-BY addition to the query
>would
>>> -- favour the INTERSECT method.
>> Nothing about INTERSECT requires it to produce ordered output.
>
>No, and it was not suggested, it was just noted that it does,
>suggesting
>that it could be the more performant choice when adding an ORDER BY
>clause, which turned out to not only be true in terms of being the
>better choice, but also that it itself sped up by simply adding the
>ORDER BY clause as was demonstrated in Test 6.
>
>> Nothing about the JOIN case makes it not possible to produce
>ordered
>> output by accident.
>
>Yet it doesn't seem to by accident, which would suggest that an ORDER
>BY
>clause when added to the JOIN statements would incur an additional
>time
>penalty for having to actually order the results - Yet, as again
>demonstrated in Test 6, the ORDER BY actually sped up the JOIN query
>too
>(perhaps via forcing the Index earlier or used in a different way) -
>which was most interesting, and, as you noted, there is nothing about
>the JOIN that precludes it from having ordered output, so this
>optimization might be worthwhile.
>
>> You'll want to re-measure with an ORDER BY added.
>
>I did. It was done in Test 6. It showed significantly interesting
>results. Was my explanation lacking in clarity or did it fall down
>the
>TLDR; rabbit hole? :)
>
>
>> In any case, this is quite interesting.  Many uses of JOIN are not
>> merely to filter results, but to construct joined result rows --
>such
>> uses of JOIN cannot be optimized by using INTERSECT.  But for
>> filter-uses of JOIN... this might be a useful optimization for the
>> engine to learn.
>
>I agree, and not only the INTERSECT optimization but the tests
>suggest
>adding a silent ORDER BY would also be an optimization, though not
>sure
>if the effort-to-pleasure ratio is low enough yet. Perhaps if re-
>doing
>the tests with tables using several more non-Integer columns to see
>if
>the optimization could be generalized across all kinds of data in
>some
>way. I might pursue this later when I have some time.
>
>
>Cheers,
>Ryan
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JOIN vs. INTERSECT speed and efficiency

R Smith

On 2017/09/07 3:43 AM, Keith Medcalf wrote:
> Try the same test using 147 columns in each table.

Exactly the plan for this weekend :)

> 1 column is rather trivial.  Even a kindergarten kid could do it in no time using crayons and the wall.

So? That is non-sequitur, I am sure given enough crayons, wall-space and
time, a kindergarten kid can do it with 147 columns too. That says
exactly nothing about the possible efficiencies of different methods. If
however the 1-columness of the test gets somehow advantaged by being the
PK (as Nico pointed out) or real world data such as TEXT entries sort
slower than INTs, then it might affect it, so the 147 column tests will
tell.


> In other words except in very trivial cases (like having only one column that is not nullable) it will be very difficult to write a "correct" JOIN or correlated subquery that emulates an INTERSECT.

Well I agree, but it is those trivial cases that are of interest here,
and if there is a general JOIN optimization to be had. The INTERSECT
test merely served as the catalyst to put us on the trail of the
possible JOIN optimization, if there is even an optimization to be had
(it might yet be a wild goose chase, which you seem to have your money
on, so watch this space, I'll graciously accept your "told ya!" later
after testing).


Cheers,
Ryan

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JOIN vs. INTERSECT speed and efficiency

Nico Williams
In reply to this post by Keith Medcalf
On Wed, Sep 06, 2017 at 07:43:07PM -0600, Keith Medcalf wrote:

> Try the same test using 147 columns in each table.
>
> 1 column is rather trivial.  Even a kindergarten kid could do it in no
> time using crayons and the wall.
>
> [...]
>
> In other words except in very trivial cases (like having only one
> column that is not nullable) it will be very difficult to write a
> "correct" JOIN or correlated subquery that emulates an INTERSECT.

Yup.  But that doesn't mean that the engine couldn't internally build a
result-set from the query without some filtering JOIN, then implement
the same strategy as an INTERSECT.  You can't do this in SQL if the
filter table has different shape than the result set, but the engine
might be able to do it.

On the other hand, building a complete result set first is... not online
behavior.  If the result set size is enormous, then the INTERSECT
approach is going to make the user very unhappy!

I do think OP's tests point out a case where SQLite3 is pessimally
picking table scan over covering index scan...

...though scanning the index
means that there will be no rowid column in the result, which might
actually be a compatibility issue when using rowid tables, so maybe
SQLite3 is doing exactly the right thing?

I don't think that pessimization is too consequential as users can
improve the situation by adding ORDER BY clauses or using WITHOUT ROWID.

Nico
--
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JOIN vs. INTERSECT speed and efficiency

David Raymond
In reply to this post by R Smith
Although it may not translate as well to the more complex examples, would you also consider adding the IN operator to your tests? I found for example that "select v from t1 where v in t2;" did even better than the join or the intersect.

Other "am I right in thinking this" question: INTERSECT is only going to be viable when comparing full records, correct? If you're looking to filter table A by whether its primary key is also a primary key for table B, but ignoring the other fields in both, then INTERSECT becomes not an option, or at least starts making the query more complex/ugly... correct?



-----Original Message-----
From: sqlite-users [mailto:[hidden email]] On Behalf Of R Smith
Sent: Thursday, September 07, 2017 8:06 AM
To: [hidden email]
Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency


On 2017/09/07 3:43 AM, Keith Medcalf wrote:
> Try the same test using 147 columns in each table.

Exactly the plan for this weekend :)

> 1 column is rather trivial.  Even a kindergarten kid could do it in no time using crayons and the wall.

So? That is non-sequitur, I am sure given enough crayons, wall-space and
time, a kindergarten kid can do it with 147 columns too. That says
exactly nothing about the possible efficiencies of different methods. If
however the 1-columness of the test gets somehow advantaged by being the
PK (as Nico pointed out) or real world data such as TEXT entries sort
slower than INTs, then it might affect it, so the 147 column tests will
tell.


> In other words except in very trivial cases (like having only one column that is not nullable) it will be very difficult to write a "correct" JOIN or correlated subquery that emulates an INTERSECT.

Well I agree, but it is those trivial cases that are of interest here,
and if there is a general JOIN optimization to be had. The INTERSECT
test merely served as the catalyst to put us on the trail of the
possible JOIN optimization, if there is even an optimization to be had
(it might yet be a wild goose chase, which you seem to have your money
on, so watch this space, I'll graciously accept your "told ya!" later
after testing).


Cheers,
Ryan

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JOIN vs. INTERSECT speed and efficiency

R Smith
On 2017/09/07 6:31 PM, David Raymond wrote:
> Although it may not translate as well to the more complex examples, would you also consider adding the IN operator to your tests? I found for example that "select v from t1 where v in t2;" did even better than the join or the intersect.

Will do. The only thing I have somewhat against that specific query form
is that it doesn't work in other engines (seems to not be standard). But
for optimization in SQLite specifically that's perfect.

> Other "am I right in thinking this" question: INTERSECT is only going to be viable when comparing full records, correct? If you're looking to filter table A by whether its primary key is also a primary key for table B, but ignoring the other fields in both, then INTERSECT becomes not an option, or at least starts making the query more complex/ugly... correct?

INTERSECT will happily match however many columns you desire (and
specify), there is no need to match full records or single keys
specifically.


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JOIN vs. INTERSECT speed and efficiency

Nico Williams
On Thu, Sep 07, 2017 at 09:51:07PM +0200, R Smith wrote:
> INTERSECT will happily match however many columns you desire (and specify),
> there is no need to match full records or single keys specifically.

But the two queries on either side of the set operator must have the
same number of columns (and in strongly-typed RDBMSes, the same types).

That's a huge constraint.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JOIN vs. INTERSECT speed and efficiency

David Raymond
In reply to this post by R Smith
Expanding things for when you get bored, in addition to <INNER> JOIN vs INTERSECT vs IN I'd also be interested in <OUTER> JOIN vs EXCEPT vs NOT IN, as I tend to do more exclusion rather than intersection.

The straight up "IN tablename" may be SQLite only, but it also supports IN (subquery) so "select v from t1 where v in (select v from t2)" also works, although the "IN tablename" is so much cleaner to read in my opinion.

create table statesToSkip (
  state text primary key collate nocase
) without rowid;

select foo from bar where state not in statesToSkip.

vs

select foo from bar where state not in (select state from statesToSkip)

vs

select bar.foo from bar left outer join statesToSkip
on (bar.state = statesToSkip.state)
where statesToSkip.state is null;


-----Original Message-----
From: sqlite-users [mailto:[hidden email]] On Behalf Of R Smith
Sent: Thursday, September 07, 2017 3:51 PM
To: [hidden email]
Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency

On 2017/09/07 6:31 PM, David Raymond wrote:
> Although it may not translate as well to the more complex examples, would you also consider adding the IN operator to your tests? I found for example that "select v from t1 where v in t2;" did even better than the join or the intersect.

Will do. The only thing I have somewhat against that specific query form
is that it doesn't work in other engines (seems to not be standard). But
for optimization in SQLite specifically that's perfect.

> Other "am I right in thinking this" question: INTERSECT is only going to be viable when comparing full records, correct? If you're looking to filter table A by whether its primary key is also a primary key for table B, but ignoring the other fields in both, then INTERSECT becomes not an option, or at least starts making the query more complex/ugly... correct?

INTERSECT will happily match however many columns you desire (and
specify), there is no need to match full records or single keys
specifically.


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JOIN vs. INTERSECT speed and efficiency

Keith Medcalf
In reply to this post by David Raymond

>Other "am I right in thinking this" question: INTERSECT is only going
>to be viable when comparing full records, correct? If you're looking
>to filter table A by whether its primary key is also a primary key
>for table B, but ignoring the other fields in both, then INTERSECT
>becomes not an option, or at least starts making the query more
>complex/ugly... correct?

No.  The data must merely be the same shape (order, number of columns):

SELECT c1, c2, c3, c8 from t1
INTERSECT
SELECT x5, g4, q7, b3 from t2;

each of t1 and t2 may have fields named a1 a2 a4 ... a26 b1 b2 b3 ... b26 ... z1 z2 z3 ... z26 (for a total of 676 fields per table).

The comparison is of resulting matrices, not underlying tables.  Of course, if you do "SELECT * from table1;" the * is merely "syntactic sugar" (a short way of spelling) the complete, in declaration order, list of explicit columns.  (So, if and only if the ROWID is an "explicit column" is it used in the intersection, otherwise it is not -- and concomitantly if you explicitly list the columns to intersect, then the rowid participates if and only if you have included it in the list of data to intersect.)

INTERSECT / EXCEPT / UNION are matrix operations.  The RHS (select statement before the operator) and LHS (select statement after the operator) provide the two matrices on which the operation is performed.  Matrixes do not have column names, merely ordinal positions (column 1, column 2, column 3 and so forth).  Similarly these operations do not care about column names, merely that the order (number of columns in each matrix) is the same.  Comparisons are done by ordinal position of the item in the row.  The output is a matrix.  It only appears "row at a time" because of the primitive method of operation of (no matter how advanced) computers which can only perform operations as a serial sequence of steps.  For "user convenience" the output column names are set to the RHS column names.

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>[hidden email]] On Behalf Of David Raymond
>Sent: Thursday, 7 September, 2017 10:31
>To: SQLite mailing list
>Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
>
>Although it may not translate as well to the more complex examples,
>would you also consider adding the IN operator to your tests? I found
>for example that "select v from t1 where v in t2;" did even better
>than the join or the intersect.
>
>
>
>
>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>[hidden email]] On Behalf Of R Smith
>Sent: Thursday, September 07, 2017 8:06 AM
>To: [hidden email]
>Subject: Re: [sqlite] JOIN vs. INTERSECT speed and efficiency
>
>
>On 2017/09/07 3:43 AM, Keith Medcalf wrote:
>> Try the same test using 147 columns in each table.
>
>Exactly the plan for this weekend :)
>
>> 1 column is rather trivial.  Even a kindergarten kid could do it in
>no time using crayons and the wall.
>
>So? That is non-sequitur, I am sure given enough crayons, wall-space
>and
>time, a kindergarten kid can do it with 147 columns too. That says
>exactly nothing about the possible efficiencies of different methods.
>If
>however the 1-columness of the test gets somehow advantaged by being
>the
>PK (as Nico pointed out) or real world data such as TEXT entries sort
>slower than INTs, then it might affect it, so the 147 column tests
>will
>tell.
>
>
>> In other words except in very trivial cases (like having only one
>column that is not nullable) it will be very difficult to write a
>"correct" JOIN or correlated subquery that emulates an INTERSECT.
>
>Well I agree, but it is those trivial cases that are of interest
>here,
>and if there is a general JOIN optimization to be had. The INTERSECT
>test merely served as the catalyst to put us on the trail of the
>possible JOIN optimization, if there is even an optimization to be
>had
>(it might yet be a wild goose chase, which you seem to have your
>money
>on, so watch this space, I'll graciously accept your "told ya!" later
>after testing).
>
>
>Cheers,
>Ryan
>
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

JOIN vs. INTERSECT vs. WHERE IN (...) - speed and efficiency differences

R Smith
In reply to this post by David Raymond
Full tests completed with findings ranging from less interesting to
exposing a rather significant inefficiency in SQLite.

I won't post all the tests because that would take far too much space,
in stead I will simply discuss the experiment and findings and post the
test script so that anyone can verify the findings.


*Problem*:
Earlier tests suggested that, in a query where the common rows between
two tables need to be found, JOIN is faster than INTERSECT when there
are very few matches with efficiencies evening out as the number of
matches increase. It suggested, more pertinently, that an ORDER BY
clause when added to both types of queries improved all-round
performance by a significant margin (moreso in JOIN's case).
After examination of the tests, comments suggested that a lot of the
efficiency differences may be due to:
- row_id involvement,
- checks only done on Primary key values as opposed to common row
values, and
- test tables having one or two columns only not matching any real-world
scenario which made the findings irrelevant.
It was also pointed out that apart from JOIN and INTERSECT, a WHERE IN
(SELECT FROM ...) clause could achieve the same query result set and
should form part of the tests.


*Purpose*:
This experiment and tests aimed to establish the differences in
efficiency for queries with the same results that could be posed in
different ways to the Query Planner in specifically SQLite.
The possible ways to ask the same question that was tested included
similar queries (producing the same result set) when finding the common
set of rows between two tables using JOIN, INTERSECT and WHERE IN (...),
and secondly similar queries finding the exclusive set of rows between
two tables using OUTER JOIN, EXCEPT and WHERE NOT IN (...).
[SQLite doesn't implement a full outer join so that specific test was
restricted to comparing EXCEPT vs. WHERE NOT IN (...)]


*Method*:
A simple SQL script that first establishes a table "p" with parameters
that dictate the data shape added to the tests. Next a random value
table "rnd" is set up with the number of rows and randomness scope
dictated by table "p" and then two test tables "t1" and "t2" with each
151 columns (1 Key and 30 each of 5 data types - INT, REAL, NUMERIC
(Dates), BLOB and TEXT) which gets populated with randomly selected rows
from the "rnd" table so that many rows appear only in either table, and
some rows appear in both (coincidence).

The total number of rows to  choose from is dictated by the "p" table's
"rndCount" field and the approximate percentage of rows that appear in
both tables is dictated by the "rndAddFreq" (random row adding
frequency) field.
Up to here the tests are all similar, except that for the EXCEPT / JOIN
WHERE NOT IN (...) a much higher frequency of coincidence between the
table rows were selected since the output count is opposite to the other
tests.
The queries designed to produce the exact same output using the
different methods discussed above are run in turn. The queries are then
repeated with an added ORDER BY clause.
Timings are measured for all.


*Test parameters* to ensure all potential problems mentioned above were
accounted for:
- Pre-setup of the random table ensures all the queries operate on the
exact same randomized data.
- Tables contain 150 data columns of all typical use types and with
varying data and data lengths (in the case of BLOB and TEXT at least)
- Tables have a Primary Key, but it does not alias the Row_ID
- Tests were repeated for both normal and WITHOUT ROWID tables
- Query field comparisons involved a full set of field types (except
BLOB columns) and not the primary key
- Test-bed using older/slower notebook and sqlitespeed on Windows 10
64bit prof with updated latest sqlite3 DLL file - version 3.20.1 Release
2017-08-24.


*General Notes*:
- Queries were tested in different orders to avoid caching effects, but
this made no difference - it seems there is no mystery caching when the
queries are so different.
- The EXCEPT vs. WHERE NOT IN (...) queries yielded consistently similar
times through a range of parameters, with or without ORDER BY clauses,
with or without ROWIDs, so their tests provided no real insight, is
probably both well-optimized already, and I won't refer to them again in
this document.


*Findings* pertaining to JOIN, INTERSECT and WHERE IN (...) type queries
on tables *with Row_ids*:
First important thing to note is that when not involving the primary key
and using these larger (wider) tables, the ORDER BY clause addition made
very little difference (sometimes completely zero difference). JOIN and
INTERSECT posted similar times for these larger (wider) tables with
INTERSECT consistently winning by a very slight margin.
The surprise here was that WHERE IN (...) queries were consistently
faster than both JOIN and INTERSECT (again with the added ORDER BY being
inconsequential). Not by a large margin though, 4-5% on average.
*RESULT*: No clear opportunity for optimization exists for the above case.


*Findings* pertaining to JOIN, INTERSECT and WHERE IN (...) type queries
on tables *WITHOUT Row_ids*:
(This is the full test posted below because it is the one that matters most)
INTERSECT AND WHERE IN (...) queries posted similar times here with
WHERE IN (...) being the slightly faster (similar to the above
findings), but BOTH were a good 10 - 15% faster than on the ROW-ID
tables, so WITHOUT ROWID tables seem to have a definite advantage here
(it is currently unclear to me why this is).

A troubling test is the JOIN on WITHOUT ROWID tables - it took several
orders of magnitude longer than any other test in the entire experiment.
Typical test times posted in the order of between 10 to 200
milliseconds. The JOIN query here took over a minute on this test bed -
that /must/ be a glaring mistake in the Query planner or match method
somewhere.

*RESULT*: No clear opportunity for optimization exists for INTERSECT and
WHERE IN (...) queries above, but the JOIN query has perhaps a QP
deficiency that needs fixing or a Query execution optimization that is
possible.


*Test Script*:
Note that you can increase / decrease the number of random generated
rows by adjusting the p.rndCount, and adjust the frequency of coinciding
rows (2..18 where 2 is highest and 18 lowest coincidence) with the
p.rndAddFreq values inserted into the p table. I found 100,000 rows and
18 frequency produces test tables with roughly 1000 coinciding hits for
the queries as a good measure.

You will want to place it inside a transaction maybe, turn timers on and
pipe the output to a file, it can get quite lengthy.

This posted version of the test script is precisely the one that
highlights the JOIN inefficiency mentioned above. On the test bed used,
it posted times in the order of 50-60 milliseconds for the INTERCEPT and
WHERE IN (...) queries, and /over a minute/ for the JOIN query. If you
simply remove the "WITHOUT ROWID" specifiers form the two test tables,
this entire script will run in under 3 seconds typically. Add the
"WITHOUT ROWID" (as below) and the script takes minutes. (I also tested
with older 3.17.0 DLL, it was even slower, but not significantly so).


-- Test Paremeters Table
CREATE TABLE p (
   rndCount INT,
   rndAddFreq INT,
   rndText TEXT
);

-- Test Parameters (and a random text block)
INSERT INTO p(rndCount, rndAddFreq, rndText) VALUES
(100000,
  18,
  'Sed ut perspiciatis, unde omnis iste natus error sit voluptatem
accusantium doloremque '||
  'laudantium, totam rem aperiam eaque ipsa, quae ab illo inventore
veritatis et quasi architecto '||
  'beatae vitae dicta sunt, explicabo. Nemo enim ipsam voluptatem, quia
voluptas sit, aspernatur '||
  'aut odit aut fugit, sed quia consequuntur magni dolores eos, qui
ratione voluptatem sequi '||
  'nesciunt, neque porro quisquam est, qui dolorem ipsum, quia dolor sit
amet consectetur '||
  'adipiscing velit, sed quia non numquam [do] eius modi tempora
inci[di]dunt, ut labore et dolore '||
  'magnam aliquam quaerat voluptatem.'
);

-- Randomized values table & population
CREATE TABLE rnd(ri INTEGER PRIMARY KEY, rr REAL, rl INT);

WITH R(i, rndInt, rndReal, rndLen) AS (
     SELECT 0,0,0.1,0
     UNION ALL
     SELECT R.i+1, abs(random() % (10*p.rndCount)), abs(random() %
(1000*p.rndCount))/1000.0, abs(random() % 64) + 1
       FROM R,p
      WHERE R.i < (p.rndCount * 2)
)
INSERT OR IGNORE INTO rnd(ri, rr, rl)
SELECT rndInt, rndReal, rndLen
   FROM R
  WHERE R.i > 0;

-- Test Tables with 151 columns of different Types
CREATE TABLE t1 (id INT PRIMARY KEY,
   i00 INT,  r00 REAL,  n00 NUMERIC,  b00 BLOB,  v00 TEXT,
   i01 INT,  r01 REAL,  n01 NUMERIC,  b01 BLOB,  v01 TEXT,
   i02 INT,  r02 REAL,  n02 NUMERIC,  b02 BLOB,  v02 TEXT,
   i03 INT,  r03 REAL,  n03 NUMERIC,  b03 BLOB,  v03 TEXT,
   i04 INT,  r04 REAL,  n04 NUMERIC,  b04 BLOB,  v04 TEXT,
   i05 INT,  r05 REAL,  n05 NUMERIC,  b05 BLOB,  v05 TEXT,
   i06 INT,  r06 REAL,  n06 NUMERIC,  b06 BLOB,  v06 TEXT,
   i07 INT,  r07 REAL,  n07 NUMERIC,  b07 BLOB,  v07 TEXT,
   i08 INT,  r08 REAL,  n08 NUMERIC,  b08 BLOB,  v08 TEXT,
   i09 INT,  r09 REAL,  n09 NUMERIC,  b09 BLOB,  v09 TEXT,
   i10 INT,  r10 REAL,  n10 NUMERIC,  b10 BLOB,  v10 TEXT,
   i11 INT,  r11 REAL,  n11 NUMERIC,  b11 BLOB,  v11 TEXT,
   i12 INT,  r12 REAL,  n12 NUMERIC,  b12 BLOB,  v12 TEXT,
   i13 INT,  r13 REAL,  n13 NUMERIC,  b13 BLOB,  v13 TEXT,
   i14 INT,  r14 REAL,  n14 NUMERIC,  b14 BLOB,  v14 TEXT,
   i15 INT,  r15 REAL,  n15 NUMERIC,  b15 BLOB,  v15 TEXT,
   i16 INT,  r16 REAL,  n16 NUMERIC,  b16 BLOB,  v16 TEXT,
   i17 INT,  r17 REAL,  n17 NUMERIC,  b17 BLOB,  v17 TEXT,
   i18 INT,  r18 REAL,  n18 NUMERIC,  b18 BLOB,  v18 TEXT,
   i19 INT,  r19 REAL,  n19 NUMERIC,  b19 BLOB,  v19 TEXT,
   i20 INT,  r20 REAL,  n20 NUMERIC,  b20 BLOB,  v20 TEXT COLLATE NOCASE,
   i21 INT,  r21 REAL,  n21 NUMERIC,  b21 BLOB,  v21 TEXT COLLATE NOCASE,
   i22 INT,  r22 REAL,  n22 NUMERIC,  b22 BLOB,  v22 TEXT COLLATE NOCASE,
   i23 INT,  r23 REAL,  n23 NUMERIC,  b23 BLOB,  v23 TEXT COLLATE NOCASE,
   i24 INT,  r24 REAL,  n24 NUMERIC,  b24 BLOB,  v24 TEXT COLLATE NOCASE,
   i25 INT,  r25 REAL,  n25 NUMERIC,  b25 BLOB,  v25 TEXT COLLATE BINARY,
   i26 INT,  r26 REAL,  n26 NUMERIC,  b26 BLOB,  v26 TEXT COLLATE BINARY,
   i27 INT,  r27 REAL,  n27 NUMERIC,  b27 BLOB,  v27 TEXT COLLATE BINARY,
   i28 INT,  r28 REAL,  n28 NUMERIC,  b28 BLOB,  v28 TEXT COLLATE BINARY,
   i29 INT,  r29 REAL,  n29 NUMERIC,  b29 BLOB,  v29 TEXT COLLATE BINARY
) WITHOUT ROWID;

CREATE TABLE t2 (id INT PRIMARY KEY,
   i00 INT,  r00 REAL,  n00 NUMERIC,  b00 BLOB,  v00 TEXT,
   i01 INT,  r01 REAL,  n01 NUMERIC,  b01 BLOB,  v01 TEXT,
   i02 INT,  r02 REAL,  n02 NUMERIC,  b02 BLOB,  v02 TEXT,
   i03 INT,  r03 REAL,  n03 NUMERIC,  b03 BLOB,  v03 TEXT,
   i04 INT,  r04 REAL,  n04 NUMERIC,  b04 BLOB,  v04 TEXT,
   i05 INT,  r05 REAL,  n05 NUMERIC,  b05 BLOB,  v05 TEXT,
   i06 INT,  r06 REAL,  n06 NUMERIC,  b06 BLOB,  v06 TEXT,
   i07 INT,  r07 REAL,  n07 NUMERIC,  b07 BLOB,  v07 TEXT,
   i08 INT,  r08 REAL,  n08 NUMERIC,  b08 BLOB,  v08 TEXT,
   i09 INT,  r09 REAL,  n09 NUMERIC,  b09 BLOB,  v09 TEXT,
   i10 INT,  r10 REAL,  n10 NUMERIC,  b10 BLOB,  v10 TEXT,
   i11 INT,  r11 REAL,  n11 NUMERIC,  b11 BLOB,  v11 TEXT,
   i12 INT,  r12 REAL,  n12 NUMERIC,  b12 BLOB,  v12 TEXT,
   i13 INT,  r13 REAL,  n13 NUMERIC,  b13 BLOB,  v13 TEXT,
   i14 INT,  r14 REAL,  n14 NUMERIC,  b14 BLOB,  v14 TEXT,
   i15 INT,  r15 REAL,  n15 NUMERIC,  b15 BLOB,  v15 TEXT,
   i16 INT,  r16 REAL,  n16 NUMERIC,  b16 BLOB,  v16 TEXT,
   i17 INT,  r17 REAL,  n17 NUMERIC,  b17 BLOB,  v17 TEXT,
   i18 INT,  r18 REAL,  n18 NUMERIC,  b18 BLOB,  v18 TEXT,
   i19 INT,  r19 REAL,  n19 NUMERIC,  b19 BLOB,  v19 TEXT,
   i20 INT,  r20 REAL,  n20 NUMERIC,  b20 BLOB,  v20 TEXT COLLATE NOCASE,
   i21 INT,  r21 REAL,  n21 NUMERIC,  b21 BLOB,  v21 TEXT COLLATE NOCASE,
   i22 INT,  r22 REAL,  n22 NUMERIC,  b22 BLOB,  v22 TEXT COLLATE NOCASE,
   i23 INT,  r23 REAL,  n23 NUMERIC,  b23 BLOB,  v23 TEXT COLLATE NOCASE,
   i24 INT,  r24 REAL,  n24 NUMERIC,  b24 BLOB,  v24 TEXT COLLATE NOCASE,
   i25 INT,  r25 REAL,  n25 NUMERIC,  b25 BLOB,  v25 TEXT COLLATE BINARY,
   i26 INT,  r26 REAL,  n26 NUMERIC,  b26 BLOB,  v26 TEXT COLLATE BINARY,
   i27 INT,  r27 REAL,  n27 NUMERIC,  b27 BLOB,  v27 TEXT COLLATE BINARY,
   i28 INT,  r28 REAL,  n28 NUMERIC,  b28 BLOB,  v28 TEXT COLLATE BINARY,
   i29 INT,  r29 REAL,  n29 NUMERIC,  b29 BLOB,  v29 TEXT COLLATE BINARY
) WITHOUT ROWID;

-- Populate the Test tables with Randomized data
WITH R(vi, vr, vt, vn, vb) AS (
     SELECT rnd.ri,
        rnd.rr,
        substr(p.rndText,1,rnd.rl),
        date('1901-01-01','+'||CAST((rnd.ri % 5000) AS TEXT)||' days'),
        randomblob(rnd.rl)
       FROM rnd,p
)
INSERT OR IGNORE INTO t1(id,
   i00, r00, n00, b00, v00,
   i01, r01, n01, b01, v01,
   i02, r02, n02, b02, v02,
   i03, r03, n03, b03, v03,
   i04, r04, n04, b04, v04,
   i05, r05, n05, b05, v05,
   i06, r06, n06, b06, v06,
   i07, r07, n07, b07, v07,
   i08, r08, n08, b08, v08,
   i09, r09, n09, b09, v09,
   i10, r10, n10, b10, v10,
   i11, r11, n11, b11, v11,
   i12, r12, n12, b12, v12,
   i13, r13, n13, b13, v13,
   i14, r14, n14, b14, v14,
   i15, r15, n15, b15, v15,
   i16, r16, n16, b16, v16,
   i17, r17, n17, b17, v17,
   i18, r18, n18, b18, v18,
   i19, r19, n19, b19, v19,
   i20, r20, n20, b20, v20,
   i21, r21, n21, b21, v21,
   i22, r22, n22, b22, v22,
   i23, r23, n23, b23, v23,
   i24, r24, n24, b24, v24,
   i25, r25, n25, b25, v25,
   i26, r26, n26, b26, v26,
   i27, r27, n27, b27, v27,
   i28, r28, n28, b28, v28,
   i29, r29, n29, b29, v29
) SELECT vi,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt
   FROM R,p
  WHERE ((vi / 100) % p.rndAddFreq) = 1
;

WITH R(vi, vr, vt, vn, vb) AS (
     SELECT rnd.ri,
        rnd.rr,
        substr(p.rndText,1,rnd.rl),
        date('1901-01-01','+'||CAST((rnd.ri % 5000) AS TEXT)||' days'),
        randomblob(rnd.rl)
       FROM rnd,p
)
INSERT OR IGNORE INTO t2(id,
   i00, r00, n00, b00, v00,
   i01, r01, n01, b01, v01,
   i02, r02, n02, b02, v02,
   i03, r03, n03, b03, v03,
   i04, r04, n04, b04, v04,
   i05, r05, n05, b05, v05,
   i06, r06, n06, b06, v06,
   i07, r07, n07, b07, v07,
   i08, r08, n08, b08, v08,
   i09, r09, n09, b09, v09,
   i10, r10, n10, b10, v10,
   i11, r11, n11, b11, v11,
   i12, r12, n12, b12, v12,
   i13, r13, n13, b13, v13,
   i14, r14, n14, b14, v14,
   i15, r15, n15, b15, v15,
   i16, r16, n16, b16, v16,
   i17, r17, n17, b17, v17,
   i18, r18, n18, b18, v18,
   i19, r19, n19, b19, v19,
   i20, r20, n20, b20, v20,
   i21, r21, n21, b21, v21,
   i22, r22, n22, b22, v22,
   i23, r23, n23, b23, v23,
   i24, r24, n24, b24, v24,
   i25, r25, n25, b25, v25,
   i26, r26, n26, b26, v26,
   i27, r27, n27, b27, v27,
   i28, r28, n28, b28, v28,
   i29, r29, n29, b29, v29
) SELECT vi,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt,
   vi, vr, vn, vb, vt
   FROM R,p
  WHERE ((vi / 10) % p.rndAddFreq) = 1
;

-- Simply make the chosen parameters visible
SELECT * FROM p;


-- Test JOIN method  -- This is the baddie.
SELECT t1.i00, t1.r00, t1.n00, t1.v00
   FROM t1
   JOIN t2 ON t2.i00 = t1.i00 AND t2.r00 = t1.r00 AND t2.n00 = t1.n00
AND t2.v00 = t1.v00
;

-- Test INTERSECT method
SELECT t1.i00, t1.r00, t1.n00, t1.v00
   FROM t1
INTERSECT
SELECT t2.i00, t2.r00, t2.n00, t2.v00
   FROM t2
;

-- Test WHERE IN (...) method
SELECT t1.i00, t1.r00, t1.n00, t1.v00
   FROM t1
WHERE t1.i00 IN (SELECT t2.i00 FROM t2)
;

-- Test JOIN with ORDER BY
SELECT t1.i00, t1.r00, t1.n00, t1.v00
   FROM t1
   JOIN t2 ON t2.i00 = t1.i00 AND t2.r00 = t1.r00 AND t2.n00 = t1.n00
AND t2.v00 = t1.v00
  ORDER BY t1.i00, t1.r00, t1.n00, t1.v00
;

-- Test INTERSECT with ORDER BY
SELECT t1.i00, t1.r00, t1.n00, t1.v00
   FROM t1
INTERSECT
SELECT t2.i00, t2.r00, t2.n00, t2.v00
   FROM t2
  ORDER BY t1.i00, t1.r00, t1.n00, t1.v00
;

-- Test WHERE IN (...) with ORDER BY
SELECT t1.i00, t1.r00, t1.n00, t1.v00
   FROM t1
WHERE t1.i00 IN (SELECT t2.i00 FROM t2)
  ORDER BY t1.i00, t1.r00, t1.n00, t1.v00
;

-- Display simply the COUNTs from our random and test tables and the
matching
-- values, to ensure the test is valid.
WITH RowCounts(Entity, RowCnt) AS (
     SELECT 'Randomizer', COUNT(*) FROM rnd
     UNION ALL
     SELECT 'Test Table t1', COUNT(*) FROM t1
     UNION ALL
     SELECT 'Test Table t2', COUNT(*) FROM t2
     UNION ALL
     SELECT 't1 : t2 Matches', COUNT(*) FROM t2 WHERE t2.i00 IN (SELECT
t1.i00 FROM t1)
)
SELECT * FROM RowCounts
;

-- CleanUp
DROP TABLE rnd;
DROP TABLE t1;
DROP TABLE t2;
DROP TABLE p;



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JOIN vs. INTERSECT vs. WHERE IN (...) - speed and efficiency differences

Nico Williams
On Sat, Sep 09, 2017 at 05:56:03PM +0200, R Smith wrote:

> *Findings* pertaining to JOIN, INTERSECT and WHERE IN (...) type queries on
> tables *WITHOUT Row_ids*:
> (This is the full test posted below because it is the one that matters most)
> INTERSECT AND WHERE IN (...) queries posted similar times here with WHERE IN
> (...) being the slightly faster (similar to the above findings), but BOTH
> were a good 10 - 15% faster than on the ROW-ID tables, so WITHOUT ROWID
> tables seem to have a definite advantage here (it is currently unclear to me
> why this is).
>
> A troubling test is the JOIN on WITHOUT ROWID tables - it took several
> orders of magnitude longer than any other test in the entire experiment.

In your first test you were ordering by PK, now you're not, and you
don't have an [covering] index on the columns you're ordering by, so,
yeah, "orders of magnitude" slower is to be expected.  You're comparing
apples and oranges.

Also, in your first test you have one column.  Now you have lots.
That's no way to isolate the performance numbers you're trying to get.

The only way to have an ORDER BY speed up a query is when there's an
index on the columns in question that can be scanned to produce the
order you're looking for without additional sorting, or when the engine
can create such an index as a temporary index (this, of course, has a
cost, so it won't always work well)

Nico
--
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
12