"DISTINCT" makes a query take 37 times as long

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

"DISTINCT" makes a query take 37 times as long

Kevin O'Gorman
I have a database of positions and moves in a strategic game, and I'm
searching for unsolved positions that have been connected to an immediate
ancestor.  I'm using Python 3.5.2, and the code looks like

#!/usr/bin/env python3
"""Output positions that are reachable but unsolved at census 18 or greater
See page 76 of Qubic log

Last Modified: Tue Jan 31 12:13:07 PST 2017
"""

import sqlite3          # https://docs.python.org/3.5/library/sqlite3.html

with sqlite3.connect("917.db") as conn:
    for row in conn.execute("""
                SELECT DISTINCT ppos
                FROM move JOIN pos ON mto = pnum
                WHERE pcensus = 18 and pmin < pmax
            """):
        print(row[0])

As written here, this query runs for 1193 minutes (just short of 20
hours).  If I remove the "DISTINCT" and instead pipe the result into the
sort program that comes with Linux "sort --unique" the query and sort takes
only 31 minutes.  The results are the same, and consist of 4.2 million rows.

This seems extreme.

--
word of the year: *kakistocracy*
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: "DISTINCT" makes a query take 37 times as long

Richard Hipp-3
On 2/1/17, Kevin O'Gorman <[hidden email]> wrote:
> I have a database of positions and moves in a strategic game, and I'm
> searching for unsolved positions that have been connected to an immediate
> ancestor.  I'm using Python 3.5.2, and the code looks like

Please provide us with the following additional information:

(1) In python, run the query: "SELECT sqlite_version(), sqlite_source_id();"

(2) In a recent sqlite3 command-line shell (the latest release, not
whatever 5-year-old release happens to be installed on your system)
bring up your database and run the command:

     .fullschema --indent

And send in the output.

(3) Download the bundle of command-line tools for your OS, then run
the command "sqlite3_analyzer" on your database, and send in the
output.

Thanks.


>
> #!/usr/bin/env python3
> """Output positions that are reachable but unsolved at census 18 or greater
> See page 76 of Qubic log
>
> Last Modified: Tue Jan 31 12:13:07 PST 2017
> """
>
> import sqlite3          # https://docs.python.org/3.5/library/sqlite3.html
>
> with sqlite3.connect("917.db") as conn:
>     for row in conn.execute("""
>                 SELECT DISTINCT ppos
>                 FROM move JOIN pos ON mto = pnum
>                 WHERE pcensus = 18 and pmin < pmax
>             """):
>         print(row[0])
>
> As written here, this query runs for 1193 minutes (just short of 20
> hours).  If I remove the "DISTINCT" and instead pipe the result into the
> sort program that comes with Linux "sort --unique" the query and sort takes
> only 31 minutes.  The results are the same, and consist of 4.2 million rows.
>
> This seems extreme.
>
> --
> word of the year: *kakistocracy*
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>


--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: "DISTINCT" makes a query take 37 times as long

Kevin O'Gorman
On Wed, Feb 1, 2017 at 6:35 PM, Richard Hipp <[hidden email]> wrote:

> On 2/1/17, Kevin O'Gorman <[hidden email]> wrote:
> > I have a database of positions and moves in a strategic game, and I'm
> > searching for unsolved positions that have been connected to an immediate
> > ancestor.  I'm using Python 3.5.2, and the code looks like
>
> Please provide us with the following additional information:
>
> (1) In python, run the query: "SELECT sqlite_version(),
> sqlite_source_id();"
>
> (2) In a recent sqlite3 command-line shell (the latest release, not
> whatever 5-year-old release happens to be installed on your system)
> bring up your database and run the command:
>
>      .fullschema --indent
>
> And send in the output.
>
> (3) Download the bundle of command-line tools for your OS, then run
> the command "sqlite3_analyzer" on your database, and send in the
> output.
>
> Thanks.
>
>
>
I am unable to comply with items 2 and 3.  I can download the linux x86
versions, which I expected would run on my x86-64 system, but they don't.
Instead, even when I point right at them, they report "No such file or
directory".  I take this to mean that there is some file they do not find,
like a library, and they report the error code in their return status.

However, my "recent" software reports:
 SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
sqlite> .fullschema --indent
Usage: .fullschema
sqlite> .fullschema
CREATE TABLE base64 (
            b64char CHAR NOT NULL PRIMARY KEY,
            b64val  INTEGER);
CREATE TABLE pos (
            pnum INTEGER PRIMARY KEY AUTOINCREMENT,
            ppos CHAR(64) NOT NULL,
            pcensus INTEGER NOT NULL,
            pscore INTEGER,
            pstate CHAR DEFAULT "N" NOT NULL,
            pmin INTEGER DEFAULT -99 NOT NULL,
            pmax INTEGER DEFAULT 99 NOT NULL,
            pmain CHAR(64));
CREATE UNIQUE INDEX pipos ON pos (ppos);
CREATE TABLE move (
            mfrom INTEGER NOT NULL,
            mto   INTEGER NOT NULL,
            mtype CHAR NOT NULL,
            mcell INTEGER NOT NULL,
            mvalue INTEGER,
            ma INTEGER DEFAULT -99,
            mb INTEGER DEFAULT 99,
            PRIMARY KEY (mfrom, mto, mcell));
CREATE UNIQUE INDEX mrev ON move (mto, mfrom, mcell);
CREATE TABLE expanded (
            census INTEGER NOT NULL,
            number INTEGER NOT NULL,
            pos CHAR(64),
            PRIMARY KEY (census, number));
ANALYZE sqlite_master;
ANALYZE sqlite_master;
INSERT INTO sqlite_stat1 VALUES('move','mrev','48329866 2 2 1');
INSERT INTO sqlite_stat1 VALUES('move','sqlite_autoindex_move_1','48329866
38 2 1');
INSERT INTO sqlite_stat1 VALUES('pos','pipos','74409802 1');
INSERT INTO sqlite_stat1 VALUES('base64','sqlite_autoindex_base64_1','64
1');
ANALYZE sqlite_master;
sqlite>

The analyzer is not included in my distribution or its repositiories, as
far as I can tell.  This is Xubuntu, which is a flavor of Ubuntu, which is
derived from Debian.

I'm not sure I want to build your entire software suite.  Perhaps you'd
care to download my database, which I freshly tar-ed and gzip-ed to
http://kosmanor.com/917/917.db.tgz
the databse is 21 GB; the tar is 3.1 GB

> >
> > #!/usr/bin/env python3
> > """Output positions that are reachable but unsolved at census 18 or
> greater
> > See page 76 of Qubic log
> >
> > Last Modified: Tue Jan 31 12:13:07 PST 2017
> > """
> >
> > import sqlite3          # https://docs.python.org/3.5/
> library/sqlite3.html
> >
> > with sqlite3.connect("917.db") as conn:
> >     for row in conn.execute("""
> >                 SELECT DISTINCT ppos
> >                 FROM move JOIN pos ON mto = pnum
> >                 WHERE pcensus = 18 and pmin < pmax
> >             """):
> >         print(row[0])
> >
> > As written here, this query runs for 1193 minutes (just short of 20
> > hours).  If I remove the "DISTINCT" and instead pipe the result into the
> > sort program that comes with Linux "sort --unique" the query and sort
> takes
> > only 31 minutes.  The results are the same, and consist of 4.2 million
> rows.
> >
> > This seems extreme.
> >
> > --
> > word of the year: *kakistocracy*
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
>
>
> --
> D. Richard Hipp
> [hidden email]
>



--
word of the year: *kakistocracy*
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: "DISTINCT" makes a query take 37 times as long

Hick Gunter
In reply to this post by Kevin O'Gorman
DISTINCT forces the query optimizer to create an intermediate table to hold the results and compare each row of the non-distinct result set with an automatically created index. It may also affect the query plan in a way that chooses inefficient indices, which is more likely if you have not run ANALYZE on the fully loaded database.

Using a 3 stage pipe instead you additionally have more CPUs (1 running the query, 1 or more sorting the results) working in paralell.

Try EXPLAIN QUERY PLAN to see what the query planner is doing.

-----Ursprüngliche Nachricht-----
Von: sqlite-users [mailto:[hidden email]] Im Auftrag von Kevin O'Gorman
Gesendet: Donnerstag, 02. Februar 2017 03:28
An: sqlite-users <[hidden email]>
Betreff: [sqlite] "DISTINCT" makes a query take 37 times as long

I have a database of positions and moves in a strategic game, and I'm searching for unsolved positions that have been connected to an immediate ancestor.  I'm using Python 3.5.2, and the code looks like

#!/usr/bin/env python3
"""Output positions that are reachable but unsolved at census 18 or greater See page 76 of Qubic log

Last Modified: Tue Jan 31 12:13:07 PST 2017 """

import sqlite3          # https://docs.python.org/3.5/library/sqlite3.html

with sqlite3.connect("917.db") as conn:
    for row in conn.execute("""
                SELECT DISTINCT ppos
                FROM move JOIN pos ON mto = pnum
                WHERE pcensus = 18 and pmin < pmax
            """):
        print(row[0])

As written here, this query runs for 1193 minutes (just short of 20 hours).  If I remove the "DISTINCT" and instead pipe the result into the sort program that comes with Linux "sort --unique" the query and sort takes only 31 minutes.  The results are the same, and consist of 4.2 million rows.

This seems extreme.

--
word of the year: *kakistocracy*
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___________________________________________
 Gunter Hick
Software Engineer
Scientific Games International GmbH
FN 157284 a, HG Wien
Klitschgasse 2-4, A-1130 Vienna, Austria
Tel: +43 1 80100 0
E-Mail: [hidden email]

This communication (including any attachments) is intended for the use of the intended recipient(s) only and may contain information that is confidential, privileged or legally protected. Any unauthorized use or dissemination of this communication is strictly prohibited. If you have received this communication in error, please immediately notify the sender by return e-mail message and delete all copies of the original communication. Thank you for your cooperation.


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: "DISTINCT" makes a query take 37 times as long

Richard Hipp-3
On 2/2/17, Hick Gunter <[hidden email]> wrote:
> DISTINCT forces the query optimizer to create an intermediate table to hold
> the results and compare each row of the non-distinct result set with an
> automatically created index. It may also affect the query plan in a way that
> chooses inefficient indices, which is more likely if you have not run
> ANALYZE on the fully loaded database.

Creating an intermediate table is one way in which DISTINCT might be
implemented.  The query planner might also try to force the output
into sorted order, so that it can eliminate duplicates simply by
comparing against the previous output.  The second mechanism can cause
the use of inefficient indexes, if they are present and un-ANALYZED.
The query planner computes an estimated run-time for each of various
techniques it considers, and picks the one it thinks will run the
fastest.  Running ANALYZE helps the query planner to generate better
(more accurate) cost estimates.

I have not yet analyzed this situation sufficiently to tell what is going on.
--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: "DISTINCT" makes a query take 37 times as long

Kevin O'Gorman
In reply to this post by Hick Gunter
When I read this, it seemed like it made sense.  The thing is, it does not
match up with reality.

First, the analysis of what happens when I pipe the results to 'sort'
misses the fact that the sort process executes within the 31 minutes of
that version.  It would not make a dent in the time of the slow version.

But the big thing is that I took a look at EXPLAIN QUERY PLAN using this
script:
#!/usr/bin/env
python3

"""Output positions that are reachable but unsolved at census 18
See page 76 of Qubic log

Last Modified: Thu Feb  2 07:46:03 PST 2017
"""

import sqlite3          # https://docs.python.org/3.5/library/sqlite3.html

with sqlite3.connect("917.db") as conn:
    print("BEOFRE ANALYZE")
    for row in conn.execute("""
                EXPLAIN
                SELECT DISTINCT ppos
                FROM move JOIN pos ON mto = pnum
                WHERE pcensus = 18 and pmin < pmax
            """):
        print(row)
    print()
    print()
    conn.execute("ANALYZE")
    print("AFTER ANALYZE")
    for row in conn.execute("""
                EXPLAIN
                SELECT DISTINCT ppos
                FROM move JOIN pos ON mto = pnum
                WHERE pcensus = 18 and pmin < pmax
            """):
        print(row)

and after waiting most of the day for the analyze to finish, I got two
identical query plans,
neither of which I could decipher:
BEFORE
ANALYZE

(0, 'Init', 0, 25, 0, '', '00', None)
(1, 'Null', 1, 7, 0, '', '08', None)
(2, 'OpenRead', 1, 4, 0, '7', '00', None)
(3, 'OpenRead', 3, 6, 0, 'k(1,)', '00', None)
(4, 'OpenRead', 4, 11, 0, 'k(3,,,)', '02', None)
(5, 'Rewind', 3, 21, 1, '0', '00', None)
(6, 'Seek', 3, 0, 1, '', '00', None)
(7, 'Column', 1, 2, 1, '', '00', None)
(8, 'Ne', 2, 20, 1, '(BINARY)', '54', None)
(9, 'Column', 1, 5, 3, '-99', '00', None)
(10, 'Column', 1, 6, 4, '99', '00', None)
(11, 'Ge', 4, 20, 3, '(BINARY)', '53', None)
(12, 'IdxRowid', 3, 5, 0, '', '00', None)
(13, 'SeekGE', 4, 20, 5, '1', '00', None)
(14, 'IdxGT', 4, 20, 5, '1', '00', None)
(15, 'Column', 3, 0, 6, '', '00', None)
(16, 'Eq', 6, 19, 7, '(BINARY)', '80', None)
(17, 'Copy', 6, 7, 0, '', '00', None)
(18, 'ResultRow', 6, 1, 0, '', '00', None)
(19, 'Next', 4, 14, 0, '', '00', None)
(20, 'Next', 3, 6, 0, '', '01', None)
(21, 'Close', 1, 0, 0, '', '00', None)
(22, 'Close', 3, 0, 0, '', '00', None)
(23, 'Close', 4, 0, 0, '', '00', None)
(24, 'Halt', 0, 0, 0, '', '00', None)
(25, 'Transaction', 0, 0, 155, '0', '01', None)
(26, 'TableLock', 0, 4, 0, 'pos', '00', None)
(27, 'TableLock', 0, 7, 0, 'move', '00', None)
(28, 'Integer', 18, 2, 0, '', '00', None)
(29, 'Goto', 0, 1, 0, '', '00', None)


AFTER ANALYZE
(0, 'Init', 0, 25, 0, '', '00', None)
(1, 'Null', 1, 7, 0, '', '08', None)
(2, 'OpenRead', 1, 4, 0, '7', '00', None)
(3, 'OpenRead', 3, 6, 0, 'k(1,)', '00', None)
(4, 'OpenRead', 4, 11, 0, 'k(3,,,)', '02', None)
(5, 'Rewind', 3, 21, 1, '0', '00', None)
(6, 'Seek', 3, 0, 1, '', '00', None)
(7, 'Column', 1, 2, 1, '', '00', None)
(8, 'Ne', 2, 20, 1, '(BINARY)', '54', None)
(9, 'Column', 1, 5, 3, '-99', '00', None)
(10, 'Column', 1, 6, 4, '99', '00', None)
(11, 'Ge', 4, 20, 3, '(BINARY)', '53', None)
(12, 'IdxRowid', 3, 5, 0, '', '00', None)
(13, 'SeekGE', 4, 20, 5, '1', '00', None)
(14, 'IdxGT', 4, 20, 5, '1', '00', None)
(15, 'Column', 3, 0, 6, '', '00', None)
(16, 'Eq', 6, 19, 7, '(BINARY)', '80', None)
(17, 'Copy', 6, 7, 0, '', '00', None)
(18, 'ResultRow', 6, 1, 0, '', '00', None)
(19, 'Next', 4, 14, 0, '', '00', None)
(20, 'Next', 3, 6, 0, '', '01', None)
(21, 'Close', 1, 0, 0, '', '00', None)
(22, 'Close', 3, 0, 0, '', '00', None)
(23, 'Close', 4, 0, 0, '', '00', None)
(24, 'Halt', 0, 0, 0, '', '00', None)
(25, 'Transaction', 0, 0, 155, '0', '01', None)
(26, 'TableLock', 0, 4, 0, 'pos', '00', None)
(27, 'TableLock', 0, 7, 0, 'move', '00', None)
(28, 'Integer', 18, 2, 0, '', '00', None)
(29, 'Goto', 0, 1, 0, '', '00', None)
Maybe somebody can explain them to me, but it doesn't really matter whether
I ever understand them.
Perhaps Mr. Hipp can make use of them.

Absent some flaw in the above script, I think I'm done with this.  I have a
solution that works for me,
and I'd just as soon get back to my real task.  I just wanted to give
feedback in case it would be useful.
That's how i say thanks for a really useful product.

Thanks. Mr. Hipp, and anyone else that has contributed to this product.

++ kevin


On Thu, Feb 2, 2017 at 12:27 AM, Hick Gunter <[hidden email]> wrote:

> DISTINCT forces the query optimizer to create an intermediate table to
> hold the results and compare each row of the non-distinct result set with
> an automatically created index. It may also affect the query plan in a way
> that chooses inefficient indices, which is more likely if you have not run
> ANALYZE on the fully loaded database.
>
> Using a 3 stage pipe instead you additionally have more CPUs (1 running
> the query, 1 or more sorting the results) working in paralell.
>
> Try EXPLAIN QUERY PLAN to see what the query planner is doing.
>
> -----Ursprüngliche Nachricht-----
> Von: sqlite-users [mailto:[hidden email]]
> Im Auftrag von Kevin O'Gorman
> Gesendet: Donnerstag, 02. Februar 2017 03:28
> An: sqlite-users <[hidden email]>
> Betreff: [sqlite] "DISTINCT" makes a query take 37 times as long
>
> I have a database of positions and moves in a strategic game, and I'm
> searching for unsolved positions that have been connected to an immediate
> ancestor.  I'm using Python 3.5.2, and the code looks like
>
> #!/usr/bin/env python3
> """Output positions that are reachable but unsolved at census 18 or
> greater See page 76 of Qubic log
>
> Last Modified: Tue Jan 31 12:13:07 PST 2017 """
>
> import sqlite3          # https://docs.python.org/3.5/library/sqlite3.html
>
> with sqlite3.connect("917.db") as conn:
>     for row in conn.execute("""
>                 SELECT DISTINCT ppos
>                 FROM move JOIN pos ON mto = pnum
>                 WHERE pcensus = 18 and pmin < pmax
>             """):
>         print(row[0])
>
> As written here, this query runs for 1193 minutes (just short of 20
> hours).  If I remove the "DISTINCT" and instead pipe the result into the
> sort program that comes with Linux "sort --unique" the query and sort takes
> only 31 minutes.  The results are the same, and consist of 4.2 million rows.
>
> This seems extreme.
>
> --
> word of the year: *kakistocracy*
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
>
> ___________________________________________
>  Gunter Hick
> Software Engineer
> Scientific Games International GmbH
> FN 157284 a, HG Wien
> Klitschgasse 2-4, A-1130 Vienna, Austria
> Tel: +43 1 80100 0
> E-Mail: [hidden email]
>
> This communication (including any attachments) is intended for the use of
> the intended recipient(s) only and may contain information that is
> confidential, privileged or legally protected. Any unauthorized use or
> dissemination of this communication is strictly prohibited. If you have
> received this communication in error, please immediately notify the sender
> by return e-mail message and delete all copies of the original
> communication. Thank you for your cooperation.
>
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>



--
word of the year: *kakistocracy*
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: "DISTINCT" makes a query take 37 times as long

Dominique Devienne
On Fri, Feb 3, 2017 at 12:27 AM, Kevin O'Gorman <[hidden email]>
wrote:

> But the big thing is that I took a look at EXPLAIN QUERY PLAN using this
> ...

Maybe somebody can explain them to me, but it doesn't really matter whether
> I ever understand them. Perhaps Mr. Hipp can make use of them.
>

EXPLAIN, and EXPLAIN QUERY PLAN have completely different output.
EXPLAIN is basically the SQLite "assembler" code (or its "bytecode" if you
prefer),
while EXPLAIN QUERY PLAN gives you a much more human readable high-level
overview of the plan.

The low-level plans do look identical, but please also share the high-level
plan, should take you only a minute. --DD
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: "DISTINCT" makes a query take 37 times as long

Hick Gunter
In reply to this post by Kevin O'Gorman
EXPLAIN dumps SQLite "machine code". Use EXPLAIN QUERY PLAN for the human readable version.

-----Ursprüngliche Nachricht-----
Von: sqlite-users [mailto:[hidden email]] Im Auftrag von Kevin O'Gorman
Gesendet: Freitag, 03. Februar 2017 00:27
An: SQLite mailing list <[hidden email]>
Betreff: Re: [sqlite] "DISTINCT" makes a query take 37 times as long

When I read this, it seemed like it made sense.  The thing is, it does not match up with reality.

First, the analysis of what happens when I pipe the results to 'sort'
misses the fact that the sort process executes within the 31 minutes of that version.  It would not make a dent in the time of the slow version.

But the big thing is that I took a look at EXPLAIN QUERY PLAN using this
script:
#!/usr/bin/env
python3

"""Output positions that are reachable but unsolved at census 18 See page 76 of Qubic log

Last Modified: Thu Feb  2 07:46:03 PST 2017 """

import sqlite3          # https://docs.python.org/3.5/library/sqlite3.html

with sqlite3.connect("917.db") as conn:
    print("BEOFRE ANALYZE")
    for row in conn.execute("""
                EXPLAIN
                SELECT DISTINCT ppos
                FROM move JOIN pos ON mto = pnum
                WHERE pcensus = 18 and pmin < pmax
            """):
        print(row)
    print()
    print()
    conn.execute("ANALYZE")
    print("AFTER ANALYZE")
    for row in conn.execute("""
                EXPLAIN
                SELECT DISTINCT ppos
                FROM move JOIN pos ON mto = pnum
                WHERE pcensus = 18 and pmin < pmax
            """):
        print(row)

and after waiting most of the day for the analyze to finish, I got two identical query plans, neither of which I could decipher:
BEFORE
ANALYZE

(0, 'Init', 0, 25, 0, '', '00', None)
(1, 'Null', 1, 7, 0, '', '08', None)
(2, 'OpenRead', 1, 4, 0, '7', '00', None) (3, 'OpenRead', 3, 6, 0, 'k(1,)', '00', None) (4, 'OpenRead', 4, 11, 0, 'k(3,,,)', '02', None) (5, 'Rewind', 3, 21, 1, '0', '00', None) (6, 'Seek', 3, 0, 1, '', '00', None) (7, 'Column', 1, 2, 1, '', '00', None) (8, 'Ne', 2, 20, 1, '(BINARY)', '54', None) (9, 'Column', 1, 5, 3, '-99', '00', None) (10, 'Column', 1, 6, 4, '99', '00', None) (11, 'Ge', 4, 20, 3, '(BINARY)', '53', None) (12, 'IdxRowid', 3, 5, 0, '', '00', None) (13, 'SeekGE', 4, 20, 5, '1', '00', None) (14, 'IdxGT', 4, 20, 5, '1', '00', None) (15, 'Column', 3, 0, 6, '', '00', None) (16, 'Eq', 6, 19, 7, '(BINARY)', '80', None) (17, 'Copy', 6, 7, 0, '', '00', None) (18, 'ResultRow', 6, 1, 0, '', '00', None) (19, 'Next', 4, 14, 0, '', '00', None) (20, 'Next', 3, 6, 0, '', '01', None) (21, 'Close', 1, 0, 0, '', '00', None) (22, 'Close', 3, 0, 0, '', '00', None) (23, 'Close', 4, 0, 0, '', '00', None) (24, 'Halt', 0, 0, 0, '', '00', None) (25, 'Transaction', 0, 0, 155, '0', '01', None) (26, 'TableLock', 0, 4, 0, 'pos', '00', None) (27, 'TableLock', 0, 7, 0, 'move', '00', None) (28, 'Integer', 18, 2, 0, '', '00', None) (29, 'Goto', 0, 1, 0, '', '00', None)


AFTER ANALYZE
(0, 'Init', 0, 25, 0, '', '00', None)
(1, 'Null', 1, 7, 0, '', '08', None)
(2, 'OpenRead', 1, 4, 0, '7', '00', None) (3, 'OpenRead', 3, 6, 0, 'k(1,)', '00', None) (4, 'OpenRead', 4, 11, 0, 'k(3,,,)', '02', None) (5, 'Rewind', 3, 21, 1, '0', '00', None) (6, 'Seek', 3, 0, 1, '', '00', None) (7, 'Column', 1, 2, 1, '', '00', None) (8, 'Ne', 2, 20, 1, '(BINARY)', '54', None) (9, 'Column', 1, 5, 3, '-99', '00', None) (10, 'Column', 1, 6, 4, '99', '00', None) (11, 'Ge', 4, 20, 3, '(BINARY)', '53', None) (12, 'IdxRowid', 3, 5, 0, '', '00', None) (13, 'SeekGE', 4, 20, 5, '1', '00', None) (14, 'IdxGT', 4, 20, 5, '1', '00', None) (15, 'Column', 3, 0, 6, '', '00', None) (16, 'Eq', 6, 19, 7, '(BINARY)', '80', None) (17, 'Copy', 6, 7, 0, '', '00', None) (18, 'ResultRow', 6, 1, 0, '', '00', None) (19, 'Next', 4, 14, 0, '', '00', None) (20, 'Next', 3, 6, 0, '', '01', None) (21, 'Close', 1, 0, 0, '', '00', None) (22, 'Close', 3, 0, 0, '', '00', None) (23, 'Close', 4, 0, 0, '', '00', None) (24, 'Halt', 0, 0, 0, '', '00', None) (25, 'Transaction', 0, 0, 155, '0', '01', None) (26, 'TableLock', 0, 4, 0, 'pos', '00', None) (27, 'TableLock', 0, 7, 0, 'move', '00', None) (28, 'Integer', 18, 2, 0, '', '00', None) (29, 'Goto', 0, 1, 0, '', '00', None) Maybe somebody can explain them to me, but it doesn't really matter whether I ever understand them.
Perhaps Mr. Hipp can make use of them.

Absent some flaw in the above script, I think I'm done with this.  I have a solution that works for me, and I'd just as soon get back to my real task.  I just wanted to give feedback in case it would be useful.
That's how i say thanks for a really useful product.

Thanks. Mr. Hipp, and anyone else that has contributed to this product.

++ kevin


On Thu, Feb 2, 2017 at 12:27 AM, Hick Gunter <[hidden email]> wrote:

> DISTINCT forces the query optimizer to create an intermediate table to
> hold the results and compare each row of the non-distinct result set
> with an automatically created index. It may also affect the query plan
> in a way that chooses inefficient indices, which is more likely if you
> have not run ANALYZE on the fully loaded database.
>
> Using a 3 stage pipe instead you additionally have more CPUs (1
> running the query, 1 or more sorting the results) working in paralell.
>
> Try EXPLAIN QUERY PLAN to see what the query planner is doing.
>
> -----Ursprüngliche Nachricht-----
> Von: sqlite-users
> [mailto:[hidden email]]
> Im Auftrag von Kevin O'Gorman
> Gesendet: Donnerstag, 02. Februar 2017 03:28
> An: sqlite-users <[hidden email]>
> Betreff: [sqlite] "DISTINCT" makes a query take 37 times as long
>
> I have a database of positions and moves in a strategic game, and I'm
> searching for unsolved positions that have been connected to an
> immediate ancestor.  I'm using Python 3.5.2, and the code looks like
>
> #!/usr/bin/env python3
> """Output positions that are reachable but unsolved at census 18 or
> greater See page 76 of Qubic log
>
> Last Modified: Tue Jan 31 12:13:07 PST 2017 """
>
> import sqlite3          # https://docs.python.org/3.5/library/sqlite3.html
>
> with sqlite3.connect("917.db") as conn:
>     for row in conn.execute("""
>                 SELECT DISTINCT ppos
>                 FROM move JOIN pos ON mto = pnum
>                 WHERE pcensus = 18 and pmin < pmax
>             """):
>         print(row[0])
>
> As written here, this query runs for 1193 minutes (just short of 20
> hours).  If I remove the "DISTINCT" and instead pipe the result into
> the sort program that comes with Linux "sort --unique" the query and
> sort takes only 31 minutes.  The results are the same, and consist of 4.2 million rows.
>
> This seems extreme.
>
> --
> word of the year: *kakistocracy*
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
>
> ___________________________________________
>  Gunter Hick
> Software Engineer
> Scientific Games International GmbH
> FN 157284 a, HG Wien
> Klitschgasse 2-4, A-1130 Vienna, Austria
> Tel: +43 1 80100 0
> E-Mail: [hidden email]
>
> This communication (including any attachments) is intended for the use
> of the intended recipient(s) only and may contain information that is
> confidential, privileged or legally protected. Any unauthorized use or
> dissemination of this communication is strictly prohibited. If you
> have received this communication in error, please immediately notify
> the sender by return e-mail message and delete all copies of the
> original communication. Thank you for your cooperation.
>
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>



--
word of the year: *kakistocracy*
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___________________________________________
 Gunter Hick
Software Engineer
Scientific Games International GmbH
FN 157284 a, HG Wien
Klitschgasse 2-4, A-1130 Vienna, Austria
Tel: +43 1 80100 0
E-Mail: [hidden email]

This communication (including any attachments) is intended for the use of the intended recipient(s) only and may contain information that is confidential, privileged or legally protected. Any unauthorized use or dissemination of this communication is strictly prohibited. If you have received this communication in error, please immediately notify the sender by return e-mail message and delete all copies of the original communication. Thank you for your cooperation.


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: "DISTINCT" makes a query take 37 times as long

David Raymond
In reply to this post by Dominique Devienne
Un-analyzed here's what I'm getting while looking at the db:

With distinct:

sqlite> explain query plan select distinct ppos from move join pos on mto = pnum where pcensus = 18 and pmin < pmax;
selectid|order|from|detail
0|0|1|SCAN TABLE pos USING INDEX pipos
0|1|0|SEARCH TABLE move USING COVERING INDEX mrev (mto=?)


Without distict:

sqlite> explain query plan select ppos from move join pos on mto = pnum where pcensus = 18 and pmin < pmax;
selectid|order|from|detail
0|0|1|SCAN TABLE pos
0|1|0|SEARCH TABLE move USING COVERING INDEX mrev (mto=?)


With distinct:

sqlite> explain select distinct ppos from move join pos on mto = pnum where pcensus = 18 and pmin < pmax;
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     22    0                    00  Start at 22
1     Null           1     7     0                    08  r[7]=NULL
2     OpenRead       1     4     0     7              00  root=4 iDb=0; pos
3     OpenRead       3     6     0     k(1,)          00  root=6 iDb=0; pipos
4     OpenRead       4     11    0     k(3,,,)        02  root=11 iDb=0; mrev
5     Rewind         3     21    1     0              00
6       Seek           3     0     1                    00  Move 1 to 3.rowid
7       Column         1     2     1                    00  r[1]=pos.pcensus
8       Ne             2     20    1     (BINARY)       54  if r[1]!=r[2] goto 20
9       Column         1     5     3     -99            00  r[3]=pos.pmin
10      Column         1     6     4     99             00  r[4]=pos.pmax
11      Ge             4     20    3     (BINARY)       53  if r[3]>=r[4] goto 20
12      IdxRowid       3     5     0                    00  r[5]=rowid
13      SeekGE         4     20    5     1              00  key=r[5]
14        IdxGT          4     20    5     1              00  key=r[5]
15        Column         3     0     6                    00  r[6]=pos.ppos
16        Eq             6     19    7     (BINARY)       80  if r[7]==r[6] goto 19
17        Copy           6     7     0                    00  r[7]=r[6]
18        ResultRow      6     1     0                    00  output=r[6]
19      Next           4     14    0                    00
20    Next           3     6     0                    01
21    Halt           0     0     0                    00
22    Transaction    0     0     155   0              01  usesStmtJournal=0
23    Integer        18    2     0                    00  r[2]=18
24    Goto           0     1     0                    00


Without distinct:

sqlite> explain select ppos from move join pos on mto = pnum where pcensus = 18 and pmin < pmax;
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     17    0                    00  Start at 17
1     OpenRead       1     4     0     7              00  root=4 iDb=0; pos
2     OpenRead       2     11    0     k(3,,,)        02  root=11 iDb=0; mrev
3     Rewind         1     16    0                    00
4       Column         1     2     1                    00  r[1]=pos.pcensus
5       Ne             2     15    1     (BINARY)       54  if r[1]!=r[2] goto 15
6       Column         1     5     3     -99            00  r[3]=pos.pmin
7       Column         1     6     4     99             00  r[4]=pos.pmax
8       Ge             4     15    3     (BINARY)       53  if r[3]>=r[4] goto 15
9       Rowid          1     5     0                    00  r[5]=rowid
10      SeekGE         2     15    5     1              00  key=r[5]
11        IdxGT          2     15    5     1              00  key=r[5]
12        Column         1     1     6                    00  r[6]=pos.ppos
13        ResultRow      6     1     0                    00  output=r[6]
14      Next           2     11    0                    00
15    Next           1     4     0                    01
16    Halt           0     0     0                    00
17    Transaction    0     0     155   0              01  usesStmtJournal=0
18    Integer        18    2     0                    00  r[2]=18
19    Goto           0     1     0                    00


Will run analyze and re-run those when I get time.

-----Original Message-----
From: sqlite-users [mailto:[hidden email]] On Behalf Of Dominique Devienne
Sent: Friday, February 03, 2017 3:55 AM
To: SQLite mailing list
Subject: Re: [sqlite] "DISTINCT" makes a query take 37 times as long

On Fri, Feb 3, 2017 at 12:27 AM, Kevin O'Gorman <[hidden email]>
wrote:

> But the big thing is that I took a look at EXPLAIN QUERY PLAN using this
> ...

Maybe somebody can explain them to me, but it doesn't really matter whether
> I ever understand them. Perhaps Mr. Hipp can make use of them.
>

EXPLAIN, and EXPLAIN QUERY PLAN have completely different output.
EXPLAIN is basically the SQLite "assembler" code (or its "bytecode" if you
prefer),
while EXPLAIN QUERY PLAN gives you a much more human readable high-level
overview of the plan.

The low-level plans do look identical, but please also share the high-level
plan, should take you only a minute. --DD
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users