Simple read-only program very slow


Simple read-only program very slow

Kevin O'Gorman
I'm pretty new at SQLite, so this may seem obvious to you.  Be kind.
I'm using Python on Ubuntu Linux 16.04 LTS, and the sqlite that is built
into Python.  The database is using WAL.

I've got a database of some 100 million records, and a file of just over
300 thousand that I want represented in it.  I wanted to check how much
difference it was going to make, so I wrote a super simple program to read
the file and count how many records are already there.  I got impatient
waiting for it so I killed the process and added an output of one dot (".")
per 1000 records.  It went very fast for what I estimate was around 200
dots and hit a wall.  It made progress, but very very slowly.

So I killed it again and added a commit() call every time it output a dot.
It didn't hit a wall, just some rough road (that is, it slowed down at
about the same spot but not nearly so drastically).

The code makes no changes to the database at all.  Why does commit() make a
difference?  What else should I learn from this?

The field being used for the lookup has an index.

++ kevin

Code follows:
#!/usr/bin/env python3
"""Count the number of records that represent rows in the database 'pos' table.
The database is not modified.

 Last Modified: Sat Nov 25 18:56:49 PST 2017
"""

import os.path          # https://docs.python.org/3.5/library/os.path.html
import sys              # https://docs.python.org/3.5/library/sys.html
import argparse         # https://docs.python.org/3.5/library/argparse.html
import sqlite3          # https://docs.python.org/3.5/library/sqlite3.html
import re               # https://docs.python.org/3.5/library/re.html

# from /usr/local/lib/python3.5/dist-packages
import qcreate
from qerror import *
import myparser

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="""A program to read
            positions and count how many are in the database""",)
    parser.add_argument("--dbname", default=None,
            help="name of the database to work on (overrides qubic.ini file)")
    parser.add_argument("file", nargs='?', type=argparse.FileType('r'),
            default=sys.stdin,
            help="file containing the qsearch results (default stdin)")
    args = parser.parse_args()
    infile = args.file

    if args.dbname is None:
        here = os.path.split(os.path.realpath('.'))[1]
        for confdir in ".", os.path.join(os.environ["HOME"], ".config"), "/etc/xdg/":
            f = os.path.join(confdir, "qubic.ini")
            if os.path.exists(f):
                args.dbname = myparser.parse(f, here, "dbname")
                if args.dbname is not None:
                    break
        if args.dbname is None:
            print(" *** ERROR: no database name provided and none found in qubic.ini files")
            sys.exit(1)

    present = missing = lines = 0
    with sqlite3.connect(args.dbname) as conn:
        for line in infile:
            fields = line.split()
            pos = fields[0]
            if len(pos) != 64:  # Important test to catch grep without --no-filename
                raise InputError(" ERROR: input line has wrong-sized position: " + line)

            pnum = None
            for row in conn.execute("""
                        SELECT pnum
                        FROM pos
                        WHERE ppos=?
                    """, (pos,)):
                pnum = row[0]
                break
            if pnum is None:
                missing += 1
            else:
                present += 1
            lines += 1
            if lines % 1000 == 0:
                print(".", flush=True, end="")
                conn.commit()
    print("there were", present, "records on file and", missing, "were missing")
    print("out of a total of", lines, "records.")



--
word of the year: *kakistocracy*

Re: Simple read-only program very slow

Clemens Ladisch
Kevin O'Gorman wrote:
> I wrote a super simple program to read the file and count how many
> records are already there.  I got impatient waiting for it so I killed
> the process and added an output of one dot (".") per 1000 records.  It
> went very fast for what I estimate was around 200 dots and hit a wall.
> It made progress, but very very slowly.  [...]
> The field being used for the lookup has an index.

I'd guess that most records are found, and that the file and the table
happen to be somewhat sorted.  The search becomes slow when the amount
of data that needs to be read exceeds the available memory.

> Why does commit() make a difference?

Hmmm ... interesting.

>             for row in conn.execute("""
>                         SELECT pnum
>                         FROM pos
>                         WHERE ppos=?
>                     """,(pos,)):
>                 pnum = row[0]
>                 break
>             if pnum is None:
>                 missing += 1
>             else:
>                 present += 1

Even with the index on ppos, the DB still has to look up the table row
to read the pnum value.

You do not care about the actual pnum value, so you could replace it
with a constant value ("SELECT 1 FROM ...").  Or just use EXISTS to
show what you actually want to do:

        cursor = conn.execute("""
                SELECT EXISTS (
                        SELECT *
                        FROM pos
                        WHERE ppos = ?)
                """, (pos,))
        exists = cursor.fetchone()[0]
        if exists:
            present += 1
        else:
            missing += 1
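
A minimal sketch of the "SELECT 1" variant mentioned above (same conn and
pos as in your program; LIMIT 1 stops the search at the first hit):

        row = conn.execute("SELECT 1 FROM pos WHERE ppos = ? LIMIT 1",
                           (pos,)).fetchone()
        if row is None:
            missing += 1
        else:
            present += 1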


Regards,
Clemens

Re: Simple read-only program very slow

Simon Slavin-3
In reply to this post by Kevin O'Gorman


On 26 Nov 2017, at 3:13am, Kevin O'Gorman <[hidden email]> wrote:
>
> I've got a database of some 100 million records, and a file of just over
> 300 thousand that I want represented in it.  I wanted to check how much
> difference it was going to make, so I wrote a super
> simple program to read the file and count how many records are already
> there.

You can use COUNT(*) to find out how many rows there are in a table.  SQL is optimized to handle this faster than reading individual row data.

SELECT COUNT(*) FROM pos
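
For example, from Python (a one-line sketch using the same connection as
your program):

    total = conn.execute("SELECT COUNT(*) FROM pos").fetchone()[0]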

Simon.

Re: Simple read-only program very slow

Kevin O'Gorman
On Sun, Nov 26, 2017 at 1:39 AM, Simon Slavin <[hidden email]> wrote:

> On 26 Nov 2017, at 3:13am, Kevin O'Gorman <[hidden email]> wrote:
> > [...]
>
> You can use COUNT(*) to find out how many rows there are in a table.  SQL
> is optimized to handle this faster than reading individual row data.
>
> SELECT COUNT(*) FROM pos
>
> Simon.
>

I know, but that does not help much when I'm trying to match the database
against a file, as I am here.

--
Dictionary.com's word of the year: *complicit*

Re: Simple read-only program very slow

Kevin O'Gorman
In reply to this post by Clemens Ladisch
On Sun, Nov 26, 2017 at 12:02 AM, Clemens Ladisch <[hidden email]>
wrote:

> Kevin O'Gorman wrote:
> > I wrote a super simple program to read the file and count how many
> > records are already there.  I got impatient waiting for it so I killed
> > the process and added an output of one dot (".") per 1000 records.  It
> > went very fast for what I estimate was around 200 dots and hit a wall.
> > It made progress, but very very slowly.  [...]
> > The field being used for the lookup has an index.
>
> I'd guess that most records are found, and that the file and the table
> happen to be somewhat sorted.  The search becomes slow when the amount
> of data that needs to be read exceeds the available memory.
>
> > Why does commit() make a difference?
>
> Hmmm ... interesting.
>
> [...]


That's real interesting and I'll keep it in mind for the future, but note
that my actual code DOES care about the contents of pmain, so it has to
look in the database anyway.  I'm still left wondering why it ran so slowly
and why a commit() helped a read-only program running alone on the machine.

I'm gonna try this on my Xeon with 256 GB of RAM to check out the idea that
it was running out of space.  It is true that the database is 50 GB, so of
course it's bigger than the RAM on the usual desktop.


--
Dictionary.com's word of the year: *complicit*

Re: Simple read-only program very slow

Keith Medcalf
In reply to this post by Kevin O'Gorman

Is there an index on pos where ppos is the left-most field (or the only field) in the index?
What is the column affinity of ppos?  Of the field you are passing as a parameter?
Is ppos unique?

If you CREATE [UNIQUE] INDEX searchindex ON pos (ppos, pnum);

then your query can be satisfied using only the searchindex covering index.

If there is no index on ppos, then you will be wasting time scanning the whole table for each query.

You will probably need to increase the cache size beyond the paltry default in order for the entire btree structures to be cached in RAM -- you probably want to make it as big as you can.
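
For instance, a minimal sketch of raising the cache from Python (the
negative cache_size form, meaning KiB, is standard SQLite; the 1 GiB figure
and the database filename are only illustrations):

    import sqlite3
    conn = sqlite3.connect("qubic.db")            # hypothetical filename
    conn.execute("PRAGMA cache_size = -1048576")  # negative = KiB, so ~1 GiB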

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.


>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>[hidden email]] On Behalf Of Kevin O'Gorman
>Sent: Saturday, 25 November, 2017 20:14
>To: sqlite-users
>Subject: [sqlite] Simple read-only program very slow
>
>[...]

Re: Simple read-only program very slow

Richard Rousselot
Had a similar issue a few years ago; we were using the SQLite3.exe.  We
recompiled the exe as 64-bit, which allowed it to use more than 4 GB of RAM,
and loaded the machine with as much memory as we could.  Voila, super fast
processing.

Can the Python libraries be made 64-bit somehow?


Re: Simple read-only program very slow

Keith Medcalf

On Sunday, 3 December, 2017 08:24, Richard Rousselot <[hidden email]> wrote:

>Had a similar issue a few years ago; we were using the SQLite3.exe.  We
>recompiled the exe as 64-bit, which allowed it to use more than 4 GB
>of RAM, and loaded the machine with as much memory as we could.  Voila,
>super fast processing.

>Can the Python libraries be made 64-bit somehow?

Yes.  You must be using a 64-bit version of Python; the procedure to replace the sqlite3.dll / sqlite3.so it uses is the same as for the 32-bit version, and compiling and using a 64-bit version of the apsw extension is likewise unchanged.

Neither Windows nor Linux can thunk a dynamic load module such that the one used is a different model than the running process (it was proprietary IBM technology that no one else seems smart enough to duplicate), so you have to update Python to the 64-bit model as well.

On Windows 10 16299.98 I have both a 32-bit (Python 2.7.14) and a 64-bit (Python 3.6.4) installed, and build 32-bit DLLs for the former and 64-bit for the latter from the same source (just selecting -m32 or -m64 as appropriate).  I use the MinGW64/GCC compiler because (a) it can compile in either model depending on the switch you use, without requiring any code changes; (b) it supports long long and long double in 32-bit; and (c) it does not require the use of the Microsoft C Runtime "moving target" libraries -- it can compile to the subsystem runtime (MSVCRT) that has been stable since, oh, the first OS/2 New Technology (which later became Windows NT) way back when.  Oh, and MinGW/GCC does "true" position-independent code, and when you do a static link of a module to either an executable or a dynamic load library, it is truly static with no strange external dependencies.

Since SQLite3 is heavily I/O bound (or at least syscall/kernel-call bound for mutexes, etc.) in practically everything it does, the 64-bit version is much faster (about 20%) than the 32-bit version when running on a 64-bit OS, since the OS does not have to thunk the call stack when accessing/returning from the kernel.
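
A quick way to verify the model of the running interpreter from Python
itself (a minimal standard-library sketch; 'file' on the python3 binary
tells you the same thing):

    import struct, sys
    print(struct.calcsize("P") * 8)   # pointer width in bits: 64 on a 64-bit build
    print(sys.maxsize > 2**32)        # True on a 64-bit Python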


Re: Simple read-only program very slow

Kevin O'Gorman
On Sun, Dec 3, 2017 at 8:49 AM, Keith Medcalf <[hidden email]> wrote:

> [...]
All of my machines are 64-bit, and run 64-bit Linux kernels.  Python3 is a
64-bit executable according to 'file'.

I re-ran the whole thing (slightly modified to also do a test on pmain) on
my big dual-Xeon (32 hyperthreads, 256 GB(!) RAM) from System76.  In spite of
having about half the CPU clock speed, and being single-threaded, it ran
about 5 times faster.  I take this to mean that it really helps if the
entire database fits in the kernel memory buffers.  Kinda like putting it
on an SSD, only better.  I also take it to mean two other things:
1. I should have been more careful to make sure the system was in the same
state on every trial.  It makes a difference if part of the DB is already
in the buffers.
2. Most of the slowdown was buffer thrashing, caused by an access pattern
that had to read some parts of the database several times.

I no longer think this is any kind of a problem in SQLite.

--
Dictionary.com's word of the year: *complicit*

Re: Simple read-only program very slow

Keith Medcalf

>All of my machines are 64-bit, and run 64-bit Linux kernels.  Python3
>is a 64-bit executable according to 'file'.

>I re-ran the whole thing (slightly modified to also do a test on
>pmain) on my big dual-Xeon (32 hyperthreads, 256 GB(!) RAM) from
>System76.  In spite of having about half the CPU clock speed,
>and being single-threaded, it ran about 5 times faster.

>I take this to mean that it really helps if the entire database
>fits in the kernel memory buffers.  Kinda like putting it
>on an SSD, only better.  

There are multiple levels of caching.  There is the block I/O cache, the device cache, the OS filesystem cache and also the SQLite3 page cache.  Even if *all* the database fits in the OS filesystem cache, you will still have quite a bit of a slowdown if the SQLite3 page cache is not big enough to hold the "working set" of pages necessary to traverse the B-Tree structures (indexes and tables).  By default the SQLite3 page cache is a paltry 20000 pages (which at a 4K page size is probably big enough for a small database (a gig or two) to have decent performance).  This cache can "thrash" as well.

Thrashing against the OS filesystem cache is significantly faster than thrashing down to actual physical I/O, but is still significantly slower than not thrashing at all, or thrashing within the I/O rate of the underlying physical storage/cache system.

If you read the hit rate and it is small (I define small as less than 99% -- you may want more) then your page cache (PRAGMA cache_size) is too small:

def hitrate(db):
    hits = db.status(apsw.SQLITE_DBSTATUS_CACHE_HIT)[0]
    misses = db.status(apsw.SQLITE_DBSTATUS_CACHE_MISS)[0]
    return hits * 100.0 / max(hits + misses, 1)
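
For example (a sketch assuming the apsw bindings and a database filename
that is only an illustration):

    import apsw
    db = apsw.Connection("qubic.db")
    # ... run the workload against db, then:
    print("page cache hit rate: %.2f%%" % hitrate(db))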

There is also, of course, a point at which it is more efficient to do an I/O (I/O time) than it is to search the page cache (CPU time).  Don't know where that is, but of course there is such a point.  Truly you want to find the point at which SQLite3's page cache is just sufficient to keep the I/O rate to the next lower level as low as possible, and no bigger.  Hard to guess what that number is, and it costs big bucks to performance-tune computers to balance workloads so that every component of the system stays just ever so slightly under 100% utilized 100% of the time (that is, operating at peak speed).

>I also take it to mean two other things:

>1. I should have been more careful to make sure the system was
>in the same state on every trial.  It makes a difference if
>part of the DB is already in the buffers.

>2. Most of the slowdown was buffer thrashing, caused by an
>access pattern that had to read some parts of the database
>several times.

>I no longer think this is any kind of a problem in SQLite.

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.



