Boosting insert and indexing performance for 10 billion rows (?)


AJ Miles
Hi everyone -

I've been using SQLite through Python (3.7) for a scientific project. The
data comes out to 10 billion rows of an 8-byte signed integer (~200-300 GB
pre-index), and while insertion takes ~6 hours, indexing takes 8 hours by
itself. Indexing also seems to slow as it is built. Does this sound
reasonable for this amount of data? I'm running this on an SSD to try and
help the I/O as best I can.

While I've been getting a lot of mileage out of this, I was wondering if
you had any tips on getting it to run faster. I've tried various PRAGMA
modifications to try and help the insert, but I'm wondering if there's
anything I can do to appreciably speed any of this up.

For my purposes, I don't need any safeguards against power loss etc. -
I've already set journal_mode and synchronous to OFF. This is a database
that will be built one time and accessed on occasion, and query speed is
fine as-is. The only thing I can think of is perhaps partitioning the
table and running the indexing in parallel on the partitions, but this
seems clunky, especially with Python's 10-database ATTACH limit. The
parameter for modifying this is non-obvious in the Python package, and
since I haven't done the experiment, I don't know to what extent it would
help.
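For concreteness, the setup described above looks like this through Python's sqlite3 module (a sketch; the file and table names are hypothetical stand-ins):

```python
import os
import sqlite3
import tempfile

# Build-once database: durability safeguards disabled, as described above.
db_path = os.path.join(tempfile.mkdtemp(), "bigdata.db")
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode = OFF")  # no rollback journal
conn.execute("PRAGMA synchronous = OFF")   # no fsync before returning
# Hypothetical table holding the 8-byte signed integers.
conn.execute("CREATE TABLE IF NOT EXISTS maindata (value INTEGER)")
```

A crash mid-build can corrupt the file with these settings, which is acceptable only because the database is rebuilt from scratch anyway.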

Thank you for any insight into this. The database is working fine as-is,
but I am trying to see for the sake of convenience and education if I can
get it to insert and/or index faster.

Cheers,

-AJ
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: Boosting insert and indexing performance for 10 billion rows (?)

Keith Medcalf

Have you increased the paltry default cache size?  (PRAGMA cache_size)  The bigger the better, especially since you are sorting and balancing large B-Trees.  The more this can be done in memory without spilling to slow disk (or the disk cache), the faster it will go ... (the best way to optimize I/O is to not do it)

You can also enable multithreaded merge sorts.  In my experience, though, enabling them slows things down rather than speeds them up; YMMV.  You might just have a database at the scale the multithreaded sorter was intended to address.

https://sqlite.org/pragma.html#pragma_threads
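Both settings are plain PRAGMAs from Python (a sketch; the values here are placeholders to tune against your available RAM and core count):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A negative cache_size is a size in KiB rather than a page count:
# -2000000 asks for roughly 2 GB of page cache (the default is only ~2 MB).
conn.execute("PRAGMA cache_size = -2000000")
# Allow up to 4 auxiliary worker threads for large sorts (e.g. index builds).
conn.execute("PRAGMA threads = 4")
```

Both pragmas apply per-connection, so they must be re-issued each time the database is opened.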


---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.






Re: Boosting insert and indexing performance for 10 billion rows (?)

David Raymond
In reply to this post by AJ Miles
Sounds like you're already doing the few things I would have recommended.

As an FYI in regards to the attached database limitation, you can compile your own .dll with a higher number for SQLITE_MAX_ATTACHED, up to 125
https://www.sqlite.org/limits.html#max_attached
and then swap out the default Python sqlite3.dll for yours in the <something>\Python37\DLLs folder.
(Or at least you can in Windows land. No clue about other OS's)
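One way to check which SQLITE_MAX_ATTACHED limit your particular Python build was compiled with (a probe, not production code; the alias names are arbitrary):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
attached = 0
try:
    # Attach in-memory databases until the compiled-in
    # SQLITE_MAX_ATTACHED limit (10 by default, 125 maximum) is reached.
    for i in range(200):
        conn.execute(f"ATTACH DATABASE ':memory:' AS part{i}")
        attached += 1
except sqlite3.OperationalError:
    pass  # "too many attached databases"
print(attached)
```

On a stock build this should print 10; a custom DLL compiled with a higher SQLITE_MAX_ATTACHED would print that number instead.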




Re: Boosting insert and indexing performance for 10 billion rows (?)

AJ Miles
Thank you for the good suggestions. I've been applying them to a smaller
subset of my database to see how it might perform.

I had tried fiddling with the cache size, but it seemed to slightly degrade
performance in my case. In principle it should help, so perhaps my smaller
database isn't entirely representative of the larger one; I will test it
fully later. However, the threading pragma is showing a pretty dramatic
improvement (a 3-4x increase in speed), so I think this will be a huge
benefit. For some reason I had overlooked that setting.

The attach limit looks like it would be good to try at some point. I am
using Windows, but unfortunately I am preparing this database for use in a
tool that will be shared with other installations/operating systems. In
order to keep it extremely simple for the end user, I'm sticking to whatever
I can rig through the pre-compiled Python version, but for my own personal
projects I will try to fiddle with attach. The ability to split the database
into multiple files will come in handy if I end up tackling anything larger.

-AJ



--
Sent from: http://sqlite.1065341.n5.nabble.com/

Re: Boosting insert and indexing performance for 10 billion rows (?)

Simon Slavin-3
In reply to this post by AJ Miles
Ignore multiprocessing for the inserting.  You'll just get contention when accessing the database.  And I think you are already trying the right PRAGMAs.

I think you've done this already, but just in case ...

Insert rows in batches.  Experiment with the batch size: maybe a thousand INSERTs per batch, maybe a million.  Try the orders of magnitude in between.  So ...

DROP all INDEXes
BEGIN
   INSERT first thousand rows
END
BEGIN
   INSERT next thousand rows
END
...
CREATE all INDEXes

Try that and time it.  Then change 'thousand' to 'ten thousand', up to a million.  Graph the times and figure out where your sweet spot is.
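The outline above translates to Python roughly as follows (a sketch; the table name, batch size, and data source are stand-ins for the real ones):

```python
import itertools
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE maindata (value INTEGER)")  # hypothetical schema

def rows(n=100_000):
    # Stand-in for the real data source: random 8-byte signed integers.
    for _ in range(n):
        yield (random.randint(-2**63, 2**63 - 1),)

BATCH = 10_000  # the knob to experiment with
src = rows()
while True:
    batch = list(itertools.islice(src, BATCH))
    if not batch:
        break
    with conn:  # one transaction per batch (commits on exiting the block)
        conn.executemany("INSERT INTO maindata VALUES (?)", batch)

# Create the index only after all rows are in.
conn.execute("CREATE INDEX idx_value ON maindata(value)")
```

Timing the loop for several values of BATCH and plotting the results finds the sweet spot Simon describes.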

Simon.

Re: Boosting insert and indexing performance for 10 billion rows (?)

Dominique Devienne
In reply to this post by AJ Miles
On Wed, Nov 28, 2018 at 6:03 PM AJ M <[hidden email]> wrote:

> [...] The data comes out to 10 billion rows of an 8 byte signed integer
> (~200-300 gb pre-index), and while insertion takes ~6 hours, indexing
> takes 8 hours by itself. [...] query speed is fine as-is. [...]

Hi AJ. Your message is quite intriguing, because you make it sound like your
row is composed of a single 8-byte signed integer. Even multiplied by 1e10
rows, that's only 80 GB ideally, so 200-300 GB pre-indexing means a large 3x
overhead in the DB, which doesn't sound right.

Also, a row composed of a single integer column is not that interesting at
first sight, and a SQL DB does not seem appropriate for such a simple data
"structure". What kind of query would you be running on that one signed
integer? Surely you have other columns in your DB? What's the natural or
primary key of those rows?

So far you've gotten answers to your specific question, but if we backed up
a little and got more context -- what you are trying to achieve at a higher
level, your exact table structures and indexes, and the kind of queries you
are running -- I'm sure you'd get a different perspective on your problem,
which I suspect may not be related to SQLite at all. My $0.02. --DD

Re: Boosting insert and indexing performance for 10 billion rows (?)

Arun - Siara Logics (cc)
Without going into the details of your structure, I suggest you look into the "WITHOUT ROWID" option when creating the table. It reduces the overhead drastically.
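A minimal sketch of the option (schema hypothetical; note that a WITHOUT ROWID table requires an explicit PRIMARY KEY, which cuts against the earlier point in this thread about avoiding a uniqueness constraint during insert):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# WITHOUT ROWID stores rows in a single B-tree clustered on the declared
# PRIMARY KEY, instead of a hidden-rowid tree plus a separate index.
conn.execute("""
    CREATE TABLE reference (
        value  INTEGER,
        ref_id INTEGER,
        PRIMARY KEY (value, ref_id)
    ) WITHOUT ROWID
""")
conn.execute("INSERT INTO reference VALUES (1, 2)")
```

Such a table has no rowid column at all; queries that reference `rowid` fail with an OperationalError.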

 ---- On Thu, 29 Nov 2018 18:59:26 +0530 Dominique Devienne <[hidden email]> wrote ----
 > [...]



Re: Boosting insert and indexing performance for 10 billion rows (?)

AJ Miles
Simon, Dominique, and Arun -

Thank you for the feedback. I'll leave the threading off for inserts since
I've seen DB contention issues with other multithreaded/multiprocessed
attempts. The indexing improvement is nice though.

I misspoke when I said it was 200-300 GB for just the integers -- my brain
was a little fuzzy. Right now, the integer table has 3 columns: rowid,
integer, and a foreign row id into a second table (so an 8-byte int, an
8-byte int, and a variable-byte int, I believe, unless the rowid is also
variably sized). The rowid is left in to avoid making the integer a primary
key, or enforcing a unique constraint upfront, which would result in
extremely slow insertion if I understand the basics correctly (and based on
my own tests). This works out to about 180 GB for the table.

There is a second table which stores some text information in 150 million
rows and references my integer table. This makes up some of the
difference. These rows are probably on the order of 200-300 bytes each, so
if I do the math, something in the 210-250 GB range seems reasonable.
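The arithmetic checks out (per-row sizes below are the rough estimates from the message, not measured values):

```python
# Back-of-envelope check of the sizes described above.
rows = 10_000_000_000          # ~10 billion integer-table rows
bytes_per_row = 18             # two ints plus varint/record overhead
text_rows = 150_000_000
bytes_per_text_row = 250       # midpoint of the 200-300 byte estimate

table_gb = rows * bytes_per_row / 1e9
text_gb = text_rows * bytes_per_text_row / 1e9
print(table_gb, text_gb)       # 180.0 and 37.5, ~218 GB combined
```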

-AJ




Re: Boosting insert and indexing performance for 10 billion rows (?)

Richard Hipp-3
On 11/29/18, AJ Miles <[hidden email]> wrote:

> [...]

Just for grins, consider downloading the "bundle of command-line tools
for managing SQLite database files" for your platform from
https://sqlite.org/download.html and running the "sqlite3_analyzer"
tool against your database.  That tool will print out the average
number of bytes used for each row of each table, along with additional
information about how content is laid out in your database.  Maybe
post the results here.  The utility will run at 1-5 GB/s so it might
take a minute or three for it to complete.

     sqlite3_analyzer  your-big-file.db >analysis.txt

Then copy/paste analysis.txt into an email to this list.

--
D. Richard Hipp
[hidden email]

Re: Boosting insert and indexing performance for 10 billion rows (?)

AJ Miles
Ah, this tool seems very handy. For those curious, I'll paste the results
below. The index approximately doubles the storage size, but I am
intentionally making that tradeoff to avoid the slowdown of enforcing a
unique/primary key on the Reference table while inserting.

-AJ

/** Disk-Space Utilization Report For F:/mytestdb2.db

Page size in bytes................................ 4096      
Pages in the whole file (measured)................ 104071490
Pages in the whole file (calculated).............. 104071489
Pages that store data............................. 104071489  100.000%
Pages on the freelist (per header)................ 0            0.0%
Pages on the freelist (calculated)................ 1            0.0%
Pages of auto-vacuum overhead..................... 0            0.0%
Number of tables in the database.................. 3        
Number of indices................................. 1        
Number of defined indices......................... 1        
Number of implied indices......................... 0        
Size of the file in bytes......................... 426276823040
Bytes of user payload stored...................... 174138410641  40.9%

*** Page counts for all tables with their indices
*****************************

REFERENCE......................................... 93008188    89.4%
MAINDATA.......................................... 11063300    10.6%
SQLITE_MASTER..................................... 1            0.0%

*** Page counts for all tables and indices separately
*************************

REFERENCE......................................... 50304534    48.3%
HASHINDEX......................................... 42703654    41.0%
MAINDATA.......................................... 11063300    10.6%
SQLITE_MASTER..................................... 1            0.0%

*** All tables and indices
****************************************************

Percentage of total database...................... 100.000%  
Number of entries................................. 17948049998
Bytes of storage consumed......................... 426276818944
Bytes of payload.................................. 321412979244  75.4%
Bytes of metadata................................. 100378462716  23.5%
Average payload per entry......................... 17.91    
Average unused bytes per entry.................... 0.25      
Average metadata per entry........................ 5.59      
Average fanout.................................... 241.00    
Maximum payload per entry......................... 37003    
Entries that use overflow......................... 53547        0.0%
Index pages used.................................. 430147    
Primary pages used................................ 103581347
Overflow pages used............................... 59995    
Total pages used.................................. 104071489
Unused bytes on index pages....................... 96815925     5.5%
Unused bytes on primary pages..................... 4383344069   1.0%
Unused bytes on overflow pages.................... 5242782      2.1%
Unused bytes on all pages......................... 4485402776   1.1%

*** All tables
****************************************************************

Percentage of total database......................  59.0%    
Number of entries................................. 9031683455
Bytes of storage consumed......................... 251362652160
Bytes of payload.................................. 174138410911  69.3%
Bytes of metadata................................. 73116919243  29.1%
Average payload per entry......................... 19.28    
Average unused bytes per entry.................... 0.45      
Average metadata per entry........................ 8.10      
Average fanout.................................... 332.00    
Maximum payload per entry......................... 37003    
Entries that use overflow......................... 53547        0.0%
Index pages used.................................. 184608    
Primary pages used................................ 61123232  
Overflow pages used............................... 59995    
Total pages used.................................. 61367835  
Unused bytes on index pages....................... 94142971    12.5%
Unused bytes on primary pages..................... 4007962045   1.6%
Unused bytes on overflow pages.................... 5242782      2.1%
Unused bytes on all pages......................... 4107347798   1.6%

*** All indices
***************************************************************

Percentage of total database......................  41.0%    
Number of entries................................. 8916366543
Bytes of storage consumed......................... 174914166784
Bytes of payload.................................. 147274568333  84.2%
Bytes of metadata................................. 27261543473  15.6%
B-tree depth...................................... 5        
Average payload per entry......................... 16.52    
Average unused bytes per entry.................... 0.04      
Average metadata per entry........................ 3.06      
Average fanout.................................... 173.00    
Maximum payload per entry......................... 17        
Entries that use overflow......................... 0            0.0%
Index pages used.................................. 245539    
Primary pages used................................ 42458115  
Overflow pages used............................... 0        
Total pages used.................................. 42703654  
Unused bytes on index pages....................... 2672954      0.27%
Unused bytes on primary pages..................... 375382024    0.22%
Unused bytes on overflow pages.................... 0        
Unused bytes on all pages......................... 378054978    0.22%

*** Table MAINDATA
************************************************************

Percentage of total database......................  10.6%    
Number of entries................................. 115316909
Bytes of storage consumed......................... 45315276800
Bytes of payload.................................. 41033664786  90.6%
Bytes of metadata................................. 1103219478   2.4%
B-tree depth...................................... 4        
Average payload per entry......................... 355.83    
Average unused bytes per entry.................... 27.56    
Average metadata per entry........................ 9.57      
Average fanout.................................... 358.00    
Non-sequential pages.............................. 10954556    99.02%
Maximum payload per entry......................... 37003    
Entries that use overflow......................... 53547        0.046%
Index pages used.................................. 30681    
Primary pages used................................ 10972624  
Overflow pages used............................... 59995    
Total pages used.................................. 11063300  
Unused bytes on index pages....................... 15769540    12.5%
Unused bytes on primary pages..................... 3157406006   7.0%
Unused bytes on overflow pages.................... 5242782      2.1%
Unused bytes on all pages......................... 3178418328   7.0%

*** Table REFERENCE and all its indices
***************************************

Percentage of total database......................  89.4%    
Number of entries................................. 17832733086
Bytes of storage consumed......................... 380961538048
Bytes of payload.................................. 280379314188  73.6%
Bytes of metadata................................. 99275243118  26.1%
Average payload per entry......................... 15.72    
Average unused bytes per entry.................... 0.07      
Average metadata per entry........................ 5.57      
Average fanout.................................... 232.00    
Maximum payload per entry......................... 17        
Entries that use overflow......................... 0            0.0%
Index pages used.................................. 399466    
Primary pages used................................ 92608722  
Overflow pages used............................... 0        
Total pages used.................................. 93008188  
Unused bytes on index pages....................... 81046385     5.0%
Unused bytes on primary pages..................... 1225934357   0.32%
Unused bytes on overflow pages.................... 0        
Unused bytes on all pages......................... 1306980742   0.34%

*** Table REFERENCE w/o any indices
*******************************************

Percentage of total database......................  48.3%    
Number of entries................................. 8916366543
Bytes of storage consumed......................... 206047371264
Bytes of payload.................................. 133104745855  64.6%
Bytes of metadata................................. 72013699645  35.0%
B-tree depth...................................... 5        
Average payload per entry......................... 14.93    
Average unused bytes per entry.................... 0.10      
Average metadata per entry........................ 8.08      
Average fanout.................................... 326.00    
Non-sequential pages.............................. 11004959    21.9%
Maximum payload per entry......................... 15        
Entries that use overflow......................... 0            0.0%
Index pages used.................................. 153927    
Primary pages used................................ 50150607  
Overflow pages used............................... 0        
Total pages used.................................. 50304534  
Unused bytes on index pages....................... 78373431    12.4%
Unused bytes on primary pages..................... 850552333    0.41%
Unused bytes on overflow pages.................... 0        
Unused bytes on all pages......................... 928925764    0.45%

*** Index HASHINDEX of table REFERENCE
****************************************

Percentage of total database......................  41.0%    
Number of entries................................. 8916366543
Bytes of storage consumed......................... 174914166784
Bytes of payload.................................. 147274568333  84.2%
Bytes of metadata................................. 27261543473  15.6%
B-tree depth...................................... 5        
Average payload per entry......................... 16.52    
Average unused bytes per entry.................... 0.04      
Average metadata per entry........................ 3.06      
Average fanout.................................... 173.00    
Non-sequential pages.............................. 0            0.0%
Maximum payload per entry......................... 17        
Entries that use overflow......................... 0            0.0%
Index pages used.................................. 245539    
Primary pages used................................ 42458115  
Overflow pages used............................... 0        
Total pages used.................................. 42703654  
Unused bytes on index pages....................... 2672954      0.27%
Unused bytes on primary pages..................... 375382024    0.22%
Unused bytes on overflow pages.................... 0        
Unused bytes on all pages......................... 378054978    0.22%

*** Table SQLITE_MASTER
*******************************************************

Percentage of total database......................   0.0%    
Number of entries................................. 3        
Bytes of storage consumed......................... 4096      
Bytes of payload.................................. 270          6.6%
Bytes of metadata................................. 120          2.9%
B-tree depth...................................... 1        
Average payload per entry......................... 90.00    
Average unused bytes per entry.................... 1235.33  
Average metadata per entry........................ 40.00    
Maximum payload per entry......................... 94        
Entries that use overflow......................... 0            0.0%
Primary pages used................................ 1        
Overflow pages used............................... 0        
Total pages used.................................. 1        
Unused bytes on primary pages..................... 3706        90.5%
Unused bytes on overflow pages.................... 0        
Unused bytes on all pages......................... 3706        90.5%

*** Definitions
***************************************************************

Page size in bytes

    The number of bytes in a single page of the database file.  
    Usually 1024.

Number of pages in the whole file

    The number of 4096-byte pages that go into forming the complete
    database

Pages that store data

    The number of pages that store data, either as primary B*Tree pages or
    as overflow pages.  The number at the right is the data pages divided by
    the total number of pages in the file.

Pages on the freelist

    The number of pages that are not currently in use but are reserved for
    future use.  The percentage at the right is the number of freelist pages
    divided by the total number of pages in the file.

Pages of auto-vacuum overhead

    The number of pages that store data used by the database to facilitate
    auto-vacuum. This is zero for databases that do not support auto-vacuum.

Number of tables in the database

    The number of tables in the database, including the SQLITE_MASTER table
    used to store schema information.

Number of indices

    The total number of indices in the database.

Number of defined indices

    The number of indices created using an explicit CREATE INDEX statement.

Number of implied indices

    The number of indices used to implement PRIMARY KEY or UNIQUE
    constraints on tables.

Size of the file in bytes

    The total amount of disk space used by the entire database files.

Bytes of user payload stored

    The total number of bytes of user payload stored in the database. The
    schema information in the SQLITE_MASTER table is not counted when
    computing this number.  The percentage at the right shows the payload
    divided by the total file size.

Percentage of total database

    The amount of the complete database file that is devoted to storing
    information described by this category.

Number of entries

    The total number of B-Tree key/value pairs stored under this category.

Bytes of storage consumed

    The total amount of disk space required to store all B-Tree entries
    under this category.  The is the total number of pages used times
    the pages size.

Bytes of payload

    The amount of payload stored under this category.  Payload is the data
    part of table entries and the key part of index entries.  The percentage
    at the right is the bytes of payload divided by the bytes of storage
    consumed.

Bytes of metadata

    The amount of formatting and structural information stored in the
    table or index.  Metadata includes the btree page header, the cell
    pointer array, the size field for each cell, the left child pointer
    of non-leaf cells, the overflow pointers for overflow cells, and the
    rowid value for rowid table cells.  In other words, metadata is
    everything that is neither unused space nor content.  The record
    header in the payload is counted as content, not metadata.

Average payload per entry

    The average amount of payload on each entry.  This is just the bytes of
    payload divided by the number of entries.

Average unused bytes per entry

    The average amount of free space remaining on all pages under this
    category on a per-entry basis.  This is the number of unused bytes on
    all pages divided by the number of entries.

Non-sequential pages

    The number of pages in the table or index that are out of sequence.
    Many filesystems are optimized for sequential file access so a small
    number of non-sequential pages might result in faster queries,
    especially for larger database files that do not fit in the disk cache.
    Note that after running VACUUM, the root page of each table or index is
    at the beginning of the database file and all other pages are in a
    separate part of the database file, resulting in a single non-
    sequential page.

Maximum payload per entry

    The largest payload size of any entry.

Entries that use overflow

    The number of entries that use one or more overflow pages.

Total pages used

    This is the number of pages used to hold all information in the current
    category.  This is the sum of index, primary, and overflow pages.

Index pages used

    This is the number of pages in a table B-tree that hold only key (rowid)
    information and no data.

Primary pages used

    This is the number of B-tree pages that hold both key and data.

Overflow pages used

    The total number of overflow pages used for this category.

Unused bytes on index pages

    The total number of bytes of unused space on all index pages.  The
    percentage at the right is the number of unused bytes divided by the
    total number of bytes on index pages.

Unused bytes on primary pages

    The total number of bytes of unused space on all primary pages.  The
    percentage at the right is the number of unused bytes divided by the
    total number of bytes on primary pages.

Unused bytes on overflow pages

    The total number of bytes of unused space on all overflow pages.  The
    percentage at the right is the number of unused bytes divided by the
    total number of bytes on overflow pages.

Unused bytes on all pages

    The total number of bytes of unused space on all primary and overflow
    pages.  The percentage at the right is the number of unused bytes
    divided by the total number of bytes.

*******************************************************************************
The entire text of this report can be sourced into any SQL database
engine for further analysis.  All of the text above is an SQL comment.
The data used to generate this report follows:
*/
BEGIN;
CREATE TABLE space_used(
   name clob,        -- Name of a table or index in the database file
   tblname clob,     -- Name of associated table
   is_index boolean, -- TRUE if it is an index, false for a table
   is_without_rowid boolean, -- TRUE if WITHOUT ROWID table  
   nentry int,       -- Number of entries in the BTree
   leaf_entries int, -- Number of leaf entries
   depth int,        -- Depth of the b-tree
   payload int,      -- Total amount of data stored in this table or index
   ovfl_payload int, -- Total amount of data stored on overflow pages
   ovfl_cnt int,     -- Number of entries that use overflow
   mx_payload int,   -- Maximum payload size
   int_pages int,    -- Number of interior pages used
   leaf_pages int,   -- Number of leaf pages used
   ovfl_pages int,   -- Number of overflow pages used
   int_unused int,   -- Number of unused bytes on interior pages
   leaf_unused int,  -- Number of unused bytes on primary pages
   ovfl_unused int,  -- Number of unused bytes on overflow pages
   gap_cnt int,      -- Number of gaps in the page layout
   compressed_size int  -- Total bytes stored on disk
);
INSERT INTO space_used
VALUES('sqlite_master','sqlite_master',0,0,3,3,1,270,0,0,94,0,1,0,0,3706,0,0,4096);
INSERT INTO space_used VALUES('MainData','MainData',0,0,126289532,115316909,4,41033664786,240256758,53547,37003,30681,10972624,59995,15769540,3157406006,5242782,10954556,45315276800);
INSERT INTO space_used VALUES('Reference','Reference',0,0,8966517149,8916366543,5,133104745855,0,0,15,153927,50150607,0,78373431,850552333,0,11004959,206047371264);
INSERT INTO space_used VALUES('HashIndex','Reference',1,0,8916366543,8873908429,5,147274568333,0,0,17,245539,42458115,0,2672954,375382024,0,0,174914166784);
COMMIT;
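As the report says, the text above can be sourced into any SQL database engine for further analysis. A minimal sketch of that kind of follow-up, in Python (the OP's environment): it loads two of the rows from the dump into an in-memory database and computes what fraction of each object's on-disk footprint is user payload. The schema and values are copied from the dump; the column types are abbreviated.

```python
import sqlite3

# Load part of the space_used dump emitted by sqlite3_analyzer into an
# in-memory database and compute, per table/index, the fraction of its
# on-disk footprint that is user payload.  Rows copied from the dump above.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE space_used(
   name TEXT, tblname TEXT, is_index INT, is_without_rowid INT,
   nentry INT, leaf_entries INT, depth INT, payload INT, ovfl_payload INT,
   ovfl_cnt INT, mx_payload INT, int_pages INT, leaf_pages INT,
   ovfl_pages INT, int_unused INT, leaf_unused INT, ovfl_unused INT,
   gap_cnt INT, compressed_size INT)""")
rows = [
    ("Reference", "Reference", 0, 0, 8966517149, 8916366543, 5,
     133104745855, 0, 0, 15, 153927, 50150607, 0, 78373431, 850552333,
     0, 11004959, 206047371264),
    ("HashIndex", "Reference", 1, 0, 8916366543, 8873908429, 5,
     147274568333, 0, 0, 17, 245539, 42458115, 0, 2672954, 375382024,
     0, 0, 174914166784),
]
con.executemany("INSERT INTO space_used VALUES(%s)" % ",".join("?" * 19), rows)
payload_frac = dict(con.execute(
    "SELECT name, ROUND(1.0 * payload / compressed_size, 3) FROM space_used"))
print(payload_frac)
```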




_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: Boosting insert and indexing performance for 10 billion rows (?)

E.Pasma

>  30 nov. 2018, AJ Miles:
>
> Ah, this tool seems very handy. For those curious, I'll paste the results
> below. The index approximately doubles the storage size, but I am
> intentionally making that tradeoff to avoid the slow down when enforcing a
> unique/primary key on the Reference table while inserting.
>
> -AJ

Hello,

Is the "integer" column unique and thus a potential primary key? Then it would be surprising that defining it as such causes a slowdown, assuming the table is then defined WITHOUT ROWID and that the multithreading pragma is set.

About sqlite_analyzer: on macOS 10.13.6 I got
..Tcl.framework/Versions/8.4.. image not found
After editing the binary file as below, it is alright.
vi sqlite3_analyzer
:%s/8\.4/8.5/
:x

Thanks for the intriguing subject. E. Pasma.

Re: Boosting insert and indexing performance for 10 billion rows (?)

E.Pasma

> 2 dec. 2018, E.Pasma:
>
>> 30 nov. 2018, AJ Miles:
>>
>> Ah, this tool seems very handy. For those curious, I'll paste the results
>> below. The index approximately doubles the storage size, but I am
>> intentionally making that tradeoff to avoid the slow down when enforcing a
>> unique/primary key on the Reference table while inserting.
>>
>> -AJ
>
> Hello,
>
> is the "integer" column unique and a potential primary key? Then it would be surprising that defining it as such causes a slowdown. Assuming the table is then defined WITHOUT ROWID. And that the multithreading pragma is set.
>
> About sqlite_analyzer: on macOS 10.13.6 I got
> ..Tcl.framework/Versions/8.4.. image not found
> After editing the binary file as below, it is alright.
> vi sqlite3_analyzer
> :%s/8\.4/8.5/
> :x
>
> Thanks for the intriguing subject. E. Pasma.
Sorry,

In a small test, I measured inserting in a table with an INTEGER PRIMARY KEY and without rowid.
Unfortunately, it causes a slowdown by a factor of 5 compared to your schema.

In my test, the rows are inserted in random order. If inserted in the order of the key value, the result is different altogether. But it looks hard to achieve that with billions of rows.
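At small scale, the random-order vs. key-order difference can be illustrated from Python by pre-sorting the batch before inserting; this is a sketch (1,000,000 rows, in-memory database), so the effect will be smaller than with a file database and billions of rows:

```python
import sqlite3
import random
import time

# Small-scale illustration of random-order vs. key-order inserts into a
# WITHOUT ROWID table.  Pre-sorting the batch in Python is one way to
# approximate "inserted in the order of the key value".
def timed_insert(values):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE x(value INTEGER PRIMARY KEY) WITHOUT ROWID")
    t0 = time.perf_counter()
    with con:  # one transaction, as in the measurements above
        con.executemany("INSERT INTO x VALUES (?)", ((v,) for v in values))
    return time.perf_counter() - t0

data = random.sample(range(10_000_000), 1_000_000)
t_random = timed_insert(data)
t_sorted = timed_insert(sorted(data))
print(f"random order: {t_random:.2f}s  sorted order: {t_sorted:.2f}s")
```

Of course, sorting billions of keys in application memory first is exactly the part that is hard at the OP's scale.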

Thanks, E. Pasma





Re: Boosting insert and indexing performance for 10 billion rows (?)

Keith Medcalf
In reply to this post by E.Pasma

Well if it is unique and not null, then why not just make it the rowid?  In either case, you would still have to permute the storage tree at insert time if the inserts were not in-order.  So let us compare them shall we:

sqlite> create table x(value INTEGER PRIMARY KEY);
sqlite> insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 185.795 user 184.265625 sys 0.343750

sqlite> create table x(value INTEGER PRIMARY KEY) WITHOUT ROWID;
sqlite> insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 174.661 user 173.890625 sys 0.000000

sqlite> create table x(value INTEGER NOT NULL);
sqlite> insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 20.287 user 20.265625 sys 0.000000
sqlite> create unique index ix on x(value);
Run Time: real 211.556 user 207.562500 sys 2.562500

sqlite> create table x(value INTEGER NOT NULL);
sqlite> create unique index ix on x(value);
sqlite> insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 196.719 user 195.437500 sys 0.000000

So, the winner is (in this case, with no other payload) that the INTEGER PRIMARY KEY on a WITHOUT ROWID table is the most time-efficient, followed by the INTEGER PRIMARY KEY of a ROWID table.  Those two also happen to be the most space-efficient.  Interestingly, in this example it was overall faster to build and maintain the separate index at insert time than to build the index separately, but not by much.  Note this is for 100,000,000 records processed entirely in RAM in a single transaction ...
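The same four variants can be sketched from Python for builds without the generate_series extension; the row count here is scaled way down from the 100,000,000 above, so treat the relative timings as illustrative only:

```python
import sqlite3
import random
import time

# Time inserting N random integers under the four schema variants compared
# above.  N is deliberately small; the keys are 62-bit random values, so
# unique-constraint collisions are effectively impossible at this scale.
N = 200_000
VARIANTS = {
    "ipk, rowid":         ["CREATE TABLE x(value INTEGER PRIMARY KEY)"],
    "ipk, without rowid": ["CREATE TABLE x(value INTEGER PRIMARY KEY) WITHOUT ROWID"],
    "index after":        ["CREATE TABLE x(value INTEGER NOT NULL)"],
    "index before":       ["CREATE TABLE x(value INTEGER NOT NULL)",
                           "CREATE UNIQUE INDEX ix ON x(value)"],
}
values = [(random.getrandbits(62),) for _ in range(N)]
times, counts = {}, {}
for name, ddl in VARIANTS.items():
    con = sqlite3.connect(":memory:")
    for stmt in ddl:
        con.execute(stmt)
    t0 = time.perf_counter()
    with con:                                  # one transaction, as above
        con.executemany("INSERT INTO x VALUES (?)", values)
    if name == "index after":                  # build the index separately
        con.execute("CREATE UNIQUE INDEX ix ON x(value)")
    times[name] = time.perf_counter() - t0
    counts[name] = con.execute("SELECT count(*) FROM x").fetchone()[0]
for name in times:
    print(f"{name:20s} {times[name]:6.2f}s")
```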

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>[hidden email]] On Behalf Of E.Pasma
>Sent: Sunday, 2 December, 2018 03:38
>To: SQLite mailing list
>Subject: Re: [sqlite] Boosting insert and indexing performance for 10
>billion rows (?)
>
>
>>  30 nov. 2018, AJ Miles:
>>
>> Ah, this tool seems very handy. For those curious, I'll paste the
>results
>> below. The index approximately doubles the storage size, but I am
>> intentionally making that tradeoff to avoid the slow down when
>enforcing a
>> unique/primary key on the Reference table while inserting.
>>
>> -AJ
>
>Hello,
>
>is the "integer" column unique and a potential primary key? Then it
>would be surprising that defining it as such causes a slowdown.
>Assuming the table is then defined WITHOUT ROWID. And that the
>multithreading pragma is set.
>
>About sqlite_analyzer: on macOS 10.13.6 I got
>..Tcl.framework/Versions/8.4.. image not found
>After editing the binary file as below, it is alright.
>vi sqlite3_analyzer
>:%s/8\.4/8.5/
>:x
>
>Thanks for the intriguing subject. E. Pasma.




Re: Boosting insert and indexing performance for 10 billion rows (?)

E.Pasma

> 2 dec. 2018, Keith Medcalf:
>
>
> Well if it is unique and not null, then why not just make it the rowid?  In either case, you would still have to permute the storage tree at insert time if the inserts were not in-order.  So let us compare them shall we:
>
> sqlite> create table x(value INTEGER PRIMARY KEY);
> sqlite> insert into x select random() from generate_series where start=1 and stop=100000000;
> Run Time: real 185.795 user 184.265625 sys 0.343750
>
> sqlite> create table x(value INTEGER PRIMARY KEY) WITHOUT ROWID;
> sqlite> insert into x select random() from generate_series where start=1 and stop=100000000;
> Run Time: real 174.661 user 173.890625 sys 0.000000
>
> sqlite> create table x(value INTEGER NOT NULL);
> sqlite> insert into x select random() from generate_series where start=1 and stop=100000000;
> Run Time: real 20.287 user 20.265625 sys 0.000000
> sqlite> create unique index ix on x(value);
> Run Time: real 211.556 user 207.562500 sys 2.562500
>
> sqlite> create table x(value INTEGER NOT NULL);
> sqlite> create unique index ix on x(value);
> sqlite> insert into x select random() from generate_series where start=1 and stop=100000000;
> Run Time: real 196.719 user 195.437500 sys 0.000000
>
> So, the winner is (in this case, with no other payload) that the INTEGER PRIMARY KEY on a WITHOUT ROWID table is the most time efficient, followed by being the INTEGER PRIMARY KEY of a ROWID table.  Those two also happen to be the most space-efficient as well.  Interestingly it was overall faster to build and maintain the separate index at insert time in this example than to build the index seperately, but not by much.  Note this is for 100,000,000 records processed entitely in RAM in a single transaction ...

Thanks, because this speed is what I had expected, except for creating the index beforehand being overall faster than adding it after the data is inserted.
Unfortunately, but almost funny, my measurements are almost the reverse of what you see. See below.
Possible clues:
- I tested with a file database, not memory.
- I reduced the number of rows to 10,000,000.
- I have only a basic system with 4 GB RAM and a single processor.

.timer on
.load series
drop table if exists x;
create table x(value INTEGER PRIMARY KEY);

insert into x select random() from generate_series where start=1 and stop=10000000;
Run Time: real 87.673 user 37.691114 sys 43.527249

drop table x;
create table x(value INTEGER PRIMARY KEY) WITHOUT ROWID;

insert into x select random() from generate_series where start=1 and stop=10000000;
Run Time: real 88.759 user 36.276227 sys 44.190566

drop table x;
create table x(value INTEGER NOT NULL);

insert into x select random() from generate_series where start=1 and stop=10000000;
Run Time: real 3.980 user 2.266869 sys 0.124012

create unique index ix on x(value);
Run Time: real 10.131 user 7.623369 sys 0.797015

drop table x;
create table x(value INTEGER NOT NULL);
create unique index ix on x(value);

insert into x select random() from generate_series where start=1 and stop=10000000;
Run Time: real 107.647 user 41.249081 sys 50.869456



Re: Boosting insert and indexing performance for 10 billion rows (?)

Simon Slavin-3
On 2 Dec 2018, at 7:29pm, E.Pasma <[hidden email]> wrote:

> drop table x;
> create table x(value INTEGER PRIMARY KEY) WITHOUT ROWID;
>
> insert into x select random() from generate_series where start=1 and stop=10000000;
> Run Time: real 88.759 user 36.276227 sys 44.190566

Realtime is 88.759

> create table x(value INTEGER NOT NULL);
>
> insert into x select random() from generate_series where start=1 and stop=10000000;
> Run Time: real 3.980 user 2.266869 sys 0.124012
>
> create unique index ix on x(value);
> Run Time: real 10.131 user 7.623369 sys 0.797015

Total realtime 14.111

I had not expected that much of a difference.

The long time for the WITHOUT ROWID sample is caused by the index being built as the rows are inserted.  Since the values are added in a random order, the tree is being expanded at random points each time.  In contrast, CREATE UNIQUE INDEX is able to form the index tree in a better-optimised way, since it has all the values it needs available when it starts.


Re: Boosting insert and indexing performance for 10 billion rows (?)

Keith Medcalf
On Sunday, 2 December, 2018 12:57, Simon Slavin <[hidden email]> wrote:

>On 2 Dec 2018, at 7:29pm, E.Pasma <[hidden email]> wrote:

>> drop table x;
>> create table x(value INTEGER PRIMARY KEY) WITHOUT ROWID;
>>
>> insert into x select random() from generate_series where start=1
>and stop=10000000;
>> Run Time: real 88.759 user 36.276227 sys 44.190566
>
>Realtime is 88.759
>
>> create table x(value INTEGER NOT NULL);
>>
>> insert into x select random() from generate_series where start=1
>and stop=10000000;
>> Run Time: real 3.980 user 2.266869 sys 0.124012
>>
>> create unique index ix on x(value);
>> Run Time: real 10.131 user 7.623369 sys 0.797015
>
>Total realtime 14.111
>
>I had not expected that much of a difference.
>
>The long time for the WITHOUT ROWID sample is caused by the index
>being built as the rows are inserted.  Since the values are added in
>a random order, the tree is being expanded at random points each
>time.  In contrast, CREATE UNIQUE INDEX is able to form the index
>tree in a better-optimised way, since it has all the values it needs
>available when it starts.

Well, yes and no.  CREATE INDEX as a separate step extracts the keys, sorts them, then builds the tree with in-order inserts.  When the key is part of the table (whether with or without rowid) and the inserts are in random order, the tree must be re-balanced much more often (in the worst case on each insert), rather than only once per tree level as for in-order inserts.

Most of this can be overcome by simply having a larger cache size to permit (more of) those operations to be performed in RAM rather than putting a continuous I/O load on the system.  

Cache size and the number of threads used for sorting have a huge effect on performance.  Here are results (this time, to an on-disk database on an SSD) with various cache sizes and thread counts, for 100,000,000 records (a smaller number of records does not produce meaningful differences for me when using bigger configurations):
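In the Python sqlite3 module (the OP's environment), these knobs are set per connection. A minimal sketch; note that cache_size counts pages (a negative value means kibibytes), and PRAGMA threads is capped by the compile-time SQLITE_MAX_WORKER_THREADS, so the readback may be lower than what was requested:

```python
import sqlite3

# Set the same pragmas as the "big" configuration below.  cache_size is in
# pages (4 KiB each by default); threads is the max worker threads SQLite
# may launch for sorting; temp_store=1 keeps temp objects in files.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA cache_size=1048576")
con.execute("PRAGMA threads=8")     # capped by SQLITE_MAX_WORKER_THREADS
con.execute("PRAGMA temp_store=1")
print(con.execute("PRAGMA cache_size").fetchone()[0],
      con.execute("PRAGMA threads").fetchone()[0])
```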

>sqlite test.db < test.sql
pragma cache_size=1048576;
pragma threads=8;
pragma temp_store=1;
.timer on
---
drop table if exists x;
Run Time: real 0.000 user 0.000000 sys 0.000000
create table x(value INTEGER PRIMARY KEY);
Run Time: real 0.018 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 190.586 user 185.531250 sys 3.359375
---
drop table x;
Run Time: real 1.468 user 1.437500 sys 0.000000
create table x(value INTEGER PRIMARY KEY) WITHOUT ROWID;
Run Time: real 0.013 user 0.015625 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 180.595 user 176.437500 sys 1.734375
---
drop table x;
Run Time: real 0.946 user 0.906250 sys 0.015625
create table x(value INTEGER NOT NULL);
Run Time: real 0.014 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 22.569 user 19.703125 sys 2.078125
create unique index ix on x(value);
Run Time: real 49.409 user 136.390625 sys 12.484375
---
drop table x;
Run Time: real 1.940 user 1.875000 sys 0.031250
create table x(value INTEGER NOT NULL);
Run Time: real 0.019 user 0.000000 sys 0.000000
create unique index ix on x(value);
Run Time: real 0.014 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 207.955 user 202.125000 sys 4.000000
---
.exit

>sqlite test.db < test.sql
pragma cache_size=1048576;
pragma threads=0;
0
pragma temp_store=1;
.timer on
---
drop table if exists x;
Run Time: real 8.282 user 2.515625 sys 5.734375
create table x(value INTEGER PRIMARY KEY);
Run Time: real 0.014 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 189.177 user 182.656250 sys 3.187500
---
drop table x;
Run Time: real 1.465 user 1.437500 sys 0.000000
create table x(value INTEGER PRIMARY KEY) WITHOUT ROWID;
Run Time: real 0.014 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 178.972 user 172.265625 sys 2.937500
---
drop table x;
Run Time: real 0.938 user 0.890625 sys 0.000000
create table x(value INTEGER NOT NULL);
Run Time: real 0.014 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 25.313 user 19.703125 sys 2.812500
create unique index ix on x(value);
Run Time: real 117.165 user 106.921875 sys 6.421875
---
drop table x;
Run Time: real 2.037 user 1.953125 sys 0.046875
create table x(value INTEGER NOT NULL);
Run Time: real 0.027 user 0.000000 sys 0.015625
create unique index ix on x(value);
Run Time: real 0.018 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 203.974 user 198.265625 sys 3.875000
---
.exit

All else being equal, allowing SQLite to create multiple threads for sorting when creating the index separately has an enormous impact.  So now let's try the same thing with a smaller cache_size (above is 4 GB, below is 1 GB, still absolutely huge in comparison to the default):

>sqlite test.db < test.sql
pragma cache_size=65536;
pragma threads=8;
8
pragma temp_store=1;
.timer on
---
drop table if exists x;
Run Time: real 7.527 user 2.062500 sys 5.437500
create table x(value INTEGER PRIMARY KEY);
Run Time: real 0.014 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 1044.661 user 277.500000 sys 754.500000
---
drop table x;
Run Time: real 3.567 user 1.656250 sys 1.906250
create table x(value INTEGER PRIMARY KEY) WITHOUT ROWID;
Run Time: real 0.013 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 995.853 user 259.609375 sys 722.453125
---
drop table x;
Run Time: real 2.967 user 0.828125 sys 2.093750
create table x(value INTEGER NOT NULL);
Run Time: real 0.014 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 26.038 user 19.046875 sys 4.375000
create unique index ix on x(value);
Run Time: real 46.088 user 132.109375 sys 15.718750
---
drop table x;
Run Time: real 7.166 user 1.765625 sys 5.375000
create table x(value INTEGER NOT NULL);
Run Time: real 0.018 user 0.000000 sys 0.000000
create unique index ix on x(value);
Run Time: real 0.013 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 1204.187 user 307.687500 sys 880.281250
---
.exit

>sqlite test.db < test.sql
pragma cache_size=65536;
pragma threads=0;
0
pragma temp_store=1;
.timer on
---
drop table if exists x;
Run Time: real 7.482 user 2.640625 sys 4.812500
create table x(value INTEGER PRIMARY KEY);
Run Time: real 0.013 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 1047.682 user 279.687500 sys 757.875000
---
drop table x;
Run Time: real 3.567 user 1.390625 sys 2.140625
create table x(value INTEGER PRIMARY KEY) WITHOUT ROWID;
Run Time: real 0.014 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 995.550 user 263.703125 sys 722.578125
---
drop table x;
Run Time: real 2.942 user 0.781250 sys 2.125000
create table x(value INTEGER NOT NULL);
Run Time: real 0.013 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 25.967 user 19.500000 sys 4.000000
create unique index ix on x(value);
Run Time: real 112.580 user 100.156250 sys 9.718750
---
drop table x;
Run Time: real 7.060 user 2.031250 sys 4.953125
create table x(value INTEGER NOT NULL);
Run Time: real 0.016 user 0.000000 sys 0.000000
create unique index ix on x(value);
Run Time: real 0.017 user 0.000000 sys 0.000000
insert into x select random() from generate_series where start=1 and stop=100000000;
Run Time: real 1202.757 user 305.546875 sys 885.656250
---
.exit



