How to increase performance when inserting a lot of small data into table using indices


mailing lists
I have the following situation:

- relatively small data is inserted into one table (about 100 bytes per record)
- this table contains three indices
- about 100 million or more records have to be inserted

Insertion really slows down after about 100 000 items have been inserted. I suppose that the slowdown is related to indexing because:

a) removing the indices brings the speed up
b) it does not matter whether I use a solid-state drive or a conventional one (the overall speed differs, but not the phenomenon itself)
c) changing the cache size has only a minor impact

So, the best solution I have found so far is to disable indexing during insertion and to index the table afterwards (this is orders of magnitude faster than inserting with the indexes in place). Are there any better solutions or other tricks I might try (such as splitting the table into a data part and an index part)?
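
For illustration, a minimal sketch of this approach (table, column and
index names are placeholders, not my actual schema):

  -- drop the indices before the bulk load
  DROP INDEX IF EXISTS idx_t_a;
  DROP INDEX IF EXISTS idx_t_b;
  DROP INDEX IF EXISTS idx_t_c;

  -- insert the ~100 million records without any index maintenance
  BEGIN;
  INSERT INTO t(a, b, c) VALUES (?, ?, ?);  -- repeated per record, values bound by the application
  COMMIT;

  -- rebuild each index in a single pass afterwards
  CREATE INDEX idx_t_a ON t(a);
  CREATE INDEX idx_t_b ON t(b);
  CREATE INDEX idx_t_c ON t(c);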

BTW: I am using journal_mode DELETE. WAL mode only delays the problem and increases the speed a bit, but not significantly.

Regards,
Hartwig



Re: How to increase performance when inserting a lot of small data into table using indices

Rob Willett
Hartwig,

You have got most of the tricks we know about. Other, more experienced
developers may be able to provide better insight.

We had to move about 60 GB of table data around, and we ended up doing
what you have done, with one extra bit: we batched the jobs up in
groups of 10,000 between BEGIN and END to make transactions out of
them. It's not clear if you are doing that.
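
Schematically, something like this (table and column names invented for
illustration, not our actual code):

  BEGIN;
  INSERT INTO t(a, b, c) VALUES (?, ?, ?);
  -- ... repeat until 10,000 rows are in this batch ...
  COMMIT;
  -- then start the next BEGIN/COMMIT batch of 10,000,
  -- so a failure only loses the batch that was in flight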

Rob


Re: How to increase performance when inserting a lot of small data into table using indices

Richard Hipp
In reply to this post by mailing lists
On 9/10/19, mailing lists <[hidden email]> wrote:

> So, the best solution I have found so far is to disable indexing during
> insertion and to index the table afterwards

I think that is the best solution.  Be sure to also do all of your
inserts (and the CREATE INDEX statements) inside of a transaction.
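
For example (hypothetical table and index names, just to show the shape):

  BEGIN;
  -- ... all of the INSERT statements ...
  CREATE INDEX idx_t_a ON t(a);  -- index builds inside the same transaction
  CREATE INDEX idx_t_b ON t(b);
  CREATE INDEX idx_t_c ON t(c);
  COMMIT;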

--
D. Richard Hipp
[hidden email]

Re: How to increase performance when inserting a lot of small data into table using indices

mailing lists
Hi,

I cannot really put all the inserts into one transaction because, in case of a failure, I would lose all the already inserted data. I did make some tests, though: there is hardly any performance gain anymore when doing 1 000 or 10 000 insertions in one transaction with immediate insertion into the indices (in my case the difference is in the percent range).

Why is index creation afterwards so much faster than inserting with the indices in place? Once I heard something about fragmentation, but on solid-state disks fragmentation should not play a role as long as indices and data are separated, should it?

Regards,
Hartwig


> On 2019-09-10 at 17:16, Richard Hipp <[hidden email]> wrote:
>
> I think that is the best solution.  Be sure to also do all of your
> inserts (and the CREATE INDEX statements) inside of a transaction.


Re: [EXTERNAL] Re: How to increase performance when inserting a lot of small data into table using indices

Hick Gunter
While a write transaction is open, SQLite needs to keep the changed pages in memory. When the size of a transaction (measured in changed pages) exceeds the available memory, SQLite starts to spill the transaction to disk. The optimal transaction size would be just before this occurs, but there is no interface available to determine this, so the number of records is a commonly used proxy.

Creating indexes after insert is faster because the writes are localized to the index pages, so you get more logical inserts per disk write, and file-system buffering and read-ahead get more cache hits too.
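
If the default page cache is the limiting factor, it can be enlarged so
that larger transactions fit in memory before spilling starts (the value
below is only an example; the right size depends on the machine):

  -- a negative cache_size is measured in KiB,
  -- so this requests roughly 200 MB of page cache
  PRAGMA cache_size = -200000;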


Re: How to increase performance when inserting a lot of small data into table using indices

Richard Hipp
In reply to this post by mailing lists
On 9/10/19, mailing lists <[hidden email]> wrote:

> Why is index creation afterwards so much faster than inserting with the
> indices in place?

Indexes are maintained in key order.  So inserts are happening at
random spots all across the index.  For each insert, the system has to
(1) read a 4096-byte page, (2) update the 10 or 20 bytes corresponding
to the index entry, and finally (3) write the 4096-byte page.

If you keep pages in cache, you might get lucky and update two or three
entries on each page before you have to write it back.  But even then,
you are doing a lot of I/O relative to the amount of content you are
changing.  Your I/O is hundreds of times more than the amount of
content being updated.
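
Using the numbers above (4096-byte pages, index entries of roughly 20
bytes), the worst case per insert is:

  4096 bytes read + 4096 bytes written = 8192 bytes of I/O
  for ~20 bytes of index entry actually changed
  => about 8192/20, or roughly 400 times more I/O than content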

CREATE INDEX does an external merge sort on the keys, such that keys
are always written in ascending order.  Hence, the pattern is (1) fill
up a 4096-byte page with new entries, then (2) write the page.  The
I/O to content change ratio is now close to 1.0, which is what you
want.  About a hundred times less I/O than random inserts.

The fact that all pages are written in order also helps, as
filesystems tend to be optimized for that case.

--
D. Richard Hipp
[hidden email]

Re: How to increase performance when inserting a lot of small data into table using indices

Keith Medcalf
In reply to this post by mailing lists

On Tuesday, 10 September, 2019 09:26, mailing lists <[hidden email]> wrote:

>I cannot really put all the inserts into one transaction because, in
>case of a failure, I would lose all the already inserted data. I did
>make some tests, though: there is hardly any performance gain anymore
>when doing 1 000 or 10 000 insertions in one transaction with immediate
>insertion into the indices (in my case the difference is in the percent
>range).

>Why is index creation afterwards so much faster than inserting with the
>indices in place? Once I heard something about fragmentation, but on
>solid-state disks fragmentation should not play a role as long as
>indices and data are separated, should it?

When you "create" an index as a single operation, you scan the table to collect the key data, do a sort, perform an in-order insertion into the B-Tree, and then write out the entire tree all at once (it is a single transaction).

When the index already exists, you have to "fiddle" with the B-Tree for each record because it is no longer being built in order.  You have to split and combine pages and shuffle the data about as each record is inserted into the index.  The more records that can be inserted per transaction the less I/O will be required (and you have to have a big enough cache).

If you can pre-sort the records so that they are inserted in-order and use a relatively large number of records inserted per transaction, then there will be very little difference between the two.  Of course, if you have multiple indexes then pre-sorting into the most efficient insertion order is a high art (and often not entirely possible anyway).
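
A rough sketch of the pre-sorting idea (the staging table and column
names are invented for illustration):

  -- collect the incoming rows in an unindexed staging table first
  CREATE TEMP TABLE staging(a, b, c);
  -- ... bulk INSERTs into staging ...

  -- then move them into the indexed table in index key order;
  -- with several indexes only one of them can get this benefit
  INSERT INTO t(a, b, c)
    SELECT a, b, c FROM staging ORDER BY a;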

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.





Re: How to increase performance when inserting a lot of small data into table using indices

Simon Slavin
In reply to this post by mailing lists
On 10 Sep 2019, at 4:02pm, mailing lists <[hidden email]> wrote:

> Insertion really slows down after about 100 000 items have been inserted. I suppose that the slowdown is related to indexing because:
>
> [...]
> c) changing the cache size has only a minor impact

SQLite speed does degrade with table size, but it degrades gracefully.  Depending on which OS you're using, you have at least three caches in play.  A sudden change in speed indicates that you have just outgrown one of those caches, which fits what you wrote.  But in this case it's probably the OS's memory cache assigned to your program.

Re: How to increase performance when inserting a lot of small data into table using indices

Valentin Davydov
In reply to this post by mailing lists
On Tue, Sep 10, 2019 at 05:25:38PM +0200, mailing lists wrote:
> Hi,
>
> I cannot really put all the inserts into one transaction because, in case of a failure, I would lose all the already inserted data. [...]

What do you mean by "lose data"? Do you need the data to be immediately
available via SQL, or just written to persistent storage? In the latter
case you can implement your own data cache, such as sequential log files,
which are periodically (and/or on demand) rotated and afterwards
asynchronously parsed and inserted into the SQLite database with an
optimized cache_size, transaction size, journal mode etc., and deleted
only after a successful commit. Thus you shift the burden from SQL to the
filesystem, which is less constrained by the natural structure of the
data and might perform better.
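
The replay step might then look roughly like this (the pragma values are
examples to be tuned; rotating and deleting the log files happens in the
application, outside SQL):

  PRAGMA journal_mode = WAL;    -- or whatever measures best
  PRAGMA cache_size = -100000;  -- ~100 MB page cache
  BEGIN;
  -- ... one INSERT per record parsed from the log file ...
  COMMIT;
  -- the application deletes the log file only after a successful commit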

Valentin Davydov.


Re: How to increase performance when inserting a lot of small data into table using indices

Keith Medcalf
In reply to this post by mailing lists

On Tuesday, 10 September, 2019 09:26, mailing lists <[hidden email]> wrote:

>I cannot really put all the inserts into one transaction because, in
>case of a failure, I would lose all the already inserted data.

Why is that important?  Can you not just load it again from whence it came in the first place on failure?

>There is hardly any performance gain anymore when doing 1 000 or 10 000
>insertions in one transaction with immediate insertion into the indices
>(in my case the difference is in the percent range).

I find that the difference between (a) "inserting all the data into a table with indexes in a single transaction"; (b) "inserting all the data into a table without indexes in a single transaction and then creating the indexes"; and (c) "within a single transaction dropping the indexes, inserting all the data, then creating the indexes" is:

(b) is about 10% faster than (a)
(c) is about 40% faster than (a)

Smaller batch sizes result in more random I/O, and performance decreases as the batch size decreases.
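
Variant (c), schematically (hypothetical table and index names):

  BEGIN;
  DROP INDEX idx_t_a;            -- drop the indexes inside the transaction
  -- ... insert all the data ...
  CREATE INDEX idx_t_a ON t(a);  -- recreate them before committing
  COMMIT;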

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.




Re: How to increase performance when inserting a lot of small data into table using indices

mailing lists
Hi,

> On 2019-09-12 at 10:55, Keith Medcalf <[hidden email]> wrote:
>
> Why is that important?  Can you not just load it again from whence it
> came in the first place on failure?

The application also runs on mobile devices, and there the insertion (excluding indexing) takes a couple of minutes. This means that the user might switch to another app during insertion and, in the worst case, the system simply terminates my backgrounded app. With a single transaction, all of the inserted data would then be lost. This scenario has to be prevented.


Re: How to increase performance when inserting a lot of small data into table using indices

Jens Alfke


> On Sep 12, 2019, at 12:33 PM, mailing lists <[hidden email]> wrote:
>
> The application also runs on mobile devices, and there the insertion (excluding indexing) takes a couple of minutes. This means that the user might switch to another app during insertion and, in the worst case, the system simply terminates my backgrounded app. With a single transaction, all of the inserted data would then be lost.

[Off-topic]

Your app should not be getting terminated without notice! Use the platform's APIs for requesting background time. I'm not an Android programmer, but here's what to do on iOS:

When notified the app is being backgrounded, call UIApplication.beginBackgroundTask. The OS will give you more time to run. More importantly, you'll be notified when the OS wants you to quit, and can clean up (i.e. commit the transaction).

To guard against conditions where the OS needs to terminate the app while it's in the foreground (low memory or low battery), implement the UIApplicationDelegate method applicationWillTerminate and perform the same cleanup there.

Note that you won't get notifications like this if your DB processing is blocking the main thread's runloop. To prevent that, do the work on a background thread, or at least put the runloop in a modal state and periodically run it for an iteration.

—Jens