Best practices for huge batch inserts with WAL?

Best practices for huge batch inserts with WAL?

Jens Alfke-2
I’m importing a large data set with a lot of rows — an entire Wikipedia dump, about 60GB, one article per row — into a brand new SQLite database in WAL mode. What’s the fastest way to import it?

I started with one big transaction, but noted that (of course) the WAL file was growing rapidly while the main database file stayed tiny. I figured this would become inefficient, so I stopped the run and adjusted my code to commit and re-open a transaction every 10,000 rows.

With that, the import started quickly, but as time went on the commits were taking longer and longer, so the process was spending most of its time committing. (I wasn’t able to finish the job, as it ran into an unrelated fatal error in my code about ⅔ of the way through.)

Would it have been faster to use a single transaction? Even if the commit at the end is epic-length, it wouldn’t be rewriting the b-tree nodes over and over again. If so, would periodic WAL checkpoints help?
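For concreteness, a rough Python sketch of the batched-commit loop I described, with a periodic checkpoint added (the table, column names, and sample data are invented; the real import reads the dump instead):

```python
import os, sqlite3, tempfile

# Hypothetical sample data standing in for the real Wikipedia dump.
articles = [(f"Article {i}", f"body {i}") for i in range(25_000)]

db_path = os.path.join(tempfile.mkdtemp(), "wiki.db")
conn = sqlite3.connect(db_path, isolation_level=None)  # manage transactions manually
conn.execute("PRAGMA journal_mode = WAL")
conn.execute("CREATE TABLE articles(title TEXT, body TEXT)")

BATCH = 10_000
conn.execute("BEGIN")
for i, row in enumerate(articles, 1):
    conn.execute("INSERT INTO articles(title, body) VALUES (?, ?)", row)
    if i % BATCH == 0:
        conn.execute("COMMIT")
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")  # fold the WAL back into the main file
        conn.execute("BEGIN")
conn.execute("COMMIT")
```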

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: Best practices for huge batch inserts with WAL?

John Found
Do you have any indices on the table? Drop them before the huge insert and recreate them afterwards; it is faster.
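For example (an illustrative Python sketch; the table and index names are invented):

```python
import os, sqlite3, tempfile

db_path = os.path.join(tempfile.mkdtemp(), "wiki.db")
conn = sqlite3.connect(db_path, isolation_level=None)
conn.execute("CREATE TABLE articles(title TEXT, body TEXT)")
conn.execute("CREATE INDEX idx_title ON articles(title)")

# Drop the index, bulk-insert, then rebuild the index once at the end.
conn.execute("DROP INDEX idx_title")
conn.execute("BEGIN")
conn.executemany("INSERT INTO articles(title, body) VALUES (?, ?)",
                 ((f"Article {i:05d}", "...") for i in range(10_000)))
conn.execute("COMMIT")
conn.execute("CREATE INDEX idx_title ON articles(title)")  # one sorted build, not 10,000 updates
```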

On Mon, 24 Apr 2017 11:00:00 -0700
Jens Alfke <[hidden email]> wrote:



--
http://fresh.flatassembler.net
http://asm32.info
John Found <[hidden email]>

Re: Best practices for huge batch inserts with WAL?

Scott Hess
In reply to this post by Jens Alfke-2
What I'd do:

1) Look at the indices, and make sure the input is sorted to insert in
index order.  Also drop any unnecessary indices and add them back at the
end.  [Read the code for vacuum to see what kinds of things make sense to
defer.]

2) Bump up the cache a lot.  Inserting in sorted order makes this less
essential, but for a one-off like this it can't hurt.

3) Turn synchronous off while loading, and commit to starting over if you
get an OS-level crash.  Even with synchronous off, things should be
correctly ordered for app-level crashes.

4) Maybe use exclusive locking?  That wouldn't have a huge impact if you're
batching inserts, I expect.
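Points 2-4 boil down to a few PRAGMAs issued before the load starts. A minimal Python sketch (the cache size and table are illustrative; tune them to your machine):

```python
import os, sqlite3, tempfile

db_path = os.path.join(tempfile.mkdtemp(), "bulk.db")
conn = sqlite3.connect(db_path, isolation_level=None)
conn.execute("PRAGMA cache_size = -1000000")     # ~1 GB of page cache (negative = KiB)
conn.execute("PRAGMA synchronous = OFF")         # no fsyncs; restart the load after an OS crash
conn.execute("PRAGMA locking_mode = EXCLUSIVE")  # acquire the file lock once and keep it
conn.execute("CREATE TABLE articles(title TEXT, body TEXT)")
conn.execute("BEGIN")
conn.executemany("INSERT INTO articles VALUES (?, ?)",
                 ((f"Article {i}", "...") for i in range(1_000)))
conn.execute("COMMIT")
```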

---

WRT #3, you could also consider journal_mode off (or memory, if your code
requires transactions to work right).  In that case, the database state is
indeterminate if you have an app-level crash, but you should be fine if you
make it to the end.
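That variant is just a different journal_mode PRAGMA up front (illustrative sketch; with the journal off, COMMIT still works but ROLLBACK and crash recovery do not):

```python
import os, sqlite3, tempfile

db_path = os.path.join(tempfile.mkdtemp(), "nojournal.db")
conn = sqlite3.connect(db_path, isolation_level=None)
mode = conn.execute("PRAGMA journal_mode = OFF").fetchone()[0]  # or MEMORY
conn.execute("CREATE TABLE articles(title TEXT, body TEXT)")
conn.execute("BEGIN")
conn.executemany("INSERT INTO articles VALUES (?, ?)",
                 ((f"Article {i}", "...") for i in range(100)))
conn.execute("COMMIT")
```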

WRT #1, I would consider fts as an index for these purposes, but it may
require schema changes to make it possible to selectively disable/enable
the indexing.  See https://sqlite.org/fts5.html#external_content_tables .
I'm not sure there would be much gain from disabling fts when loading,
though, as long as your bulk transactions are large.
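The external-content shape looks roughly like this (illustrative sketch, assuming your SQLite build includes FTS5; with external content you populate the index explicitly, e.g. with one 'rebuild' after the bulk load instead of per-row triggers):

```python
import os, sqlite3, tempfile

db_path = os.path.join(tempfile.mkdtemp(), "fts_demo.db")
conn = sqlite3.connect(db_path, isolation_level=None)
conn.execute("CREATE TABLE articles(id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
# External-content FTS table: indexes articles without storing a second copy of the text.
conn.execute("CREATE VIRTUAL TABLE articles_fts USING fts5("
             "title, body, content='articles', content_rowid='id')")
conn.execute("INSERT INTO articles(title, body) VALUES ('Alpha', 'first article')")
# The FTS index knows nothing about that row yet; build it in one pass at the end.
conn.execute("INSERT INTO articles_fts(articles_fts) VALUES ('rebuild')")
rows = conn.execute("SELECT title FROM articles_fts WHERE articles_fts MATCH 'first'").fetchall()
```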

-scott


On Mon, Apr 24, 2017 at 11:00 AM, Jens Alfke <[hidden email]> wrote:


Re: Best practices for huge batch inserts with WAL?

Clemens Ladisch
Scott Hess wrote:
> WRT #3, you could also consider journal_mode off (or memory, if your code
> requires transactions to work right).  In that case, the database state is
> indeterminate if you have an app-level crash, but you should be fine if you
> make it to the end.

It would be a better idea to switch to one of the rollback journal modes
(DELETE/TRUNCATE/PERSIST) - they are guaranteed to work correctly in all
situations, and since the database being filled starts out nearly empty,
the journal cannot become large, because it stores only the _old_ data.
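In code, that is one PRAGMA before the load (illustrative sketch; unlike journal_mode OFF, this keeps full crash safety):

```python
import os, sqlite3, tempfile

db_path = os.path.join(tempfile.mkdtemp(), "rollback.db")
conn = sqlite3.connect(db_path, isolation_level=None)
mode = conn.execute("PRAGMA journal_mode = TRUNCATE").fetchone()[0]
conn.execute("CREATE TABLE articles(title TEXT, body TEXT)")
conn.execute("BEGIN")
conn.executemany("INSERT INTO articles VALUES (?, ?)",
                 ((f"Article {i}", "...") for i in range(1_000)))
conn.execute("COMMIT")  # the journal held only pre-transaction pages, so it stayed small
```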


Regards,
Clemens