Huge WAL log

Huge WAL log

Jan Slodicka
I understand that the WAL log uses less efficient storage rules than the real database, but this case was a real surprise for me. Here is the brief description.

We start from an empty database, create a few tables (each having a few indexes), then begin a transaction, do a couple of inserts into the first table, commit the transaction and repeat the same with the remaining tables.

No other active readers or writers.

Summary of this process:
- DB size 1.022 GB
- WAL log size 7.490 GB
- SHM file size 57 MB (about 7 million pages which, at a page size of 1024 bytes, corresponds to the WAL size)
- Total number of records inserted is 5.207 million.
- Among the tables there is one that dominates - it holds about 80% of all records. The commit for this table took over 30 minutes.
- The test was done on a rather old W7 notebook. Memory consumption approached 1 GB (compared to the normal ~100 MB).

I understand that we have to modify the above algorithm by splitting each transaction into smaller pieces. The official SQLite documentation only says "WAL does not work well for very large transactions. For transactions larger than about 100 megabytes...".

Could somebody quantify this statement? While the official documentation mentions the size (megabytes), my gut feeling is that it has more to do with the record count. (In our case large records are rather exceptional.)

Do table indexes play an important role? Wouldn't it be better to create them after all inserts are done?







Re: Huge WAL log

Richard Hipp-3
On 1/14/15, Jan Slodicka <[hidden email]> wrote:

> I understand that the WAL log uses less efficient storage rules than the
> real
> database, but this case was a real surprise for me. Here is the brief
> description.
>
> We start from an empty database, create a few tables (each having a few
> indexes), then begin a transaction, do a couple of inserts into the first
> table, commit the transaction and repeat the same with the remaining
> tables.
>
> No other active readers or writers.

Are you sure?  Because the WAL file should reset automatically after a
commit when the size goes above a threshold (1000 pages, by default).
This can fail if you (1) do something to turn off the automatic
checkpoint mechanism or (2) you are holding a read transaction open
that prevents the WAL file from resetting.  Usually the cause of your
problem is (2).  So, please double-check to make sure you are not
accidentally starting and leaving a transaction open.  Note that a
prepared statement (an sqlite3_stmt object) that has been stepped a
few times but never reset (sqlite3_reset()) or finalized
(sqlite3_finalize()) will hold a read transaction open, because it has
no way of knowing whether or not you might call sqlite3_step() again.
Are you sure you don't have a stray unfinished prepared statement in
your program?
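
For reference, a minimal sketch (C API; the connection handle name and the idea of checkpointing between the per-table transactions are assumptions, not something from this thread) of making sure automatic checkpointing is enabled and of requesting a checkpoint explicitly:

#include <stdio.h>
#include <sqlite3.h>

/* Sketch only: re-enable the default 1000-page autocheckpoint threshold
** and request a checkpoint.  nLog is the WAL size in frames and nCkpt is
** the number of frames actually copied back into the database. */
static void checkpoint_now(sqlite3 *db){
    int nLog = 0, nCkpt = 0;
    sqlite3_wal_autocheckpoint(db, 1000);
    if( sqlite3_wal_checkpoint_v2(db, 0, SQLITE_CHECKPOINT_FULL,
                                  &nLog, &nCkpt)==SQLITE_OK ){
        printf("WAL frames: %d, checkpointed: %d\n", nLog, nCkpt);
    }
}

If an older read transaction (such as a stray prepared statement) is blocking things, nCkpt may stay well below nLog, which is one way to observe the situation described above.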



--
D. Richard Hipp
[hidden email]

Re: Huge WAL log

Jan Slodicka
Richard Hipp-3 wrote
>> No other active readers or writers.
>
> Are you sure?

Writers for sure.

As far as readers are concerned, things are too complex to make an absolute statement. (I shall check once more.) However, I can add a few observations I made:

WAL file size was about 70 MB (as reported by the OS) until the critical table started.

The OS started to report 7GB after the commit of the critical table finished.

The commit itself took more than 30 minutes. After the commit the DB size grew by several hundred MB. What else could explain this except moving data from the WAL to the DB (i.e. a WAL reset)?

Afterwards several other tables were written (total number of records close to 1 million), but the WAL file did not grow anymore.

After the last table finished, the application was responsive nearly instantly. (No big data movement between WAL and DB.)

Finally, the application closed without any delay while deleting the WAL file. (Apparently the WAL file did not contain a large amount of unsaved data.)



> Because the WAL file should reset automatically after a commit...

To make sure that I understand: you mean moving a file pointer, not shrinking the file itself? (AFAIK, the WAL file never shrinks.)



P.S. We still use v3.7.15.2. (Would like to upgrade shortly.)

P.P.S. I have a copy of all SQLite files. If it helped, I could try to read the WAL file manually.



Re: Huge WAL log

Richard Hipp-3
On 1/14/15, Jan Slodicka <[hidden email]> wrote:

> Richard Hipp-3 wrote
>>> No other active readers or writers.
>>
>> Are you sure?
>
> Writers for sure.
>
> As far readers are concerned, the things are too complex to make an
> absolute
> statement. (I shall check once more.) However, I can add a few observations
> I made:
>
> WAL file size was about 70 MB (as reported by the OS) until the critical
> table started.

What is your page size?

>
> The OS started to report 7GB after the commit of the critical table
> finished.

Your original post said you inserted two rows for each transaction.
How big are those two rows?



--
D. Richard Hipp
[hidden email]

Re: Huge WAL log

Simon Slavin-3
In reply to this post by Jan Slodicka

On 14 Jan 2015, at 3:36pm, Jan Slodicka <[hidden email]> wrote:

> - WAL log size 7.490 GB

Please repeat your tests but as the first command after opening your database file issue

PRAGMA journal_size_limit = 1000000

With this change the WAL file may still grow to 7 GB while that particular transaction is being executed but should be reduced in size after the transaction is completed.
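
A minimal sketch of issuing this from the C API right after opening the connection (the filename and the lack of error handling are for illustration only):

#include <sqlite3.h>

/* Sketch only: open in WAL mode and ask SQLite to truncate the WAL file
** back down to about 1 MB whenever it is reset after a checkpoint. */
static sqlite3 *open_db(const char *zFilename){
    sqlite3 *db = 0;
    if( sqlite3_open(zFilename, &db)==SQLITE_OK ){
        sqlite3_exec(db, "PRAGMA journal_mode=WAL;", 0, 0, 0);
        sqlite3_exec(db, "PRAGMA journal_size_limit=1000000;", 0, 0, 0);
    }
    return db;
}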

Simon.

Re: Huge WAL log

Dan Kennedy-4
In reply to this post by Jan Slodicka
On 01/15/2015 12:28 AM, Jan Slodicka wrote:
> Richard Hipp-3 wrote
>>> No other active readers or writers.
>> Are you sure?
> Writers for sure.
>
> As far readers are concerned, the things are too complex to make an absolute
> statement. (I shall check once more.)

Some APIs that might be helpful:

   * sqlite3_get_autocommit() returns 0 if you have an explicit
transaction open.
   * sqlite3_next_stmt() can be used to iterate through all statements
belonging to a db connection.
   * sqlite3_stmt_busy() can be used to determine if a statement has
been stepped but not reset (and so may be holding open an implicit
transaction).

https://www.sqlite.org/c3ref/get_autocommit.html
https://www.sqlite.org/c3ref/next_stmt.html
https://www.sqlite.org/c3ref/stmt_busy.html
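
A minimal sketch (C API; the connection handle name is an assumption) of combining these to look for stray statements:

#include <stdio.h>
#include <sqlite3.h>

/* Sketch only: report whether an explicit transaction is still open and
** list every prepared statement on the connection that has been stepped
** but not yet reset or finalized. */
static void report_stray_statements(sqlite3 *db){
    sqlite3_stmt *pStmt;
    if( sqlite3_get_autocommit(db)==0 ){
        printf("an explicit transaction is still open\n");
    }
    for(pStmt=sqlite3_next_stmt(db, 0); pStmt; pStmt=sqlite3_next_stmt(db, pStmt)){
        if( sqlite3_stmt_busy(pStmt) ){
            printf("busy statement: %s\n", sqlite3_sql(pStmt));
        }
    }
}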

Dan.






Re: Huge WAL log

Jan Slodicka
In reply to this post by Richard Hipp-3
Richard Hipp-3 wrote
> What is your page size?

1024

Richard Hipp-3 wrote
> Your original post said you inserted two rows for each transaction.
> How big are those two rows?

Sorry for the misleading information. Here is a more formal algorithm:

foreach table
{
    BEGIN
    insert all downloaded rows
    COMMIT
}

The number of downloaded rows per table is unknown in advance. In this extreme case it was 3.5+ million records.

I understand that the WAL log must take a lot of space. What I don't understand is that it was 7x larger than the resulting DB size. (Actual quotient is even larger because I compared to the DB size that contained also other tables.)



Re: Huge WAL log

Jan Slodicka
In reply to this post by Simon Slavin-3
Simon Slavin-3 wrote
>> - WAL log size 7.490 GB
>
> Please repeat your tests but as the first command after opening your database file issue
>
> PRAGMA journal_size_limit = 1000000
>
> With this change the WAL file may still grow to 7 GB while that particular transaction is being executed but should be reduced in size after the transaction is completed.

Thanks for the tip. I'll do so.

Re: Huge WAL log

Jan Slodicka
I'll add the results from additional tests.

First of all, I forced a commit after each 100,000 records inserted into a single table. (A complication for us.)

Some numbers for a table with a single index and 3,423,000 inserted records:

Intermediate commits took successively 764 msec, 2164 msec, 4462 msec, ... 38532 msec. In other words, the commit time increased more or less steadily. I attribute this to the index building.
The total time spent in all commits was 730 secs.

Index rebuild (using sqlite3 shell) took 123 sec. This would suggest that it might be better to run with deleted indexes and rebuild them at the end.

WAL size started at 65MB and increased slightly from time to time, ending finally at 177MB. Very good.

DB size increased by roughly 17-18K after each commit. This suggests that WAL needs 10x more memory than the DB itself.

Other large tables yielded results similar to the above.

I made one additional observation worth reporting.

After an hour or so the debugger (not the application) stopped working, thus leaving an active WAL. Running SELECT COUNT(*) on the table currently being processed took 44 secs. However, after the WAL was checkpointed, the same select took 2 secs.




Re: Huge WAL log

Simon Slavin-3

On 15 Jan 2015, at 3:44pm, Jan Slodicka <[hidden email]> wrote:

> Index rebuild (using sqlite3 shell) took 123 sec. This would suggest that it
> might be better to run with deleted indexes and rebuild them at the end.

That is as expected, and is standard advice for cases where you are adding huge numbers of rows.

> WAL size started at 65MB and increased slightly from time to time, ending
> finally at 177MB. Very good.

Yes, that's better.  I assume you set a journal_size_limit value.

However, other information in your message suggests that you have a resource leak of some type somewhere.  In particular, it should not take 12 minutes to insert 3.5M rows into a simple table with an index or two unless really long strings or blobs are involved.

Unfortunately, I'm only really familiar with the C and PHP interfaces to SQLite.  But in both of those you can check the result code of each API call to make sure it is SQLITE_OK.  Are you able to do this with whatever interface you're using?

> DB size increased by roughly 17-18K after each commit. This suggests that
> WAL needs 10x more memory than the DB itself.


Very variable.  Depends on whether the changes in one transaction change many different pages or change fewer different pages multiple times.  At least, I think so.

Simon.

Re: Huge WAL log

Paul Sanderson
In reply to this post by Jan Slodicka
> I understand that the WAL log must take a lot of space. What I don't
> understand is that it was 7x larger than the resulting DB size. (Actual
> quotient is even larger because I compared to the DB size that contained
> also other tables.)

Unlike a rollback journal, a WAL file can contain multiple copies of the same page.

So, expanding the following code from your main loop may help us understand.

"insert all downloaded rows"

If your 5 million inserted records are 5 million separate insertions, then each
insertion could result in a table leaf page being written to the WAL
file. Each insertion could potentially also result in one (or more)
table interior pages being written to the WAL. Subsequent insertions
could result in multiple copies of the same page.

I don't know when any index would be updated - whether at each record
update, or when the transaction is committed. But if the former, then
there could be a relatively large number of index leaf and interior
pages. This could become significant if an index is built on a large
string field.

What is the schema for this table and what indexes are on it?

The above is based on my knowledge of the file formats and what I have
seen examining these files, not on the workings/code of SQLite, so
perhaps Dr Hipp could correct me if I'm wrong.


Paul
www.sandersonforensics.com
skype: r3scue193
twitter: @sandersonforens
Tel +44 (0)1326 572786
http://sandersonforensics.com/forum/content.php?195-SQLite-Forensic-Toolkit
-Forensic Toolkit for SQLite
http://sandersonforensics.com/forum/content.php?168-Reconnoitre - VSC
processing made easy

Re: Huge WAL log

Jan Slodicka
In reply to this post by Simon Slavin-3
Simon Slavin-3 wrote
> However, other information in your message suggests that you have a resource leak of some type somewhere.  In particular, it should not take 12 minutes to insert 3.5M rows into a simple table with an index or two unless really long strings or blobs are involved.
>
> Unfortunately, I'm only really familiar with the C and PHP interfaces to SQLite.  But in both of those you can check the result code of each API call to make sure it is SQLITE_OK.  Are you able to do this with whatever interface you're using?

We use a C# API originally inspired by the System.Data.SQLite library. Every single call is checked.
Some of the numbers I reported were obtained with a customized version of the sqlite3 shell.

These 12 min (730 secs) refer to the total time of 35 individual commits. It is the result of processing 7 GB of data. On the other hand, the processor load was very low the whole time, indicating that the disk might be a bit slow. (Although I did not observe any other slowdown...)

Simon Slavin-3 wrote
>> DB size increased by roughly 17-18K after each commit. This suggests that
>> WAL needs 10x more memory than the DB itself.
>
> Very variable.  Depends on whether the changes in one transaction change many different pages or change fewer different pages multiple times.  At least, I think so.

Sure, it is variable. But my goal when I opened this discussion was to discuss the worst possible case. That is what generates the user complaints; normal cases, when everything runs smoothly, are of no interest.

I consider this very important information that should be present in the official WAL documentation.


Re: Huge WAL log

Jan Slodicka
In reply to this post by Paul Sanderson
Paul Sanderson wrote
> So, expanding the following code from your main loop may help us understand.
>
> "insert all downloaded rows"

The code schema is as follows:

foreach table
{
    BEGIN
    INSERT INTO table VALUES(....)
    INSERT INTO table VALUES(....)
    ...
    COMMIT
}

Large column values are supplied as parameters, the rest (vast majority) is passed through SQL command.

Paul Sanderson wrote
> If your 5 million inserted records are 5 million separate insertions...

What alternative would you suggest?

Paul Sanderson wrote
> What is the schema for this table and what indexes are on it?

This is the first table where I noticed the problem:

CREATE TABLE [discounttype]
(
        [createdon] DATETIME NULL,
        [description] NTEXT NULL,
        [discounttypeid] UNIQUEIDENTIFIER NOT NULL CONSTRAINT PK_discounttype PRIMARY KEY ROWGUIDCOL DEFAULT (newid()),
        [isamounttype] BIT NOT NULL DEFAULT(0),
        [modifiedon] DATETIME NULL,
        [name] NVARCHAR(200) NULL COLLATE NOCASE
);
CREATE INDEX [FK_discounttype_name] ON [discounttype](name COLLATE NOCASE);

We use a custom NOCASE collation.

Also, we use an encrypted database. This slows down normal DB operations by a few percent.

Re: Huge WAL log

R Smith

On 2015/01/16 11:33, Jan Slodicka wrote:

>
> The code schema is as follows:
>
> foreach table
> {
>      BEGIN
>      INSERT INTO table VALUES(....)
>      INSERT INTO table VALUES(....)
>      ...
>      COMMIT
> }
>
> Large column values are supplied as parameters, the rest (vast majority) is
> passed through SQL command.

When you say "passed through SQL command", you mean using sqlite3_execute()?

If so, why not bind them all? That will be significantly faster...
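
A minimal sketch of that pattern in the C API (the table, column and chunk size are made up for illustration; the original code is C#):

#include <sqlite3.h>

/* Sketch only: reuse one prepared statement for every row and commit every
** 100000 rows so no single transaction (and hence the WAL) grows too large. */
static int bulk_insert(sqlite3 *db, const char **azName, int nRow){
    sqlite3_stmt *pStmt = 0;
    int i, rc;
    rc = sqlite3_prepare_v2(db, "INSERT INTO t(name) VALUES(?)", -1, &pStmt, 0);
    if( rc!=SQLITE_OK ) return rc;
    sqlite3_exec(db, "BEGIN", 0, 0, 0);
    for(i=0; i<nRow; i++){
        sqlite3_bind_text(pStmt, 1, azName[i], -1, SQLITE_STATIC);
        rc = sqlite3_step(pStmt);            /* SQLITE_DONE on success */
        sqlite3_reset(pStmt);                /* ready for the next row */
        if( rc!=SQLITE_DONE ) break;
        if( (i+1)%100000==0 ){               /* bound the transaction size */
            sqlite3_exec(db, "COMMIT", 0, 0, 0);
            sqlite3_exec(db, "BEGIN", 0, 0, 0);
        }
    }
    sqlite3_exec(db, "COMMIT", 0, 0, 0);
    sqlite3_finalize(pStmt);                 /* also releases any implicit read transaction */
    return rc==SQLITE_DONE ? SQLITE_OK : rc;
}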


Re: Huge WAL log

Jan Slodicka
In reply to this post by Paul Sanderson
Paul Sanderson wrote
> Unlike a rollback journal, a WAL file can contain multiple copies of the same page.
>
> So, expanding the following code from your main loop may help us understand.
>
> "insert all downloaded rows"
>
> If your 5 million inserted records are 5 million separate insertions, then each
> insertion could result in a table leaf page being written to the WAL
> file. Each insertion could potentially also result in one (or more)
> table interior pages being written to the WAL. Subsequent insertions
> could result in multiple copies of the same page.

Maybe you got me onto the right track...

I'll start with the record size estimation:

CREATE TABLE [discounttype]
(
        [createdon] DATETIME NULL,
        [description] NTEXT NULL,
        [discounttypeid] UNIQUEIDENTIFIER NOT NULL CONSTRAINT PK_discounttype PRIMARY KEY ROWGUIDCOL DEFAULT (newid()),
        [isamounttype] BIT NOT NULL DEFAULT(0),
        [modifiedon] DATETIME NULL,
        [name] NVARCHAR(200) NULL COLLATE NOCASE
);

The record will be rather small. It consists of:
- 2x DATETIME - usually 38 bytes
- UNIQUEIDENTIFIER - 32 bytes
- 2 rather short strings

Altogether I wouldn't be surprised if the page size (1024) is about 7x larger than the average record length. This (together with your explanation) would explain the extreme WAL size.

---------------------------

Thanks to your post I discovered multiple-row inserts so that I now understand what you asked.

I found a nice article which (although written for SQL Server) shows that multi-row inserts can greatly speed up bulk inserts. I am quite confident that most of the arguments made in that article also apply to our case (SQLite used from C#).

I will have to test it thoroughly and (if successful) modify our application code.


Re: Huge WAL log

Simon Slavin-3

On 16 Jan 2015, at 12:39pm, Jan Slodicka <[hidden email]> wrote:

> Thanks to your post I discovered multiple-row inserts so that I now
> understand what you asked.

Just a note that multiple-row inserts were added to SQLite relatively recently (2012-03-20 (3.7.11)) and, because SQLite does only database-level locking, its overhead for INSERTs is far less than that of SQL Server.  It might be faster to use them, but I would not expect it to be that much faster than many separate INSERTs within one transaction.
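
For clarity, the multi-row form (available from 3.7.11 on) looks like this; the table and values are made up for illustration:

#include <sqlite3.h>

/* Sketch only: a single statement carrying several rows. */
static int insert_batch(sqlite3 *db){
    return sqlite3_exec(db,
        "INSERT INTO t(name) VALUES ('aaa'), ('bbb'), ('ccc');",
        0, 0, 0);
}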

Simon.

Re: Huge WAL log

Jan Slodicka
In reply to this post by R Smith
RSmith wrote
>> The code schema is as follows:
>>
>> foreach table
>> {
>>      BEGIN
>>      INSERT INTO table VALUES(....)
>>      INSERT INTO table VALUES(....)
>>      ...
>>      COMMIT
>> }
>>
>> Large column values are supplied as parameters, the rest (vast majority) is
>> passed through SQL command.
>
> When you say "passed through SQL command", you mean using sqlite3_execute()?
>
> If so, why not bind them all? That will be significantly faster...

Not so in the C# wrapper, where the overhead of binding is considerable.

Re: Huge WAL log

Jan Slodicka
In reply to this post by Simon Slavin-3
Simon Slavin-3 wrote
>> Thanks to your post I discovered multiple-row inserts so that I now
>> understand what you asked.
>
> Just a note that multiple-row inserts were added to SQLite relatively recently (2012-03-20 (3.7.11)) and, because SQLite does only database-level locking, its overhead for INSERTs is far less than that of SQL Server.  It might be faster to use them, but I would not expect it to be that much faster than many separate INSERTs within one transaction.

I made a fast, perhaps oversimplified, TestA:

Create an empty database. (file based)
CREATE TABLE Test (city NVARCHAR(120) NULL COLLATE NOCASE)
BEGIN
Then I inserted N records using commands such as
INSERT INTO Test VALUES('_random_'). (_random was a random 8 character long string.)
COMMIT
Measure WAL size
Close the DB
Measure DB size

TestB was performed with the same data except the records were grouped
INSERT INTO Test VALUES('_random_'), ('_random1'), ...

I tested different groupings (1,2,5,10) and different N values (100000 - 2500000).

Results:
- The more records are grouped, the faster.
- Grouping of 10 records was more than 2x faster than no grouping at all.
- WAL size did not depend on the grouping used and was just slightly larger than the DB size.

Then I modified the test by adding an index on the single column. I ran 2 sets of tests - one where the index was created before the first insert and one with the index created after all inserts finished.

Results:
- Active index: WAL size ~ DB size
- Inactive index: WAL size ~ 50% of the DB size
- Tests with an active index were slower by 15-20%

Conclusion:
Switching off the indexing during a bulk insert brings minor advantages.
Multi-row inserts may bring larger advantages.
The reason of the "huge WAL problem" remains unclear.

Note:
The tests were programmed in C# code that called the native SQLite.dll and were run on a W7 desktop. The results may not apply to other environments.


Re: Huge WAL log

Dan Kennedy-4
On 01/17/2015 12:04 AM, Jan Slodicka wrote:

> Simon Slavin-3 wrote
>>> Thanks to your post I discovered multiple-row inserts so that I now
>>> understand what you asked.
>> Just a note that multiple-row inserts were added to SQLite relatively
>> recently (2012-03-20 (3.7.11)) and, because SQLite does only
>> database-level locking, its overhead for INSERTs is far less than that of
>> SQL Server.  It might be faster to use them, but I would not expect it to be
>> that much faster than many separate INSERTs within one transaction.
> I made a fast, perhaps oversimplified, TestA:
>
> Create an empty database. (file based)
> CREATE TABLE Test (city NVARCHAR(120) NULL COLLATE NOCASE)
> BEGIN
> Then I inserted N records using commands such as
> INSERT INTO Test VALUES('_random_'). (_random was a random 8 character long
> string.)
> COMMIT
> Measure WAL size
> Close the DB
> Measure DB size
>
> TestB was performed with the same data except the records were grouped
> INSERT INTO Test VALUES('_random_'), ('_random1')...
>
> I tested different groupings (1,2,5,10) and different N values (100000 -
> 2500000).
>
> Results:
> - The more records are grouped, the faster.
> - Grouping of 10 records was more than 2x faster than no grouping at all.
> - WAL size did not depend on the grouping used and was just slightly larger
> than the DB size.
>
> Then I modified the test by adding an index on the single column. I run 2
> sets of tests - one where the index was created before first insert and the
> one with the index created after all inserts finished.
>
> Results:
> - Active index: WAL size ~ DB size
> - Inactive index: WAL size ~ 50% of the DB size
> - Tests with an active index were slower by 15-20%
>
> Conclusion:
> Switching off the indexing during a bulk insert brings minor advantages.
> Multi-row inserts may bring larger advantages.
> The reason of the "huge WAL problem" remains unclear.


Is it correct that you have a single transaction inserting lots of data
into a table with multiple indexes on it? Something like 1GB?

When an SQL write transaction is performed, SQLite begins by modifying
database pages within its internal page-cache. If the transaction
modifies only a few pages, all dirty pages stay in the cache until the
user executes "COMMIT", at which point they are appended to the *-wal
file. However, the page-cache is of limited size (by default 2000
pages), and once it is completely full of dirty pages SQLite begins
appending them to the *-wal file mid-transaction in order to free up
space. At this point each time SQLite needs to modify a page that is not
already in the cache it must select a dirty page to write out to the
*-wal file so as to free up space to load the new page into the cache
where it can be modified. This means that a transaction with a large
working set may append more than one copy of a single page to the *-wal
file. Maybe many, many copies.

SQLite indexes are b-trees. Each b-tree node is stored on a database
page. So if you're inserting keys in random order into a large index
(one too large to fit entirely within the page-cache), then virtually
all inserts result in an existing dirty page being flushed from the
cache and appended to the *-wal file.

To avoid this it seems like there are two options - make the page-cache
large enough to hold all the indexes or insert keys into the indexes in
sorted order. Inserting keys in sorted order avoids the problem because
all writes go to the right-most leaf node of the index b-tree, which
will almost always be present in the page-cache.

To change the size of the page-cache, see the "PRAGMA cache_size" and
"PRAGMA page_size" commands.

One way to get keys to be inserted in order is to create all indexes
after populating the table. SQLite sorts the data before creating the
index b-tree in this case.

The other is to create a temp (or non temp) table with *exactly the
same* columns and indexes as the table to be populated and insert the
new rows into it. Then running:

   INSERT INTO target_tbl SELECT * FROM temp_tbl;

In this case, SQLite detects the similar schemas and copies keys in
sorted order from the indexes on "temp_tbl" to the corresponding index
on "target_tbl".

Dan.








Re: Huge WAL log

Keith Medcalf
In reply to this post by Jan Slodicka

Some data using python+apsw for 10,000,000 records:

So, embedding the values in the SQL (i.e., as text statements) and preparing a "new" statement each time is the least efficient.
Multiple values embedded in the SQL are more efficient, up to about 50-100 values per statement (50% faster).
Preparing a single statement and using the same statement to insert with the data being bound, one row at a time, is even more efficient.
The last case (a list of prepared bindings) is not significantly different from the "one at a time" bound insert.

No significant difference for WAL journalling.

Generated 10000000 random values in 50.9100000858 seconds
Inserted 10000000 records individually parsed in 154.308000088 seconds
Inserted 10000000 in groups of 2 in 158.332999945 seconds
Inserted 10000000 in groups of 5 in 104.661999941 seconds
Inserted 10000000 in groups of 10 in 86.7520000935 seconds
Inserted 10000000 in groups of 25 in 76.3989999294 seconds
Inserted 10000000 in groups of 50 in 73.9539999962 seconds
Inserted 10000000 in groups of 100 in 72.7170000076 seconds
Inserted 10000000 in groups of 500 in 76.751999855 seconds
Inserted 10000000 in groups of 1000 in 77.3090000153 seconds
Inserted 10000000 in groups of 5000 in 76.0599999428 seconds
Inserted 10000000 in groups of 10000 in 74.4079999924 seconds
Inserted 10000000 records via binding in 59.3770000935 seconds
Inserted 10000000 records via single binding list in 53.5610001087 seconds

Generated by:

import random
import time
import apsw

recs = 10000000
db = apsw.Connection('test.db')
cr = db.cursor()

chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
data = set()

start = time.time()
while len(data) < recs:
    item = ''
    for j in xrange(8):
        item += chars[int(random.random() * 52)]
    data.add(item)
print 'Generated', recs, 'random values in', time.time() - start, 'seconds'

data = list(data)
mdata = [(x,) for x in data]

cr.execute('drop table if exists test; create table test ( x text unique ); vacuum;')

start = time.time()
cr.execute('begin')
for i in xrange(recs):
    cr.execute("insert into test values ('%s')" % (data[i],))
cr.execute('commit')
print 'Inserted', recs, 'records individually parsed in', time.time() - start, 'seconds'

for steps in [2, 5, 10, 25, 50, 100, 500, 1000, 5000, 10000]:
    cr.execute('drop table if exists test; create table test ( x text unique ); vacuum;')
    start = time.time()
    cr.execute('begin')
    for i in xrange(0, recs, steps):
        sql = "insert into test values ('" + "'),('".join(data[i:i+steps]) + "')"
        cr.execute(sql)
    cr.execute('commit')
    print 'Inserted', recs, 'in groups of', steps, 'in', time.time() - start, 'seconds'

cr.execute('drop table if exists test; create table test ( x text unique ); vacuum;')

start = time.time()
cr.execute('begin')
for i in xrange(recs):
    cr.execute("insert into test values (?)", (data[i],))
cr.execute('commit')
print 'Inserted', recs, 'records via binding in', time.time() - start, 'seconds'

cr.execute('drop table if exists test; create table test ( x text unique ); vacuum;')

start = time.time()
cr.execute('begin')
cr.executemany("insert into test values (?)", mdata)
cr.execute('commit')
print 'Inserted', recs, 'records via single binding list in', time.time() - start, 'seconds'


---
Theory is when you know everything but nothing works.  Practice is when everything works but no one knows why.  Sometimes theory and practice are combined:  nothing works and no one knows why.
