Tracking database corruption

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Tracking database corruption

Nicholas Lovell-2
When getting sqlite_corrupt when reading from a database, is there a way to know if a particular failure came from reading/initializing a page from a WAL file, or the database disk file?


Specifically I've been seeing a number of failures around btreeInitPage. However when I pull up the saved copies I make whenever corruption occurs and run "pragma integrity_check", its coming up as "ok".


I am not at this time keeping hold of the WAL file though, so wanted to know if I can confirm that the corruption was in the WAL file or if something else might be at fault.


This is using sqlite 3.16.2.

________________________________

This email and any attachments may contain confidential and privileged material for the sole use of the intended recipient. Any review, copying, or distribution of this email (or any attachments) by others is prohibited. If you are not the intended recipient, please contact the sender immediately and permanently delete this email and any attachments. No employee or agent of TiVo Inc. is authorized to conclude any binding agreement on behalf of TiVo Inc. by email. Binding agreements with TiVo Inc. may only be made by a signed written agreement.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Tracking database corruption

Simon Slavin-3
On 6 Nov 2017, at 10:38pm, Nicholas Lovell <[hidden email]> wrote:

> When getting sqlite_corrupt when reading from a database, is there a way to know if a particular failure came from reading/initializing a page from a WAL file, or the database disk file?

Obtain the extended error code to learn more than you already know:

<https://sqlite.org/c3ref/errcode.html>
<https://sqlite.org/rescode.html#extrc>

> Specifically I've been seeing a number of failures around btreeInitPage. However when I pull up the saved copies I make whenever corruption occurs and run "pragma integrity_check", its coming up as "ok".

What exactly are you "saving" ?  The database file ?  Or the WAL file too ?  Or the SHM file too ?

It would probably be useful to figure out what’s causing your corruption.  What PRAGMAs does your program use under normal use ?  Are you using anything clever like shared cache ?  Also, is your database stored on a disk in the computer running SQLite or it is accessed remotely, perhaps across a network ?

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Tracking database corruption

Nicholas Lovell-2
There aren't any extended error codes for sqlite_corrupt. Its just error code 11 "database disk image is malformed".


I have the line numbers where the corruption is getting hit (they vary, but several are in btreeInitPage)


> What exactly are you "saving" ?  The database file ?  Or the WAL file too ?  Or the SHM file too ?

As for what I'm saving off, I do a byte for byte copy on the database files that were used in the statement that detected corruption (this is automated). I'm not copying the "-shm" or "-wal" files (an oversight which I'll fix). I was hoping to figure out if its likely capturing the "-wal" file would find the corruption, or if I could determine that the page its failing on was from the file I have (in which case something else is presumably going wrong).


In terms of pragmas during normal usage: user_version, recursive_triggers, cache_size, mmap_size (max size is set to 15 MiB; which covers the files in question entirely), journal_mode (=wal). The only somewhat recent addition is mmap (along with an upgrade from 3.8.8.3 to 3.16.2).


These databases are stored on a local disk (admittedly using an encrypting loop-back adapter). I doubt that I'll necessarily be able to track down what is causing the corruption given how infrequently it ends up occurring, storing the corrupt copies is my attempt at being able to track them down (since they are very infrequent and automatically "recovered" from).

________________________________
From: sqlite-users <[hidden email]> on behalf of Simon Slavin <[hidden email]>
Sent: Monday, November 6, 2017 3:53:41 PM
To: SQLite mailing list
Subject: Re: [sqlite] Tracking database corruption

On 6 Nov 2017, at 10:38pm, Nicholas Lovell <[hidden email]> wrote:

> When getting sqlite_corrupt when reading from a database, is there a way to know if a particular failure came from reading/initializing a page from a WAL file, or the database disk file?

Obtain the extended error code to learn more than you already know:

<https://sqlite.org/c3ref/errcode.html>
<https://sqlite.org/rescode.html#extrc>

> Specifically I've been seeing a number of failures around btreeInitPage. However when I pull up the saved copies I make whenever corruption occurs and run "pragma integrity_check", its coming up as "ok".

What exactly are you "saving" ?  The database file ?  Or the WAL file too ?  Or the SHM file too ?

It would probably be useful to figure out what’s causing your corruption.  What PRAGMAs does your program use under normal use ?  Are you using anything clever like shared cache ?  Also, is your database stored on a disk in the computer running SQLite or it is accessed remotely, perhaps across a network ?

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

________________________________

This email and any attachments may contain confidential and privileged material for the sole use of the intended recipient. Any review, copying, or distribution of this email (or any attachments) by others is prohibited. If you are not the intended recipient, please contact the sender immediately and permanently delete this email and any attachments. No employee or agent of TiVo Inc. is authorized to conclude any binding agreement on behalf of TiVo Inc. by email. Binding agreements with TiVo Inc. may only be made by a signed written agreement.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Tracking database corruption

Richard Hipp-3
In reply to this post by Nicholas Lovell-2
On 11/6/17, Nicholas Lovell <[hidden email]> wrote:

> When getting sqlite_corrupt when reading from a database, is there a way to
> know if a particular failure came from reading/initializing a page from a
> WAL file, or the database disk file?
>
>
> Specifically I've been seeing a number of failures around btreeInitPage.
> However when I pull up the saved copies I make whenever corruption occurs
> and run "pragma integrity_check", its coming up as "ok".
>
>
> I am not at this time keeping hold of the WAL file though, so wanted to know
> if I can confirm that the corruption was in the WAL file or if something
> else might be at fault.
>

It would be helpful if you could preserve the -wal file too.
--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Tracking database corruption

Richard Hipp-3
In reply to this post by Nicholas Lovell-2
On 11/6/17, Nicholas Lovell <[hidden email]> wrote:
> There aren't any extended error codes for sqlite_corrupt. Its just error
> code 11 "database disk image is malformed".

Additional information is available if you use the error and warning
log feature.  https://www.sqlite.org/errlog.html

--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Tracking database corruption

Simon Slavin-3
In reply to this post by Nicholas Lovell-2


On 7 Nov 2017, at 1:11am, Nicholas Lovell <[hidden email]> wrote:

> In terms of pragmas during normal usage: user_version, recursive_triggers, cache_size, mmap_size (max size is set to 15 MiB; which covers the files in question entirely), journal_mode (=wal). The only somewhat recent addition is mmap (along with an upgrade from 3.8.8.3 to 3.16.2).

Try not using memory mapping.  I don’t have a specific reason to think it’s causing the problem but it’s easy to turn off without modifying much of your source code, and it has caused problems on one platform in the past.

> These databases are stored on a local disk (admittedly using an encrypting loop-back adapter). I doubt that I'll necessarily be able to track down what is causing the corruption given how infrequently it ends up occurring, storing the corrupt copies is my attempt at being able to track them down (since they are very infrequent and automatically "recovered" from).

Encrypting adapters should not be causing a problem.

I forgot to ask whether you’re using simultaneously accessing the database from multiple threads or processes or applications.  If you’re doing this, please give details, including of your disk format, since this can be an easy cause of corruption.  If not, this simplifies things considerably.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Tracking database corruption

Nicholas Lovell-2
I am using the errlog interface, which is why I have line numbers for where its occurring. The location varies with some clusters around

R-47608-56469, and /* Freeblock off the end of the page */, although there are other locations as well.


>Try not using memory mapping.  I don’t have a specific reason to think it’s causing the problem but it’s easy to turn off without modifying much of your source code, and > it has caused problems on one platform in the past.

I only recently turned it on in order to relieve memory pressure from other parts of the system. Given how low the frequency is (something in the thousandths of a percent chance in a given day for a given device [It was ~0.0073%, its currently ~0.0185%), its hard to know when there is a signal vs just noise.

>I forgot to ask whether you’re using simultaneously accessing the database from multiple threads or processes or applications.  If you’re doing this, please give details, including of your disk format, since this can be an easy cause of corruption.  If not, this simplifies things considerably.

I am using multiple threads and processes to access the databases. The library is configured as multi-thread, with the connection handles mutexed by the application. Most statements have their sqlite3_stmt get cached (and these caches are per-connection handle since the prepared statements only work against a given handle).  The page size is configured at 4k (although I suppose that is the default now). The database files grow in 1 MiB increments (using sqlite3_file_control). Most database handles have multiple databases attached (with main being a database with an empty schema).

Having multiple handles per database is in part why switching to mmap saved some memory pressure (since page cache isn't shared without shared_cache).

Disk is formatted as ext3 (linux 2.6.31 kernel; relevant mount options: relatime,errors=continue,barrier=1,data=ordered).  Writes are not actually getting coalesced by the kernel disk scheduling (there are reasons for this, but none of which are relevant for this).



________________________________
From: sqlite-users <[hidden email]> on behalf of Simon Slavin <[hidden email]>
Sent: Tuesday, November 7, 2017 8:43 AM
To: SQLite mailing list
Subject: Re: [sqlite] Tracking database corruption



On 7 Nov 2017, at 1:11am, Nicholas Lovell <[hidden email]> wrote:

> In terms of pragmas during normal usage: user_version, recursive_triggers, cache_size, mmap_size (max size is set to 15 MiB; which covers the files in question entirely), journal_mode (=wal). The only somewhat recent addition is mmap (along with an upgrade from 3.8.8.3 to 3.16.2).

Try not using memory mapping.  I don’t have a specific reason to think it’s causing the problem but it’s easy to turn off without modifying much of your source code, and it has caused problems on one platform in the past.


> These databases are stored on a local disk (admittedly using an encrypting loop-back adapter). I doubt that I'll necessarily be able to track down what is causing the corruption given how infrequently it ends up occurring, storing the corrupt copies is my attempt at being able to track them down (since they are very infrequent and automatically "recovered" from).

Encrypting adapters should not be causing a problem.

I forgot to ask whether you’re using simultaneously accessing the database from multiple threads or processes or applications.  If you’re doing this, please give details, including of your disk format, since this can be an easy cause of corruption.  If not, this simplifies things considerably.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

________________________________

This email and any attachments may contain confidential and privileged material for the sole use of the intended recipient. Any review, copying, or distribution of this email (or any attachments) by others is prohibited. If you are not the intended recipient, please contact the sender immediately and permanently delete this email and any attachments. No employee or agent of TiVo Inc. is authorized to conclude any binding agreement on behalf of TiVo Inc. by email. Binding agreements with TiVo Inc. may only be made by a signed written agreement.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Tracking database corruption

Simon Slavin-3
Given what you wrote, and assuming you are not accidentally writing to memory or file handles maintained by SQLite, your corruption is probably caused by

A) Faulty implementation of memory mapping

B) Multiple thread/processes writing to the database at one time, through some fault in mutexing.

Of the two, (A) is almost trivial to check.  Turn it off, and see whether the fault continues to occur.  I’m aware this will make your program use more time/resources, but it’s just for temporary testing.

Testing (B) is more difficult and requires understanding of your own specific program design.  You should only have to worry about parts of your code which write to the database.  Pay special attention to whether you’re sharing SQLite connections or not.  This is so difficult to debug you might try the other things mentioned here before you investigate this problem.

> Disk is formatted as ext3 (linux 2.6.31 kernel; relevant mount options: relatime,errors=continue,barrier=1,data=ordered).  Writes are not actually getting coalesced by the kernel disk scheduling (there are reasons for this, but none of which are relevant for this).


I’m not aware of problems on ext3 with barrier=1.  However, you might consider trying errors=panic for a while.  This is something I recommend for all servers since it allows you to identify faults far more quickly than trying to reverse-engineer corruption reports.

In closing, please note that a good proportion of the data-corruption problems reported here turn out to be caused by hardware problems.  Some are in motherboard data channels, others in a storage subsystem.  Consider that all your code may be faultless, SQLite may be faultless, and that the problem is in hardware.

You might learn something more from section 6 of

<https://sqlite.org/lockingv3.html>

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users