Known reasons why sqlite3_open_v2 can take over 60s on windows?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Known reasons why sqlite3_open_v2 can take over 60s on windows?

Andrzej Fiedukowicz
Hi mailing list gurus!
I will start with TL;DR version as this may be enough for some of you:
  * We are trying to investigate an issue that we see in diagnostic data
of our C++ product.
  * The issue was pinpointed to be caused by timeout on
`sqlite3_open_v2` which supposedly takes over 60s to complete (we only
give it 60s).
  * We tried multiple different configurations, but never were able to
reproduce even 5s delay on this call.

So the question is if maybe there are some known scenarios in which
`sqlite3_open_v2` can take that long (on windows)?

Now to the details:
  * We are using version `3.10.2` of SQLite. We went through changelogs
from this version till now and nothing we've found in bugfixes section
seems to suggest that there was some issue that was addressed in
consecutive sqlite releases and may have caused our problem.
  * The issue we see affects around 0.1% unique user across all
supported versions of windows (Win 7, Win 8, Win 10). There are no
manual user complains/reports about that - this can suggest that problem
happens in the context where something serious enough is happening with
user machine/system that he doesn't expect anything to work. So
something that indicates system wide failure is a valid possibility as
long as it can possibly happen for 0.1% of random windows users.
  * There are no data indicating that the same issue ever occurred on
Mac which is also supported platform with large enough sample of
diagnostic data.
  * We are using Poco (https://github.com/pocoproject/poco, version:
1.7.2) as a tool for accessing our SQLite database, but we've analyzed
the Poco code and it seems that failure on this code level can only
(possibly) explain ~1% of all collected samples. This is how we've
determined that problem lies in `sqlite3_open_v2` taking long time.
  * This happens on both `DELETE` journal mode as well as on `WAL`.
  * It seems like after this problem happens first time for a particular
user each consecutive call to `sqlite3_open_v2` takes that long until
user restarts whole application (possibly machine, no way to tell from
our data).
  * We are using following flags setup for `sqlite3_open_v2` (as in Poco):

  > sqlite3_open_v2(..., ..., SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE
| SQLITE_OPEN_URI, NULL);

  * This usually doesn't happen on startup of the application so it's
not likely to be caused by something happening while our application is
not running. This includes power cuts offs causing data destruction
(which tend to return SQLITE_CORRUPT anyway, as mentioned in:
https://www.sqlite.org/howtocorrupt.html).
  * We were never able to reproduce this issue locally even though we
tried different things:
    ** Multiple threads writing and reading from DB with synchronization
required by particular journaling system.
    ** Keeping sqlite connection open for long time and working on db
normally in a mean while.
    ** Trying to hit HDD hard with other data (dumping /dev/rand (WSL)
to multiple files from different processes while accessing DB normally).
    ** Trying to force antivirus software to scan db on every file
access (tested with Avast with basically everything enabled including
"scan on open" and "scan on write").
    ** Breaking our internal synchronization required by particular
journaling systems.
    ** Calling WinAPI CreateFile with all possible combinations of file
sharing options on db file - this caused issues but `sqlite3_open_v2`
always returned fast - just with error.
    ** Calling WinAPI LockFile on random parts of DB file which is btw.
nice way of reproducing `SQLITE_IOERR`, but no luck with reproducing the
discussed issue.
    ** Some additional attempts to actually stretch Poco layer and
double check if our static analysis of codes are right.
  * We've tried to look for similar issues online but anything somewhat
relevant we've found was here
http://sqlite.1065341.n5.nabble.com/sqlite3-open-v2-performance-degrades-as-number-of-opens-increase-td37482.html 
. This doesn't seem to explain our case though, as the numbers of
parallel connections are way beyond what we have as well as what would
typical windows user have (unless there is some somewhat popular app
exploiting sqlite which we don't know about).

Do you have any ideas what can cause that issue?

Maybe some hints what else should we check or what additional diagnostic
data that we can collect from users would be useful to pinpoint the real
reason why that happens?

Thanks in advance :)
Andrzej 'Yester' Fiedukowicz

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Known reasons why sqlite3_open_v2 can take over 60s on windows?

Richard Hipp-3
On 7/31/18, Andrzej Fiedukowicz <[hidden email]> wrote:
>   * The issue was pinpointed to be caused by timeout on
> `sqlite3_open_v2` which supposedly takes over 60s to complete (we only
> give it 60s).

Are you sure the time is happening on sqlite3_open_v2()?  Because that
routine just opens the file descriptors but never reads from the
database files.  Perhaps Poco is doing something else (like setting
some PRAGMAs or something) after the sqlite3_open_v2() call?

Anyhow - if the database needs recovery, the first thing that SQLite
will do will run recovery.  And if the database is large and the
transaction(s) that are being recovered are large, and if your machine
is slow, then I suppose that could take a long time.  Even so, 60
milliseconds seems like a more reasonable number than 60 seconds.  I
don't know what is going on.  I wish you could reproduce the problem.

AV software is also notorious for causing problems on Windows
machines.  That might also be a problem.

--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Known reasons why sqlite3_open_v2 can take over 60s on windows?

R Smith-2
In reply to this post by Andrzej Fiedukowicz
On 2018/07/31 5:02 PM, Andrzej Fiedukowicz wrote:
> Hi mailing list gurus!
> I will start with TL;DR version as this may be enough for some of you:
>  * We are trying to investigate an issue that we see in diagnostic
> data of our C++ product.
>  * The issue was pinpointed to be caused by timeout on
> `sqlite3_open_v2` which supposedly takes over 60s to complete (we only
> give it 60s).
>  * We tried multiple different configurations, but never were able to
> reproduce even 5s delay on this call.

The only thing that comes to mind (others may know of more scenarios) is
that there was a hot journal file that needed to be rolled back. If
SQLite (or the host application) crashes in the midst of a transaction,
there will remain a journal file, which depending on the Journal mode,
will contain either the uncommitted or committed bits (I see you tried
both WAL and DELETE, so I assume the difference doesn't matter).
Rolling back a hot journal can take some time, granted, I haven't seen
many >60s time-frames, but I have seen several seconds, which means it
might be the case[*]. And if you limit the time to n where the journal
rollback cannot complete by n time, then every time you start the
program, you will probably have the hot journal again, and the rollback
starting again, which could explain the phenomenon where once triggered,
the time-out occurs constantly on re-starts - although, a computer
re-start shouldn't fix this (unless perhaps a cache-holdup is
responsible for slow IO before a restart or such, but that's just
guesswork).

Lastly, version 3.10 is ancient. You may well have researched that
change-logs for fixes, but I don't think this is a bug in SQLite, hence
not seeing any fixes. If the rollback is allowed enough time to
complete, it should get fixed, that said, if you DID update SQLite, the
simple fact that it runs probably around 50% faster than it used to back
at 3.10[**] might already have saved your system (in that the time-out
may have sufficed on a faster DB Engine).

[*] -  I'm also not 100% that the open_v2 thing itself does the
detecting & rollback of hot journals, but I think so.
[**] - I'm just ball-park guessing here, I think Richard posted some
time-lines at some point with speed increases over versions, but I don't
have it handy.


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users