Re: Thinking about a way to extend the number of writers in WAL mode


Re: Thinking about a way to extend the number of writers in WAL mode

Simon Slavin-3


On 4 Aug 2017, at 11:43am, Luc DAVID <[hidden email]> wrote:

> sqlite was not designed for this kind of access but It would be great to have a higher level of concurrency

The problem with these things is that you have SQLite trying to read the minds of the programmer and the user.  Consider two operations performed by different computers at the same time:

UPDATE contacts SET phone = REPLACE(phone, '444', '555')

INSERT INTO contacts VALUES ('Charles', '444 1234')

The first one is done because an entire phone exchange got renumbered to make way for a new exchange.  The second is a new contact.  But the two operations were in overlapping transactions.  When they’re both finished, should the new contact’s number contain '444' or '555'?

Here’s another scenario.  Suppose you have an invoice file so you can invoice those contacts, and the key of the invoice file is the rowid of the contact file.  Two users each want to create an invoice for a new contact at the same time.  The software running on their computers needs to insert the new contact, then find the rowid of the new contact so it can create an invoice for it.  How can you arrange this when the transactions can overlap?
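
With today’s single-writer SQLite the usual answer is to do both steps inside one immediate transaction.  A minimal C sketch (table and column names are illustrative):

    #include <sqlite3.h>

    /* BEGIN IMMEDIATE takes the write lock up front, so the rowid we
       read back below cannot belong to another connection's insert. */
    int add_contact_with_invoice(sqlite3 *db) {
        sqlite3_exec(db, "BEGIN IMMEDIATE", 0, 0, 0);
        sqlite3_exec(db,
            "INSERT INTO contacts(name, phone) VALUES ('Charles','444 1234')",
            0, 0, 0);
        sqlite3_int64 contact_id = sqlite3_last_insert_rowid(db);
        char *sql = sqlite3_mprintf(
            "INSERT INTO invoices(contact_rowid) VALUES (%lld)", contact_id);
        int rc = sqlite3_exec(db, sql, 0, 0, 0);
        sqlite3_free(sql);
        sqlite3_exec(db, rc == SQLITE_OK ? "COMMIT" : "ROLLBACK", 0, 0, 0);
        return rc;
    }

That works precisely because the write lock serialises the two users; the open question is what guarantee remains once their transactions may interleave.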

The problem you’re trying to fix is one of the big problems with distributed databases.  Nobody has found a good solution for it yet.

Simon.

Re: Thinking about a way to extend the number of writers in WAL mode

Jens Alfke-2

> On Aug 4, 2017, at 11:28 AM, Nico Williams <[hidden email]> wrote:
>
> Imagine a mode where there is only a WAL, and to checkpoint is to write
> a new WAL with only live contents and... rename(2) into place.

What you’re describing is exactly how CouchDB’s storage engine works, as well as descendants like Couchbase Server’s CouchStore and ForestDB. (Note: I work for Couchbase.)

Efficient lookups in a file like this require the existence of a bunch of extraneous metadata like interior B-tree nodes. This metadata changes all the time as records are written*, so a lot of it has to be written out too along with every transaction, resulting in substantial write amplification.
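
To put rough, purely illustrative numbers on it: with 4 KB pages and a tree three levels deep, rewriting one 100-byte record touches the leaf, its parent, and the root, so roughly 12 KB hits the disk for 100 bytes of payload, a write amplification of over 100x before any compaction I/O.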

The other big drawback is that compaction (the checkpoint step you describe) is very expensive in terms of I/O. I’ve known of CouchDB systems that took many hours to compact their databases, and since every write that occurs during a compaction has to be replayed onto the new file after the copy before compaction completes, one can get into a state where a busy database either never actually finishes compacting, or has to temporarily block all writers just so it can get the damn job done without interruption. (It’s a similar problem to GC thrash.)

We’ve also seen that, on low-end hardware like mobile devices, I/O bandwidth is limited enough that a running compaction can really harm the responsiveness of the _entire OS_, as well as cause significant battery drain.

—Jens

* Modifying/rewriting a single record requires rewriting the leaf node that points to it, which requires rewriting the parent node that points to the leaf, and this ripples all the way up to the root node.

Re: Thinking about a way to extend the number of writers in WAL mode

Simon Slavin-3


On 8 Aug 2017, at 6:30pm, Jens Alfke <[hidden email]> wrote:

> We’ve also seen that, on low-end hardware like mobile devices, I/O bandwidth is limited enough that a running compaction can really harm the responsiveness of the _entire OS_, as well as cause significant battery drain.

Yes.  I cannot stress enough that you don’t need VACUUM for efficient storage.  It locks the entire database, it monopolises access to storage, and it does many writes to storage, which means it’ll wear through Flash write cycles.  What used to be the great advantage of VACUUM, defragmentation, was useful for hard disks but gives no advantage on Flash storage.

The only advantage is saving file space immediately after deleting data or indexes.  You might want VACUUM if you delete data and then take a copy of your database for backup.  But most SQLite databases are small enough that this doesn’t matter.
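
A sketch of that one legitimate pattern (the table name and retention rule are illustrative):

    #include <sqlite3.h>

    /* Shrink the file after a bulk delete, then close it so a plain
       file copy makes a consistent backup. */
    int shrink_before_backup(const char *path) {
        sqlite3 *db;
        if (sqlite3_open(path, &db) != SQLITE_OK) return 1;
        sqlite3_exec(db, "DELETE FROM invoices WHERE paid = 1", 0, 0, 0);
        sqlite3_exec(db, "VACUUM", 0, 0, 0);  /* rewrites the whole file */
        sqlite3_close(db);                    /* close before copying */
        return 0;
    }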

Simon.

Re: Thinking about a way to extend the number of writers in WAL mode

Nico Williams
In reply to this post by Jens Alfke-2
On Tue, Aug 08, 2017 at 10:30:33AM -0700, Jens Alfke wrote:
> > On Aug 4, 2017, at 11:28 AM, Nico Williams <[hidden email]> wrote:
> > Imagine a mode where there is only a WAL, and to checkpoint is to write
> > a new WAL with only live contents and... rename(2) into place.
>
> What you’re describing is exactly how CouchDB’s storage engine works,
> as well as descendants like Couchbase Server’s CouchStore and
> ForestDB. (Note: I work for Couchbase.)

Also how ZFS works.

> Efficient lookups in a file like this require the existence of a bunch
> of extraneous metadata like interior B-tree nodes. This metadata
> changes all the time as records are written*, so a lot of it has to be
> written out too along with every transaction, resulting in substantial
> write amplification.

Yes.  It's called write amplification.  ZFS deals with this by first
committing to an intent log (the ZIL) without writing all those interior
nodes, then eventually writing a proper transaction.

One could do the same sort of thing for a single-file DB: write a number
of transactions as intents, then once in a while "checkpoint" them by
writing a b-tree transaction that includes all those updates.  For
readers this means always processing all intent log entries since the
last b-tree-updating transaction.
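
A hedged sketch of the reader side of that, assuming an invented record format (a one-byte tag and a little-endian 4-byte payload length):

    #include <stdio.h>
    #include <stdint.h>

    enum { REC_INTENT = 1, REC_BTREE_CHECKPOINT = 2 };  /* invented tags */

    /* One pass over the log: remember the offset just past the last
       b-tree-updating transaction.  A reader then replays only the
       intent records after that point on top of the checkpointed tree. */
    long find_replay_start(FILE *log) {
        long replay_from = 0;
        uint8_t tag, lenbuf[4];
        while (fread(&tag, 1, 1, log) == 1 && fread(lenbuf, 1, 4, log) == 4) {
            uint32_t len = lenbuf[0] | lenbuf[1] << 8
                         | lenbuf[2] << 16 | (uint32_t)lenbuf[3] << 24;
            if (fseek(log, (long)len, SEEK_CUR) != 0) break;  /* skip payload */
            if (tag == REC_BTREE_CHECKPOINT)
                replay_from = ftell(log);
        }
        return replay_from;
    }

The trade-off is visible right in the loop: every reader pays a scan over everything appended since the last checkpoint.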

LSMs are... a lot like this, IIUC.  Are they not?

> The other big drawback is that compaction (the checkpoint step you
> describe) is very expensive in terms of I/O. I’ve known of CouchDB
> systems that took many hours to compact their databases, and since
> every write that occurs during a compaction has to be replayed onto
> the new file after the copy before compaction completes, one can get
> into a state where a busy database either never actually finishes
> compacting, or has to temporarily block all writers just so it can get
> the damn job done without interruption. (It’s a similar problem to GC
> thrash.)

There are definitely trade-offs.  This is already the case for SQLite3's
WAL: you have to checkpoint often, but when you do you lose read
concurrency.  When you value read concurrency highly you might tune the
WAL max size up to reduce checkpoint frequency, and now each checkpoint
takes longer.
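
Those knobs exist in SQLite today; an illustrative setting (the 10000 is arbitrary):

    #include <sqlite3.h>

    void favor_read_concurrency(sqlite3 *db) {
        /* Let the WAL grow well past the default 1000 pages before an
           automatic checkpoint kicks in... */
        sqlite3_exec(db, "PRAGMA wal_autocheckpoint=10000", 0, 0, 0);
        /* ...and instead checkpoint explicitly at a quiet moment,
           truncating the (now large) WAL back to zero. */
        int nLog = 0, nCkpt = 0;
        sqlite3_wal_checkpoint_v2(db, "main", SQLITE_CHECKPOINT_TRUNCATE,
                                  &nLog, &nCkpt);
    }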

Another thing that can be done is to write to the "next" DB as soon as
possible, possibly synchronously with writes to the "current" DB.  Then
the checkpoint process is very simple: write the "close" TX to the
"current" DB, rename the "next" into place, create the next "next".

> We’ve also seen that, on low-end hardware like mobile devices, I/O
> bandwidth is limited enough that a running compaction can really harm
> the responsiveness of the _entire OS_, as well as cause significant
> battery drain.

Yes.  But there you don't really need read concurrency, so SQLite3
without a WAL (or with a WAL and frequent checkpointing) will suffice.

> * Modifying/rewriting a single record requires rewriting the leaf node
> that points to it, which requires rewriting the parent node that
> points to the leaf, and this ripples all the way up to the root node.

I'm well aware :)  Others on the list might not be, naturally.

One of the very nice features of an append-only single-file DB file
format is that you can just "tail -f" (well, a binary tail -f) to
replicate.  If you add in the intent log concept, it's not so bad in
terms of I/O, but you slow down readers somewhat.
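
Such a "binary tail -f" is only a few lines.  A sketch, which ignores the checkpoint renames a real replicator would have to detect:

    #include <fcntl.h>
    #include <unistd.h>

    /* Follow an append-only DB file and forward new bytes to a replica.
       A real version would fsync, batch, and reopen the file when a
       checkpoint renames a new one into place. */
    void tail_replicate(const char *path, int replica_fd) {
        char buf[65536];
        int fd = open(path, O_RDONLY);
        if (fd < 0) return;
        off_t pos = 0;                        /* replicate from the start */
        for (;;) {
            ssize_t n = pread(fd, buf, sizeof buf, pos);
            if (n > 0) { write(replica_fd, buf, (size_t)n); pos += n; }
            else if (n == 0) usleep(100000);  /* caught up; poll again */
            else break;                       /* read error */
        }
        close(fd);
    }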

Another very nice feature of an append-only single-file DB file is high
read concurrency.  And yet another is fast recovery (throw away a
truncated transaction and "checkpoint").

Nico