Theoretical write performance for low-throughput devices

Theoretical write performance for low-throughput devices

Jonathan Brandmeyer
Or, how many times is each page written by SQLite for an insert-heavy
test?  The answer appears to be "4", but I can only account for two of
those four.

I'm working on an embedded system that uses a log-structured
filesystem on raw NAND flash.  This is not your typical workstation's
managed flash (SATA/NVMe), or portable managed flash (SD/USB).  It's a
bare-nekkid ONFI-speaking chip.  All reads and writes are one 2kB page
at a time.  There is no readahead, and no write buffering by the
driver or filesystem for page-sized writes.

We got the following performance numbers out of the flash storage:

Streaming reads through the filesystem: 7.5 MB/s.
Streaming writes through the filesystem: 5.4 MB/s.
Single insert performance through SQLite: 0.2 MB/s.
Bulk insert performance through SQLite: 1.3 MB/s, asymptotic for very
large transactions.

I do expect the single-insert performance to be much lower than the
bulk insert performance.

We are using the WAL, and this benchmark includes the cost of a
checkpoint at the end.  Futzing with the WAL autocheckpoint size has
little impact for smaller WAL sizes.  My working assumption is that
using the WAL for an insertion-heavy workload would consume roughly
half of my available throughput due to checkpoint writes.  Indeed, if
the autocheckpoint level is raised high enough that the entire
benchmark fits in the WAL, then I do observe that the write throughput
asymptotically approaches 2.6 MB/s instead of 1.3 MB/s.
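
For reference, the knobs involved can be exercised from any SQLite binding.  A minimal sketch in Python's built-in sqlite3 module (the file path and threshold are arbitrary stand-ins for the real benchmark setup):

```python
import os
import sqlite3
import tempfile

# Stand-in for the real benchmark database file.
path = os.path.join(tempfile.mkdtemp(), "bench.db")
db = sqlite3.connect(path)

# Switch to WAL journaling; the pragma returns the resulting mode.
mode = db.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# Raise the autocheckpoint threshold (in pages) high enough that an
# entire benchmark run fits in the WAL before any checkpoint occurs.
db.execute("PRAGMA wal_autocheckpoint=100000")

# ... run the inserts here ...

# Checkpoint explicitly at the end so its cost is included in the timing.
db.execute("PRAGMA wal_checkpoint(TRUNCATE)")
db.close()
print(mode)  # "wal"
```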

That leaves one more factor of two somewhere.

The table in question has the schema:

```
CREATE TABLE IF NOT EXISTS `chunks` (
    `target_id`  INTEGER NOT NULL,
    `chunk_num`  INTEGER NOT NULL,
    `chunk_blob` BLOB NOT NULL,
    PRIMARY KEY(`target_id`, `chunk_num`)
);
```
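
A bulk-insert loop against this schema looks roughly like the following (Python's built-in sqlite3 used as a stand-in for the real embedded code; the blob size and row count are illustrative):

```python
import os
import sqlite3
import tempfile

db = sqlite3.connect(os.path.join(tempfile.mkdtemp(), "chunks.db"))
db.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        target_id  INTEGER NOT NULL,
        chunk_num  INTEGER NOT NULL,
        chunk_blob BLOB NOT NULL,
        PRIMARY KEY(target_id, chunk_num)
    )
""")

blob = bytes(1000)  # a little under 1 kB, as in the workload described
with db:  # one transaction per batch, as in the bulk-insert benchmark
    db.executemany(
        "INSERT INTO chunks VALUES (?, ?, ?)",
        ((1, n, blob) for n in range(1000)),
    )

count = db.execute("SELECT count(*) FROM chunks").fetchone()[0]
print(count)  # 1000
```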

Other factors that might help understand our workload:
Blobs are a little less than 1 kB each, and we're using the default DB
page size (4 kB).  So I would expect that SQLite would pack about 3
rows per page, leaving some extra for primary keys, field delimiters
and other metadata.  I understand that the composite primary key
implies an index table that goes with the blobs, which implies some
inherent write amplification to account for the index.  Still, my
expectation is that the write throughput added by the index should be
close to the size of the key columns, not the blob columns.  So 2x
still seems too high.
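
A back-of-envelope check of that expectation (the per-row overhead figures below are rough assumptions, not measured values):

```python
# Rough estimate of the write amplification the index alone can add.
blob_bytes = 1000          # payload per row
table_overhead = 20        # assumed record header, varints, cell pointer
index_entry_bytes = 18     # assumed: two small integer keys plus overhead

row_bytes = blob_bytes + table_overhead
index_fraction = index_entry_bytes / row_bytes

# The index should add only a few percent of extra writes -
# nowhere near another factor of two.
print(round(index_fraction, 3))
```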

Any other ideas?

Thanks,
--
Jonathan Brandmeyer
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: Theoretical write performance for low-throughput devices

Richard Hipp-3
On 10/21/19, Jonathan Brandmeyer <[hidden email]> wrote:
> Or, how many times is each page written by SQLite for an insert-heavy
> test?  The answer appears to be "4", but I can only account for two of
> those four.

Are you measuring writes at the OS-interface layer, or writes at the
hardware layer?  SQLite should issue no more than 2x writes in WAL
mode, and less if the same page is modified more than once.  But who
knows how the OS might be multiplying this to accomplish its own
filesystem consistency.

Consider compiling the
https://www.sqlite.org/src/file/ext/misc/vfslog.c extension into
SQLite on a workstation and running your code there, to get more
details about everything that SQLite is doing with storage.

--
D. Richard Hipp
[hidden email]

Re: Theoretical write performance for low-throughput devices

Jonathan Brandmeyer
On Mon, Oct 21, 2019 at 10:31 AM Richard Hipp <[hidden email]> wrote:

>
> On 10/21/19, Jonathan Brandmeyer <[hidden email]> wrote:
> > Or, how many times is each page written by SQLite for an insert-heavy
> > test?  The answer appears to be "4", but I can only account for two of
> > those four.
>
> Are you measuring writes at the OS-interface layer, or writes at the
> hardware layer?  SQLite should issue no more than 2x writes in WAL
> mode, and less if the same page is modified more than once.  But who
> knows how the OS might be multiplying this to accomplish its own
> filesystem consistency.
>
> Consider compiling the
> https://www.sqlite.org/src/file/ext/misc/vfslog.c extension into
> SQLite on a workstation and running your code there, to get more
> details about everything that SQLite is doing with storage.

Thanks for the pointer.  I didn't do this exactly, but instead
instrumented SQLite's unixWrite() and my lowest-level NAND driver's
block write function on the target hardware to capture summary
counters of both events.  The filesystem's own consistency overhead
adds up to about 2% additional writes - not nearly enough to account
for the missing 2x.  The throughput implied by the low-level counters
is consistent with the benchmark results.

Thanks,
--
Jonathan Brandmeyer
PlanetiQ

Re: Theoretical write performance for low-throughput devices

Richard Hipp-3
On 10/21/19, Jonathan Brandmeyer <[hidden email]> wrote:

> On Mon, Oct 21, 2019 at 10:31 AM Richard Hipp <[hidden email]> wrote:
>>
>> On 10/21/19, Jonathan Brandmeyer <[hidden email]> wrote:
>> > Or, how many times is each page written by SQLite for an insert-heavy
>> > test?  The answer appears to be "4", but I can only account for two of
>> > those four.
>>
>> Are you measuring writes at the OS-interface layer, or writes at the
>> hardware layer?  SQLite should issue no more than 2x writes in WAL
>> mode, and less if the same page is modified more than once.  But who
>> knows how the OS might be multiplying this to accomplish its own
>> filesystem consistency.
>>
>> Consider compiling the
>> https://www.sqlite.org/src/file/ext/misc/vfslog.c extension into
>> SQLite on a workstation and running your code there, to get more
>> details about everything that SQLite is doing with storage.
>
> Thanks for the pointer.  I didn't do this exactly, but instead
> instrumented SQLite's unixWrite() and my lowest-level NAND driver's
> block write function on the target hardware to capture summary
> counters of both events.  The filesystem's own consistency overhead
> adds up to about 2% additional writes - not nearly enough to account
> for the missing 2x.  The throughput implied by the low-level counters
> is consistent with the benchmark results.
>

What happens if you set "PRAGMA synchronous=OFF"?  (I'm not suggesting
that you do this in production - it is just an experiment to try to
help figure out what is going on.)
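
The experiment itself is a one-liner from any binding; e.g. in Python (an in-memory database as a stand-in for the real file):

```python
import sqlite3

# Experiment only - synchronous=OFF trades durability for speed and is
# not suggested for production use.
db = sqlite3.connect(":memory:")  # stand-in for the real database file
db.execute("PRAGMA synchronous=OFF")
level = db.execute("PRAGMA synchronous").fetchone()[0]
print(level)  # 0 means OFF
```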

--
D. Richard Hipp
[hidden email]

Re: Theoretical write performance for low-throughput devices

Jonathan Brandmeyer
On Mon, Oct 21, 2019 at 5:00 PM Richard Hipp <[hidden email]> wrote:

>
> On 10/21/19, Jonathan Brandmeyer <[hidden email]> wrote:
> > On Mon, Oct 21, 2019 at 10:31 AM Richard Hipp <[hidden email]> wrote:
> >>
> >> On 10/21/19, Jonathan Brandmeyer <[hidden email]> wrote:
> >> > Or, how many times is each page written by SQLite for an insert-heavy
> >> > test?  The answer appears to be "4", but I can only account for two of
> >> > those four.
> >>
> >> Are you measuring writes at the OS-interface layer, or writes at the
> >> hardware layer?  SQLite should issue no more than 2x writes in WAL
> >> mode, and less if the same page is modified more than once.  But who
> >> knows how the OS might be multiplying this to accomplish its own
> >> filesystem consistency.
> >>
> >> Consider compiling the
> >> https://www.sqlite.org/src/file/ext/misc/vfslog.c extension into
> >> SQLite on a workstation and running your code there, to get more
> >> details about everything that SQLite is doing with storage.
> >
> > Thanks for the pointer.  I didn't do this exactly, but instead
> > instrumented SQLite's unixWrite() and my lowest-level NAND driver's
> > block write function on the target hardware to capture summary
> > counters of both events.  The filesystem's own consistency overhead
> > adds up to about 2% additional writes - not nearly enough to account
> > for the missing 2x.  The throughput implied by the low-level counters
> > is consistent with the benchmark results.
> >
>
> What happens if you set "PRAGMA synchronous=OFF".

No significant change.  The target filesystem only caches non-aligned
writes, so there usually isn't anything for it to do on fsync anyway.

> (I'm not suggesting
> that you do this in production - it is just an experiment to try to
> help figure out what is going on.)

Understood.

If it helps, we are doing a couple of other things that are dissimilar
to a typical workstation configuration.  We are using the built-in
"unix-none" VFS, the -DSQLITE_DEFAULT_WAL_SYNCHRONOUS=1 and
-DSQLITE_DEFAULT_LOCKING_MODE=1 compile-time options, along with a
design rule that there is at most one database connection per file,
and only one thread accesses that connection at a time.  The target
lacks mmap(), posix file locks, and multiple address spaces, so this
seemed like the right settings to use.

Thanks,
--
Jonathan Brandmeyer
PlanetiQ

Re: Theoretical write performance for low-throughput devices

Richard Hipp-3
On 10/21/19, Jonathan Brandmeyer <[hidden email]> wrote:
>
> No significant change.  The target filesystem only caches non-aligned
> writes, so there usually isn't anything for it to do on fsync anyway.
>

OK.  I don't have any more ideas at the moment, and without access to
your code and your platform, I can't really debug it.  But do know
that you should be seeing no more than 2x writes.  I suggest adding
instrumentation and trying to come up with a simplified test case.


--
D. Richard Hipp
[hidden email]

Re: Theoretical write performance for low-throughput devices

Rowan Worth-2
In reply to this post by Jonathan Brandmeyer
On Mon, 21 Oct 2019 at 23:28, Jonathan Brandmeyer <[hidden email]>
wrote:

> Or, how many times is each page written by SQLite for an insert-heavy
> test?  The answer appears to be "4", but I can only account for two of
> those four.
>
> I'm working on an embedded system that uses a log-structured
> filesystem on raw NAND flash.  This is not your typical workstation's
> managed flash (SATA/NVMe), or portable managed flash (SD/USB).  It's a
> bare-nekkid ONFI-speaking chip.  All reads and writes are one 2kB page
> at a time.  There is no readahead, and no write buffering by the
> driver or filesystem for page-sized writes.
>
> We got the following performance numbers out of the flash storage:
>
> Streaming reads through the filesystem: 7.5 MB/s.
> Streaming writes through the filesystem: 5.4 MB/s.
> Single insert performance through SQLite: 0.2 MB/s.
> Bulk insert performance through SQLite: 1.3 MB/s, asymptotic for very
> large transactions.
>

There's perhaps a measurement missing here: random-access/seeking writes
through the filesystem.  That doesn't sound like it should be a factor
given the technology involved, but it's more reflective of SQLite's
workload when updating the DB.

The smallest possible insert here modifies at least one DB page for the
table and one for the index, so that's 8 kB written to the journal and
8 kB to the main DB.  But bulk inserts should greatly reduce the impact
of index writes, as you said.
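
The arithmetic for that smallest insert, under the assumptions above (4 kB pages, one table page plus one index page):

```python
page_bytes = 4096
pages_touched = 2            # one table leaf page + one index page, minimum

journal_bytes = pages_touched * page_bytes   # logged before modification
db_bytes = pages_touched * page_bytes        # then written to the main DB
total_bytes = journal_bytes + db_bytes

print(total_bytes)  # 16384 bytes moved for a single tiny insert
```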

Hm, how does deletion/truncation perform on the target filesystem?

-Rowan

Re: Theoretical write performance for low-throughput devices

wmertens
In reply to this post by Jonathan Brandmeyer
On Mon, Oct 21, 2019 at 5:28 PM Jonathan Brandmeyer <
[hidden email]> wrote:

> I'm working on an embedded system that uses a log-structured
> filesystem on raw NAND flash.  This is not your typical workstation's
> managed flash (SATA/NVMe), or portable managed flash (SD/USB).  It's a
> bare-nekkid ONFI-speaking chip.  All reads and writes are one 2kB page
> at a time.  There is no readahead, and no write buffering by the
> driver or filesystem for page-sized writes.
>

This probably won't change a thing, but I wonder why you wouldn't set
the SQLite page size to 2 kB?  Hopefully this means lots of aligned
writes.

Wout.

Re: Theoretical write performance for low-throughput devices

Jonathan Brandmeyer
On Tue, Oct 22, 2019 at 2:03 AM Wout Mertens <[hidden email]> wrote:
>  This probably won't change a thing, but I wonder why you wouldn't set the
> sqlite page size to 2KB? Hopefully this means lots of aligned writes.

At one point, the row blobs were exactly 1024 bytes.  This isn't great
for 4 kB pages, but it is a particularly unfavorable row size for 2 kB
database pages.
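
For what it's worth, the page size can only be changed before the database is first populated (or via a subsequent VACUUM); a sketch:

```python
import os
import sqlite3
import tempfile

db = sqlite3.connect(os.path.join(tempfile.mkdtemp(), "t.db"))
# page_size takes effect only before the first table is created;
# afterwards it requires a VACUUM to apply.
db.execute("PRAGMA page_size=2048")
db.execute("CREATE TABLE t(x INTEGER)")
size = db.execute("PRAGMA page_size").fetchone()[0]
print(size)  # 2048
```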

--
Jonathan Brandmeyer

Re: Theoretical write performance for low-throughput devices

Jonathan Brandmeyer
In reply to this post by Richard Hipp-3
On Mon, Oct 21, 2019 at 5:47 PM Richard Hipp <[hidden email]> wrote:

>
> On 10/21/19, Jonathan Brandmeyer <[hidden email]> wrote:
> >
> > No significant change.  The target filesystem only caches non-aligned
> > writes, so there usually isn't anything for it to do on fsync anyway.
> >
>
> OK.  I don't have any more ideas at the moment, and without access to
> your code, and your platform, I can't really debug it.  But do please
> know that you should only be getting less than 2x writes.  I suggest
> adding instrumentation and trying to come up with a simplified test
> case.

Should the vfslog VFS be picking up writes to both the WAL and the
database proper?  I'm seeing a number of WRITEs in the log equal to
the final number of pages plus a small percentage.  For the case where
the WAL is much larger than the benchmark, the number of WRITEs in
the log is almost equal to the number of final pages in the database.

Thanks,
--
Jonathan Brandmeyer
PlanetiQ

Re: Theoretical write performance for low-throughput devices

Keith Medcalf
In reply to this post by Jonathan Brandmeyer
>and only one thread accesses that connection at a time.  The target
>lacks mmap(), posix file locks, and multiple address spaces, so this
>seemed like the right settings to use.

So what happens to the shm file?  Is it not normally just an ordinary file that is mmap'ed?

What would be the effect of not having mmap available?

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.




Re: Theoretical write performance for low-throughput devices

Keith Medcalf

OK, skip that.  locking_mode=EXCLUSIVE (1) takes care of that.

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users <[hidden email]> On
>Behalf Of Keith Medcalf
>Sent: Tuesday, 22 October, 2019 12:46
>To: SQLite mailing list <[hidden email]>
>Subject: Re: [sqlite] Theoretical write performance for low-throughput
>devices
>
>>and only one thread accesses that connection at a time.  The target
>>lacks mmap(), posix file locks, and multiple address spaces, so this
>>seemed like the right settings to use.
>
>So what happens to the shm file?  Is it not normally just an ordinary
>file that is mmap'ed?
>
>What would be the effect of not having mmap available?
>
>--
>The fact that there's a Highway to Hell but only a Stairway to Heaven
>says a lot about anticipated traffic volume.
>
>
>
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


