How to achieve fastest possible write performance for a strange and limited case

How to achieve fastest possible write performance for a strange and limited case

Jeff Archer-3
I have previously made an apparently bad assumption about this, so now I
would like to go back to the beginning of the problem and ask the most
basic question first, without any preconceived ideas.

This use case is from an image processing application.  I have a large
amount of intermediate data (way exceeding physical memory on my 24GB
machine).  So, I need to store it temporarily on disk until getting to the
next phase of processing.  I am planning to use a large SSD dedicated to
holding this temporary data.  I do not need any recoverability in case of
hardware, power or other failure.  Each item to be stored is 9 DWORDs,
4 doubles and 2 variable-sized BLOBs, which are images.

I could write directly to a file myself.  But I would need to provide some
minimal indexing, some amount of housekeeping to manage variable
sized BLOBS and some minimal synchronization so that multiple instances of
the same application could operate simultaneously on a single set of data.

So, then I thought that SQLite could manage these things nicely for me, so
that I don't have to write and debug indexing and housekeeping code that
already exists in SQLite.

So, the question is:  What is the way to get the fastest possible performance
from SQLite when I am willing to give up all recoverability guarantees?
Or, is it simply that I should just write directly to a file myself?
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: How to achieve fastest possible write performance for a strange and limited case

Ibrahim A
On 29.03.2013 20:17, Jeff Archer wrote:

> [...]
Suggestion: put the fixed, small-sized data into an SQLite database.  You
won't be searching inside the blobs with the database engine, and the
amount of data you have to process is large, so to make it fast you should
write the image data into files.  The other data that is necessary for
processing (ordering, indexing, searching, comparison) is best stored in an
SQLite database.

To improve access speed for your images, use full pages: pad smaller images
out to the next page boundary (for example 4K, 8K ...), and split long
files into smaller, sequentially numbered clusters (16 to 64 MB).  This
makes OS file operations faster, because the block index has to be cached
while opening and processing a file; the positions can be indexed in
SQLite.

I have a similar application for vectorized digitization of old
handwritten scripts, and I use a database for the searchable information
while using external files (split as described) for raster images and
vector files.  SQLite would be too slow for blobs like the ones you need;
put them outside, but keep the indexes inside.  Another advantage of this
approach is that you can process many binary files simultaneously, whereas
by putting them inside a database like SQLite you have only one writer.

Using transactions makes inserting data faster, especially when you have
indexes.  Also, try to create your indexes after fully inserting your
data, because that makes the process faster.



Re: How to achieve fastest possible write performance for a strange and limited case

Simon Slavin-3
In reply to this post by Jeff Archer-3

On 29 Mar 2013, at 7:17pm, Jeff Archer <[hidden email]> wrote:

> This use case is from an image processing application.  I have a large
> amount of intermediate data (way exceeds physical memory on my 24GB
> machine).  So, I need to store it temporarily on disk until getting to next
> phase of processing.  I am planning to use a large SSD dedicated to holding
> this temporary data.  I do not need any recoverability in case of hardware,
> power or other failure.   Each item to be stored is 9 DWORDs, 4 doubles and
> 2 variable sized BLOBS which are images.
>
> I could write directly to a file myself.

That's what I thought: just blat it straight out to a text file in append mode.  Organise it later.  But if you don't want to do that ...

> But I would need to provide some
> minimal indexing, some amount of housekeeping to manage variable
> sized BLOBS and some minimal synchronization so that multiple instances of
> the same application could operate simultaneously on a single set of data.

Locking between different simultaneous apps is going to kill you.  Do you really need, at this stage, one app to access data written by different writers ?  If not, have each app write to a different database.  Then use a different app running on a different computer (or at least in a different process) to merge the data into one big file, either simultaneously or after data-collection is finished.

> So, then I though that SQLite could manage these things nicely for me so
> that I don't have to write and debug indexing and housekeeping code that
> already exists in SQLite.
>
> So, question is:  What is the way to get the fastest possible performance
> from SQLite when I am willing to give up all recoverability guarantees?
> Or, is it simple that I should just write directly to file myself?

<http://www.sqlite.org/pragma.html>

Journalling off and Synchronous off pragmas.  And I note you're asking for 'fastest' whereas you really just want 'fast enough'.

However I note that you seem to have >24Gig of data to write.  At that size, the speed of SQLite isn't your limiting factor.  Instead you have to worry about the speed of your storage medium.  With 4K sectors, writing 24Gig of data means writing roughly 6 million sectors.  If you're writing to rotating hard disks that means you'll be waiting for millions of sectors to be rotated into the right place.  Even at 10,000 RPM that's a lot of waiting.  Your biggest speed increase isn't going to come from clever programming, it's going to come from moving to SSD.

Simon.

Re: How to achieve fastest possible write performance for a strange and limited case

Pavel Ivanov-2
In reply to this post by Jeff Archer-3
On Fri, Mar 29, 2013 at 12:17 PM, Jeff Archer
<[hidden email]> wrote:

> [...]
> So, question is:  What is the way to get the fastest possible performance
> from SQLite when I am willing to give up all recoverability guarantees?

Use
pragma journal_mode = off;
pragma synchronous = off;
pragma locking_mode = exclusive;

In addition to that you may issue BEGIN statement at the beginning of
the application and never COMMIT.
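Those settings translate directly to a minimal sketch using Python's stdlib sqlite3 module (which wraps the same engine); the table schema and row payload here are made up just to have something to insert:

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "scratch.db")  # illustrative path
conn = sqlite3.connect(db_path, isolation_level=None)  # manage transactions manually

conn.execute("PRAGMA journal_mode = OFF")        # no rollback journal at all
conn.execute("PRAGMA synchronous = OFF")         # don't wait for the OS to flush
conn.execute("PRAGMA locking_mode = EXCLUSIVE")  # take the file lock once, keep it

conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, img BLOB)")

conn.execute("BEGIN")  # one long transaction, per the suggestion above
for i in range(1000):
    conn.execute("INSERT INTO items (img) VALUES (?)", (b"\x00" * 64,))
# With disposable data you could skip COMMIT entirely, as suggested;
# committing once at the end is the tamer variant sketched here.
conn.execute("COMMIT")
```

Note that with locking_mode = exclusive no other process can touch the database while this connection is open, so this trades away the multi-instance requirement for speed.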


Pavel

Re: How to achieve fastest possible write performance for a strange and limited case

Dominique Pellé
Pavel Ivanov <[hidden email]> wrote:

> On Fri, Mar 29, 2013 at 12:17 PM, Jeff Archer
> <[hidden email]> wrote:
>> [...]
>
> Use
> pragma journal_mode = off;
> pragma synchronous = off;
> pragma locking_mode = exclusive;
>
> In addition to that you may issue BEGIN statement at the beginning of
> the application and never COMMIT.


Yes that should be the fastest.

In addition:

* make sure that you prepare your INSERT query just once
  before the insert loop, and use bind/step/reset in the
  insertion loop.

* avoid duplicating the implicit rowid index with another
  primary key by using INTEGER PRIMARY KEY
  and not something like INT PRIMARY KEY

* if you have indexes, make sure that you create them
  after all the inserts, rather than before.

* if you have several tables to populate, you can consider
  storing them in different databases and populating them
  in parallel in different processes or threads. Later
  you can ATTACH all of them to see them as a single
  database.
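The first three points above can be sketched together; a Python sqlite3 sketch with an illustrative schema, where executemany stands in for the C-level prepare-once/bind/step/reset loop:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory only for illustration

# INTEGER PRIMARY KEY aliases the implicit rowid, so no second index is built;
# INT PRIMARY KEY would instead create a separate index over the same values.
conn.execute("""CREATE TABLE item (
    id  INTEGER PRIMARY KEY,
    tag INTEGER,
    img BLOB)""")

rows = ((i % 7, b"pixels") for i in range(10_000))
with conn:  # one transaction wrapping the whole batch
    # executemany reuses a single prepared statement, binding each row in turn
    conn.executemany("INSERT INTO item (tag, img) VALUES (?, ?)", rows)

# build secondary indexes only after the bulk load, per the advice above
conn.execute("CREATE INDEX idx_item_tag ON item(tag)")
```

For the fourth point, separately populated databases can later be combined with something like ATTACH DATABASE 'part1.db' AS p1 followed by INSERT INTO main.item SELECT * FROM p1.item.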

Regards
Dominique

Re: How to achieve fastest possible write performance for a strange and limited case

Michael Black
In reply to this post by Jeff Archer-3
I think many people would tell you not to store your images in your
database.
Just store a filepath to them.
That will speed things up quite a bit and even possibly prevent having to
use an SSD.

With the filepath your processing apps can use file locking too if you need
it.


-----Original Message-----
From: [hidden email]
[mailto:[hidden email]] On Behalf Of Jeff Archer
Sent: Friday, March 29, 2013 2:18 PM
To: SQLite-user.org
Subject: [sqlite] How to achieve fastest possible write performance for a
strange and limited case

[...]


Re: How to achieve fastest possible write performance for a strange and limited case

Richard Hipp-3
On Fri, Mar 29, 2013 at 4:41 PM, Michael Black <[hidden email]> wrote:

> I think many people would tell you not to store your images in your
> database.
> Just store a filepath to them.
> That will speed things up quite a bit and even possibly prevent having to
> use an SSD.
>
> With the filepath your processing apps can use file locking too if you need
> it.
>

Lots of people think this.  But experiments suggest otherwise, depending on
the size of your images.

For BLOBs smaller than about 100KB, it is actually faster to read them out
of an SQLite database than it is out of a file.  Presumably the extra
overhead of the open/close is what makes the difference.  For BLOBs larger
than 100KB it is faster to store them in a file.  That's for reading - I've
never run the experiment for writing.  Note also that the 100KB is highly
operating-system and file-system and storage-media dependent and so your
mileage may vary.
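A rough write-and-read version of that experiment is easy to run with Python's stdlib; this is only a sketch, and as noted the crossover point will vary with OS, filesystem, and medium:

```python
import os
import sqlite3
import tempfile
import time

BLOB = os.urandom(50 * 1024)  # one 50 KB "image", below the ~100 KB crossover
N = 200
tmp = tempfile.mkdtemp()

# Variant 1: blobs stored inside an SQLite table
db = sqlite3.connect(os.path.join(tmp, "blobs.db"))
db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, img BLOB)")
with db:
    db.executemany("INSERT INTO t (img) VALUES (?)", [(BLOB,)] * N)

t0 = time.perf_counter()
rows = [r[0] for r in db.execute("SELECT img FROM t ORDER BY id")]
db_time = time.perf_counter() - t0

# Variant 2: one file per blob, opened and closed per read
for i in range(N):
    with open(os.path.join(tmp, f"img{i}.bin"), "wb") as f:
        f.write(BLOB)

t0 = time.perf_counter()
files = []
for i in range(N):
    with open(os.path.join(tmp, f"img{i}.bin"), "rb") as f:
        files.append(f.read())
file_time = time.perf_counter() - t0

print(f"db read: {db_time:.4f}s   file read: {file_time:.4f}s")
```

No expected winner is asserted here; the point is only that the per-file open/close overhead in variant 2 is what the 100KB rule of thumb is measuring.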

--
D. Richard Hipp
[hidden email]

Re: How to achieve fastest possible write performance for a strange and limited case

Jeff Archer-3
On Fri, Mar 29, 2013 at 4:46 PM, Richard Hipp <[hidden email]> wrote:

> On Fri, Mar 29, 2013 at 4:41 PM, Michael Black <[hidden email]> wrote:
>> [...]
>
> Lots of people think this.  But experiments suggest otherwise, depending on
> the size of your images.
>
> For BLOBs smaller than about 100KB, it is actually faster to read them out
> of an SQLite database than it is out of a file.  Presumably the extra
> overhead of the open/close is what makes the difference.  For BLOBs larger
> than 100KB it is faster to store them in a file.  That's for reading - I've
> never run the experiment for writing.  Note also that the 100KB is highly
> operating-system and file-system and storage-media dependent and so your
> mileage may vary.
>

FWIW.  My tests are definitely indicating that BLOBs in database are
significantly faster.

Re: How to achieve fastest possible write performance for a strange and limited case

David King
In case you haven't seen this: http://www.sqlite.org/intern-v-extern-blob.html 


On Friday, 29 March, 2013 at 13:52, Jeff Archer wrote:

> [...]
> FWIW. My tests are definitely indicating that BLOBs in database are
> significantly faster.



Re: How to achieve fastest possible write performance for a strange and limited case

Ibrahim A
Those measurements assume that you store each blob in a separate file,
which is why raw file access appears slower for smaller blob sizes.

If you use external blob storage, do it in raw clusters as I suggested in
a previous post (size limit 32/64 MB) and store your blobs on page
boundaries (page size 4K, 8K, and so on).  This will always be faster,
because instead of B-tree pages, which are always fragmented, you have
sequentially stored image data.

Don't use file sizes larger than 32/64 MB, because the prefetch cache of
modern hard disks can read the whole file even if you only ask for a part
of it, and fopen gets slower when you open large files because the page
list has to be read into an internal library buffer.

The given link is only true if you store each blob in a separate file.

For my similar project I use the raw-cluster model, and it is several
times faster than storing image data in a B-tree-organized database file.

Re: How to achieve fastest possible write performance for a strange and limited case

Dominique Devienne
In reply to this post by Jeff Archer-3
On Fri, Mar 29, 2013 at 8:17 PM, Jeff Archer <
[hidden email]> wrote:

> I could write directly to a file myself.  But I would need to provide some
> minimal indexing, some amount of housekeeping to manage variable
> sized BLOBS and some minimal synchronization so that multiple instances of
> the same application could operate simultaneously on a single set of data.
>

Have you looked at HDF5? Aside from synchronization, HDF5 gives you the
indexing and book-keeping. It was designed for large "structured" binary
data. By structured I mean storing n-D arrays of "structs", which can be
"chunked" or "tiled" and compressed as desired, transparently, and
accessed in whole or in parts (hyperslabs) as desired, optimizing the access
for you. An HDF5 file is basically like a file-system, with folders, where
the "files" are "datasets", and HDF5 knows what's inside each dataset (say
a 3D array of {x,y,z} doubles in a Point struct). It's structured storage
for binary scientific data. You can associate "attributes" (key/value
pairs, with the value's type, primitive or aggregate, known to HDF5 just as
for datasets) with datasets, which are kept in B-trees, and the whole thing
is "paged" with a B-tree of pages and a page cache, similar to sqlite3.
HDF5 is fast. I've often wished for the ability to "type" my sqlite blobs
similar to how everything is typed in HDF5. You can in fact stuff HDF5
inside a blob, since HDF5 has a VFL (again, just like sqlite3; the
parallels are many), but you can't do much with that in-blob data using
sqlite SQL, and since custom functions can only return scalars, there's
little point. Might as well store the HDF5 files on disk next to the
sqlite db file. But I guess I'm digressing here :)

We've used a mix of SQLite3 and HDF5, which worked well for us, but we
didn't have to worry about concurrent access though. --DD

Re: How to achieve fastest possible write performance for a strange and limited case

Eduardo Morras-2
In reply to this post by Jeff Archer-3
On Fri, 29 Mar 2013 15:17:52 -0400
Jeff Archer <[hidden email]> wrote:

> [...]
> So, question is:  What is the way to get the fastest possible performance
> from SQLite when I am willing to give up all recoverability guarantees?
> Or, is it simple that I should just write directly to file myself?

Piping through gzip -6 or xz -2 will minimize the bytes to write. If you are working with 5D images, xz (which uses the LZMA compression from 7-Zip) will do the best.

For processing you do: zcat file | processing_application, or xzcat file | processing_application.
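The same idea can be used from inside the application; a sketch with Python's stdlib gzip at compression level 6 (the level suggested above), where the payload and the file path are made up for illustration:

```python
import gzip
import os
import tempfile

# Stand-in for one intermediate record: 9 DWORDs + 4 doubles + image bytes
record = b"\x00" * (9 * 4) + b"\x00" * (4 * 8) + b"fake-image-bytes"

path = os.path.join(tempfile.mkdtemp(), "intermediate.bin.gz")

# Write compressed intermediate data (equivalent of piping through gzip -6)
with gzip.open(path, "wb", compresslevel=6) as f:
    f.write(record)

# Read it back for the next processing phase (equivalent of zcat)
with gzip.open(path, "rb") as f:
    restored = f.read()

assert restored == record
```

Whether this helps depends on how compressible the images are; already-compressed image formats will gain little.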

---   ---
Eduardo Morras <[hidden email]>

Re: How to achieve fastest possible write performance for a strange and limited case

Rob Sciuk
In reply to this post by Jeff Archer-3

I'm not sure exactly what you're asking here, but if the question is
whether to use database blobs vs files, then you might be interested in
this technical report from Microsoft:

  arxiv.org/ftp/cs/papers/0701/0701168.pdf


--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=
Robert S. Sciuk http://www.controlq.com                259 Simcoe St. S.
Control-Q Research tel: 905.706.1354 Oshawa, Ont.
[hidden email] Canada, L1H 4H3

Re: How to achieve fastest possible write performance for a strange and limited case

Jeff Archer-3
Thanks for everyone's help and thoughts on this issue.

My findings on Windows 7 Pro 64, using a PCI-based SSD, are that for my
smallish images the BLOBs were faster than individual files, basically in
line with the table that someone linked to earlier in this thread.

After many experiments, with many variations on grouping writes into
transactions, the single writer proves to be the most limiting factor.