Questions on big blobs and blob I/O


Questions on big blobs and blob I/O

Jens Alfke-2
I’m thinking of transitioning from external to internal storage of large blobs. That is, currently I store these as individual files in a directory next to the SQLite database. But it sounds like today’s new improved SQLite is better at managing big blobs, and supports streaming reads; and it would simplify my code to not have to deal with all those files.
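
[A minimal sketch of that kind of streaming read, using SQLite's incremental blob I/O API; the table and column names here ("attachments", "body") and the helper function are hypothetical, not part of the original post:]

#include <stdio.h>
#include <sqlite3.h>

/* Sketch: read one attachment in fixed-size chunks rather than loading
** the whole blob into memory at once. */
static int stream_attachment(sqlite3 *db, sqlite3_int64 rowid, FILE *out) {
    sqlite3_blob *blob = NULL;
    int rc = sqlite3_blob_open(db, "main", "attachments", "body",
                               rowid, 0 /* read-only */, &blob);
    if (rc != SQLITE_OK) return rc;

    char buf[64 * 1024];
    int remaining = sqlite3_blob_bytes(blob);
    int offset = 0;
    while (remaining > 0 && rc == SQLITE_OK) {
        int n = remaining < (int)sizeof(buf) ? remaining : (int)sizeof(buf);
        rc = sqlite3_blob_read(blob, buf, n, offset);
        if (rc == SQLITE_OK && fwrite(buf, 1, (size_t)n, out) == (size_t)n) {
            offset += n;
            remaining -= n;
        } else if (rc == SQLITE_OK) {
            rc = SQLITE_IOERR;   /* short write to the output stream */
        }
    }
    sqlite3_blob_close(blob);
    return rc;
}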

(These blobs are created by a “document attachments” API of the library I work on. What’s in them is up to the developer. They’re often fairly small, e.g. a thumbnail JPEG, but they could be large media files like movies, in the tens or hundreds of megabytes. There’s no upper limit.)

I have three questions:

* Do big blobs cause any performance problems for the db as a whole, like fragmentation? In particular, is the blob data stored contiguously in the file?

* After a big blob is deleted, my understanding is that the db file won’t shrink, so the space isn’t reclaimed to the filesystem. Instead the freed pages in the db file will be reused later on. If I urgently need to shrink the file, do I have to run a full (expensive) vacuum? (The problem with vacuum is that it requires lots of free space to work, but blob deletions may be happening on behalf of the user, to free up storage when the [mobile] device is running out of free space, creating a catch-22…)

* Is there any way to avoid a sqlite3_blob handle being invalidated when the blob’s row is deleted (via a DELETE on another db handle)? I’d prefer to have behavior like file deletion in Unix, where open file handles keep working even after a file is deleted. I’m wondering whether I can accomplish this by opening a read transaction before opening the blob, then leaving it open until the blob is closed. Since I’m using WAL, will this db handle keep a snapshot of the time before the blob’s row was deleted, allowing the blob handle to keep working?

Thanks!

—Jens

Re: Questions on big blobs and blob I/O

Simon Slavin-3

On 29 Mar 2017, at 6:14pm, Jens Alfke <[hidden email]> wrote:

> * Do big blobs cause any performance problems for the db as a whole, like fragmentation? In particular, is the blob data stored contiguously in the file?

Blobs are stored in the same pages that other values are stored in.  If a blob is bigger than a page it has to be split, with the continuation stored on overflow pages.  So yes, this can lead to internal fragmentation of the database.

However, with modern storage systems fragmentation causes very little increase in access time.  Access to 'sectors' of solid-state storage is not related to the order in which the sectors are stored.  And even on rotating storage, so much use is made of virtual sectors for wear-levelling purposes that logically consecutive sectors are often not physically consecutive.  Any advice to users to run defragmentation software I have to consider an obsolete hangover from obsolete storage systems, merely causing additional pointless wear.

> * After a big blob is deleted, my understanding is that the db file won’t shrink, so the space isn’t reclaimed to the filesystem. Instead the freed pages in the db file will be reused later on. If I urgently need to shrink the file, do I have to run a full (expensive) vacuum?

All the above is correct, as far as I know.  Pages and parts of pages are not released back to the filesystem; they’re just listed as 'free' and will be reused when SQLite needs more storage space for that database.  In your situation you should take a look at

<https://sqlite.org/pragma.html#pragma_auto_vacuum>

and use FULL mode, or perhaps use INCREMENTAL mode and perform the incremental vacuums as your app quits.  Note that both of these modes are somewhat misleadingly named: read the descriptions, don’t depend on the names.
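
[A minimal sketch of setting this up through the C API; one assumption worth flagging: on a database that already contains tables, changing auto_vacuum away from NONE only takes effect after a one-time VACUUM rebuilds the file:]

#include <sqlite3.h>

/* Sketch: switch a database to incremental auto-vacuum so that freed
** pages can later be handed back to the filesystem on demand. */
static int enable_incremental_autovacuum(sqlite3 *db) {
    int rc = sqlite3_exec(db, "PRAGMA auto_vacuum=INCREMENTAL;", 0, 0, 0);
    if (rc == SQLITE_OK)
        rc = sqlite3_exec(db, "VACUUM;", 0, 0, 0);   /* one-time rebuild */
    return rc;
}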

> * Is there any way to avoid a sqlite3_blob handle being invalidated when the blob’s row is deleted (via a DELETE on another db handle)? I’d prefer to have behavior like file deletion in Unix, where open file handles keep working even after a file is deleted. I’m wondering whether I can accomplish this by opening a read transaction before opening the blob, then leaving it open until the blob is closed. Since I’m using WAL, will this db handle keep a snapshot of the time before the blob’s row was deleted, allowing the blob handle to keep working?

I think the answer here is no (unless you play C-style pointer games with memory blocks), but I’ll have to let the devs answer this one.

Simon.

Re: Questions on big blobs and blob I/O

Olivier Mascia
> Le 29 mars 2017 à 19:40, Simon Slavin <[hidden email]> a écrit :
>
>> Since I’m using WAL, will this db handle keep a snapshot of the time before the blob’s row was deleted, allowing the blob handle to keep working?

As Simon wisely said, a word from the developers would clarify this better.

Still, since you're using WAL it should work.  Otherwise the implementation of the blob API would not obey the transaction/isolation rules, and I suppose that would qualify as a (horrible) bug.

--
Best Regards, Meilleures salutations, Met vriendelijke groeten,
Olivier Mascia, http://integral.software



Re: Questions on big blobs and blob I/O

Dan Kennedy-4
On 03/30/2017 12:52 AM, Olivier Mascia wrote:
>> Le 29 mars 2017 à 19:40, Simon Slavin <[hidden email]> a écrit :
>>
>>> Since I’m using WAL, will this db handle keep a snapshot of the time before the blob’s row was deleted, allowing the blob handle to keep working?
> As Simon said wisely, a word of the developers would clarify this better.
>
> Yet, since you're using WAL it should work.  Else the implementation of the blob API would not obey the transactional/isolation rules and I suppose it would qualify as a (horrible) bug.

Transactions are isolated in SQLite, and in WAL mode readers do not block writers. So it will work.

Separate database connections for the reader and the writer, of course.

Dan.
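
[A minimal sketch of the pattern being confirmed here, with hypothetical table and column names: the reader connection starts its own (deferred) transaction, the snapshot actually begins when sqlite3_blob_open performs the first read, and a DELETE committed by a separate writer connection does not disturb it:]

#include <sqlite3.h>

/* Sketch: hold a read snapshot open around the blob handle.  WAL mode;
** "reader" and the deleting writer are separate connections. */
static int read_blob_under_snapshot(sqlite3 *reader, sqlite3_int64 rowid,
                                    void *buf, int nbuf) {
    int rc = sqlite3_exec(reader, "BEGIN;", 0, 0, 0);
    if (rc != SQLITE_OK) return rc;

    sqlite3_blob *blob = NULL;
    rc = sqlite3_blob_open(reader, "main", "attachments", "body",
                           rowid, 0 /* read-only */, &blob);
    if (rc == SQLITE_OK) {
        /* A DELETE committed on another connection is invisible to this
        ** snapshot, so the handle keeps reading the old row version. */
        rc = sqlite3_blob_read(blob, buf, nbuf, 0);
        sqlite3_blob_close(blob);
    }
    sqlite3_exec(reader, "COMMIT;", 0, 0, 0);   /* release the snapshot */
    return rc;
}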



Re: Questions on big blobs and blob I/O

David Raymond
In reply to this post by Jens Alfke-2
Remember to make the blob field the very last field in your table schema, and avoid "select * from", otherwise you can create performance problems. The data for a record is stored in the same order as the fields are defined, and overflow pages are a linked list. So if you have your 100 MB blob first, and your ID second, then you will have to read through all 100 MB of the blob to get to your ID.
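
[A minimal sketch of that ordering, with hypothetical table and column names; the small columns come first and routine queries name only the columns they need:]

/* Sketch: the large column goes last, and list queries never ask for it. */
static const char *kSchema =
    "CREATE TABLE IF NOT EXISTS attachments("
    "  id    INTEGER PRIMARY KEY,"
    "  name  TEXT,"
    "  size  INTEGER,"
    "  body  BLOB"              /* big blob stored after every other field */
    ");";

static const char *kListQuery =
    "SELECT id, name, size FROM attachments;";   /* no "SELECT *" */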

For freeing up space, check out the incremental version of auto-vacuum. Rather than freeing it all up at once, it might be good to run it "incrementally" to avoid one big performance hit. Maybe have something run every so often to free up a set number of pages, similar to a scheduled WAL checkpointer.
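
[A minimal sketch of such a periodic step; it assumes auto_vacuum has already been set to INCREMENTAL as discussed earlier, and the function names are hypothetical:]

#include <sqlite3.h>

/* Sketch: how many pages are currently sitting on the freelist. */
static sqlite3_int64 free_page_count(sqlite3 *db) {
    sqlite3_stmt *stmt = NULL;
    sqlite3_int64 n = 0;
    if (sqlite3_prepare_v2(db, "PRAGMA freelist_count;", -1, &stmt, 0) == SQLITE_OK
        && sqlite3_step(stmt) == SQLITE_ROW) {
        n = sqlite3_column_int64(stmt, 0);
    }
    sqlite3_finalize(stmt);
    return n;
}

/* Sketch: run every so often (like a scheduled WAL checkpoint) and hand a
** bounded number of free pages back to the filesystem. */
static int incremental_vacuum_step(sqlite3 *db, int max_pages) {
    if (free_page_count(db) == 0) return SQLITE_OK;   /* nothing to reclaim */
    char sql[64];
    sqlite3_snprintf(sizeof(sql), sql, "PRAGMA incremental_vacuum(%d);", max_pages);
    return sqlite3_exec(db, sql, 0, 0, 0);
}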



Re: Questions on big blobs and blob I/O

Richard Hipp-3
On 3/29/17, David Raymond <[hidden email]> wrote:
> Remember to make the blob field the very last field in your table schema,
> and avoid "select * from", otherwise you can create performance problems.
> The data for a record is stored in the same order as the fields are defined,
> and overflow pages are a linked list. So if you have your 100 MB blob first,
> and your ID second, then you will have to read through all 100 MB of the
> blob to get to your ID.

David is correct.  You will do well to follow his advice.

However, for completeness let me add that sometimes SQLite is able to
avoid walking the whole linked list in order to get to data that lives
at the end.

There is a special optimization in the code, which only works when auto_vacuum is enabled (in either FULL or INCREMENTAL mode), that sometimes allows SQLite to skip over the intermediate pages of a big BLOB and go right to the page at the end that contains your ID.  Key points are that this optimization does not work every time (it falls back to walking the list if it fails), and it never works unless auto_vacuum is turned on.
--
D. Richard Hipp
[hidden email]

Re: Questions on big blobs and blob I/O

Stephen Chrzanowski
In reply to this post by David Raymond
@Jens;

I'd take what David suggested one step further: remove the BLOB from the tables of concern entirely, keep it in its own table with a 1:1 PK:FK relationship, and only look at the BLOB table when required.  That way, if you do an occasional [select * from ...] you're not reading the BLOB, because it's living somewhere else.

I make this suggestion not just for SQLite but for any DBMS I've developed for: keep the really big data somewhere else, with a relationship to the primary resource that needs it, and only query that massive chunk when required, through a specialized view or select.
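
[A minimal sketch of that layout, with hypothetical table names; everyday queries touch only the small table, and the blob is pulled from the side table on demand:]

/* Sketch: big data lives in a 1:1 side table keyed by the same id. */
static const char *kSideTableSchema =
    "CREATE TABLE IF NOT EXISTS docs("
    "  id    INTEGER PRIMARY KEY,"
    "  name  TEXT"
    ");"
    "CREATE TABLE IF NOT EXISTS doc_blobs("
    "  doc_id INTEGER PRIMARY KEY REFERENCES docs(id) ON DELETE CASCADE,"
    "  body   BLOB"
    ");";

static const char *kListDocs  = "SELECT id, name FROM docs;";
static const char *kFetchBlob = "SELECT body FROM doc_blobs WHERE doc_id = ?1;";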
