Custom collation of blobs


Jens Alfke-2
We are using SQLite blobs to store some structured values, and need control over how they are collated in queries, i.e. memcmp is not the correct ordering. We’ve registered a custom collating function, but unfortunately it doesn’t get called. According to the docs, collating functions are only used to compare strings, not any other data type. Is there any way around this limitation?

The only workaround I can think of is to define a custom function that converts a blob to a string and collate using those strings — e.g. `… ORDER BY collatable_blob(b)`. But this requires expensive string conversions, and it doesn’t work well with indexes.

It would be much cleaner and more efficient if there were a type of collating function that operated on all data types.
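A minimal Python sketch of the workaround described above, using the standard `sqlite3` module. The two-byte blob layout and the rule "sort by the second byte" are invented purely for illustration; the point is that the UDF emits a string whose memcmp order matches the desired blob order.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical layout: two bytes per blob, desired order is by the
# *second* byte. Reversing before hex-encoding makes the string's
# memcmp order match the desired blob order.
def collatable_blob(b):
    return bytes(reversed(b)).hex()

con.create_function("collatable_blob", 1, collatable_blob)

con.execute("CREATE TABLE t(b BLOB)")
con.executemany("INSERT INTO t VALUES (?)",
                [(bytes([1, 9]),), (bytes([2, 0]),)])

# Plain memcmp would put [1, 9] first; the UDF puts [2, 0] first.
rows = [r[0] for r in
        con.execute("SELECT b FROM t ORDER BY collatable_blob(b)")]
```

As noted, the conversion runs for every row compared, and a plain index on `b` cannot be used for this ordering (an expression index on `collatable_blob(b)` would need the function marked deterministic).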

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: Custom collation of blobs

Dominique Devienne
On Fri, Apr 26, 2019 at 7:36 PM Jens Alfke <[hidden email]> wrote:

> We are using SQLite blobs to store some structured values, and need
> control over how they are collated in queries, i.e. memcmp is not the
> correct ordering. We’ve registered a custom collating function, but
> unfortunately it doesn’t get called. According to the docs, collating
> functions are only used to compare strings, not any other data type. Is
> there any way around this limitation?
>
> The only workaround I can think of is to define a custom function that
> converts a blob to a string and collate using those strings — e.g. `… ORDER
> BY collatable_blob(b)`. But this requires expensive string conversions, and
> it doesn’t work well with indexes.
>
> It would be much cleaner and more efficient if there were a type of
> collating function that operated on all data types.
>

You are hitting what is IMHO one of the last big remaining holes in
SQLite, which is its lack of user-defined types (UDTs) for persistent
columns.
(For transient values in function call chains we have subtypes nowadays,
which is as close to a UDT as SQLite gets.)

Your blob obviously has internal structure, and while you're happy to store
it as an opaque blob, you'd like it to sort differently.
You can of course work around the lack of a real UDT by assigning a custom
collation to the column, but really the collation belongs on the "type",
not the column.

Coming back to your more specific question: aside from DRH adding support
for user-defined collations on blob columns (a long shot, I suspect),
you could store your "structured values" as strings instead of blobs.
AFAIK, SQLite does not check that a string is valid UTF-8, so it can store
arbitrary binary too.
As long as you do not perform encoding conversions between UTF-8 and UTF-16,
and you do not call SQL functions that assume valid UTF-8, you should be
fine. And now you can have your custom collation.
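A small sketch of this in Python's `sqlite3` (which does consult custom collations for TEXT, while skipping them for blobs). Python hands the collation `str` values, so this example sticks to ASCII-safe "binary" to sidestep the UTF-8 caveat above; the sort rule itself is made up for illustration:

```python
import sqlite3

calls = []

def mycoll(a, b):
    # Hypothetical rule: order values by their *last* character.
    calls.append((a, b))          # record that we actually got called
    ra, rb = a[::-1], b[::-1]
    return (ra > rb) - (ra < rb)

con = sqlite3.connect(":memory:")
con.create_collation("mycoll", mycoll)

# TEXT column: the collation IS consulted, unlike for BLOB values.
con.execute("CREATE TABLE t(v TEXT COLLATE mycoll)")
con.executemany("INSERT INTO t VALUES (?)", [("az",), ("za",)])

rows = [r[0] for r in con.execute("SELECT v FROM t ORDER BY v")]
```

Here `"za"` sorts first (its last character is `a`), and `calls` is non-empty, confirming the collation ran.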

Not friendly to users of the DB, who might see garbage when printing those
"binary" strings, though.
But at this point, that's the only thing you can do that avoids the
expensive conversions you pointed out. --DD

Re: Custom collation of blobs

R Smith-2
To add to Dominique's suggestion: we use this approach a lot and have
more or less standardized it internally. Of course 1NF dictates that this
is not the true RDBMS way, but sometimes you need blobs because you just do.

I'm sure you have already figured out how to do it sans blob collations,
and that your question is aimed only at the collations, not at alternate
ideas, but on the off chance - here goes:

What we do is compress the blob if needed[1], then translate it to a
string using good old 3-to-4 translation (aka Base64)[2], which is rather
efficient (much more so than bin -> hex, which is 1-to-2 and doubles the
size). We then prepend a prefix to the string "blob": a unique key for
the blob, which is also what it sorts by, plus a couple more characters
that say (a) whether it is compressed and (b) the blob version, so that
future software changes can still read the current blob layout.
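A rough sketch of that pack/unpack scheme in Python. The field widths, flag characters, and the fixed-width sort key are my own invented choices, not the poster's actual format; only the overall shape (sort key + compression flag + version + Base64 body) follows the description above:

```python
import base64
import zlib

COMPRESS_OVER = 2560  # threshold from footnote [1]; tune for your data


def pack(sort_key: str, payload: bytes, version: int = 1) -> str:
    compress = len(payload) > COMPRESS_OVER
    body = zlib.compress(payload) if compress else payload
    # Sort key goes first so string comparison orders rows by it,
    # then one character each for the compression flag and version.
    flag = "C" if compress else "P"
    return sort_key + flag + str(version) + base64.b64encode(body).decode("ascii")


def unpack(s: str, key_len: int):
    sort_key = s[:key_len]
    flag, version = s[key_len], int(s[key_len + 1])
    body = base64.b64decode(s[key_len + 2:])
    payload = zlib.decompress(body) if flag == "C" else body
    return sort_key, version, payload
```

With a fixed-width, memcmp-ordered key prefix, an ordinary TEXT column comparison (or the BINARY collation) sorts rows by the key without touching the payload.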

From there it's an easy step to make a standardized blob-storage utility
with pack/unpack to some structure, JSON, etc.  We even have these as
UDFs, though I must say that in practice we never really use the UDF
versions, since the content of the blob never really comes into play in
SQL - and when it does, we'd rather add real schema columns for that bit
of the information.[3]

If sorting is all you need from all this, it's rather easy to simply add
another column to the table that contains the sort-by key, and use that
column whenever you sort or compare in a query.
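A sketch of that extracted-key approach; the blob layout (little-endian u16 key at offset 2) is made up for illustration:

```python
import sqlite3
import struct

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE t(
    sort_key INTEGER,  -- extracted from the blob once, at insert time
    blob     BLOB      -- opaque payload, kept last in the row
)""")
# The key column is an ordinary column, so it is indexable.
con.execute("CREATE INDEX t_by_key ON t(sort_key)")

def insert(con, b):
    # Hypothetical layout: the key is a little-endian u16 at offset 2.
    (key,) = struct.unpack_from("<H", b, 2)
    con.execute("INSERT INTO t(sort_key, blob) VALUES (?, ?)", (key, b))

insert(con, bytes([0, 0, 7, 0, 99]))   # key 7
insert(con, bytes([0, 0, 1, 0, 42]))   # key 1
rows = [r[0] for r in con.execute("SELECT blob FROM t ORDER BY sort_key")]
```

Unlike the `ORDER BY some_udf(blob)` form, this sort can use the index directly.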

Lastly, you must already be aware of this, but just in case someone else
is reading: if you have more than one column, place the blob column at
the end of the table definition.


Cheers,
Ryan

[1]: We start compressing at Len > 2560 bytes - we found this number
through running some research; a different figure might work for you.
The compression algorithm varies for us between maximizing size saving
vs. speed. We can see a size saving well before 2560 bytes, but for the
general case the total time of Base64.decode(compressed data) +
LZ.decompress() only starts beating Base64.decode(non-compressed data)
sporadically after about 3 thousand bytes. So even though the size
saving starts under 1K, a speed saving only starts around 3K+. Also, in
case anyone wonders: Base64 output compresses really badly - there is no
point in doing Base64 first and then compressing; it's always a loss.
Mind you, this was long ago; perhaps I will find the code, modernize it,
and run the tests again - some things may have changed.

[2]: As an aside: because the only real problem is the NUL character in
strings, we designed another Base64-like translation that, instead of
8-bit to 6-bit (3x8 bits -> 4x6 bits), does 8-bit to 7-bit (7x8 bits ->
8x7 bits) using a full 128-character palette plus a few control
characters - not hard to find in a 256-character ASCII range (well,
about 220-ish usable characters). It gives a great 7/8 size ratio (as
opposed to 3/4), but it breaks UTF-8 encoding (because of bit 7), so you
must never pass the stored strings through a UTF-8 translation. In fact,
after some work on this we decided that the size saving from 3/4 to 7/8
(or put another way: 25% waste down to 12.5% waste) is just not worth
the complication and the non-standard translation. Perhaps one day when
we have blobs of 50 megabytes+, but by then binding them as actual BLOBs
and adding a couple of other control columns would be the better answer.
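For the curious, here is one simplified way such a 7x8bit -> 8x7bit repacking can look. This is not the poster's palette scheme: instead of a custom character table, it just sets bit 7 on every output byte, which likewise guarantees no NUL ever appears (and likewise breaks UTF-8):

```python
def encode_7in8(data: bytes) -> bytes:
    """Repack 8-bit bytes into 7-bit groups, one group per output byte,
    with bit 7 set so the output never contains a NUL byte."""
    out, acc, bits = [], 0, 0
    for byte in data:
        acc = (acc << 8) | byte
        bits += 8
        while bits >= 7:
            bits -= 7
            out.append(((acc >> bits) & 0x7F) | 0x80)
    if bits:  # flush leftover bits, zero-padded on the right
        out.append(((acc << (7 - bits)) & 0x7F) | 0x80)
    return bytes(out)


def decode_7in8(enc: bytes, orig_len: int) -> bytes:
    """Inverse of encode_7in8; orig_len discards the padding bits."""
    out, acc, bits = [], 0, 0
    for byte in enc:
        acc = (acc << 7) | (byte & 0x7F)
        bits += 7
        if bits >= 8:
            bits -= 8
            out.append((acc >> bits) & 0xFF)
    return bytes(out[:orig_len])
```

Seven input bytes (56 bits) become exactly eight output bytes, giving the 7/8 ratio described above.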

[3]: String storage has a hard limit of ~2 GB in some systems, though
you should have opted for another way of storage long before you reach
that limit.


On 2019/04/27 12:35 PM, Dominique Devienne wrote:

> [snip]

Re: Custom collation of blobs

James K. Lowden
In reply to this post by Jens Alfke-2
On Fri, 26 Apr 2019 10:36:34 -0700
Jens Alfke <[hidden email]> wrote:

> The only workaround I can think of is to define a custom function
> that converts a blob to a string and collate using those strings —
> e.g. `… ORDER BY collatable_blob(b)`. But this requires expensive
> string conversions,

Your UDF need not convert to a string.  It could return an integer or,
if that's not enough, couldn't it return a different blob whose bit
order yields the correct ordering?  
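A sketch of the "return a different blob" idea, in Python's `sqlite3`. The blob layout (two little-endian u16 fields, sorted by the second) is invented for illustration; re-emitting the key field big-endian makes plain memcmp order agree with numeric order, with no string conversion at all:

```python
import sqlite3
import struct

# Hypothetical layout: two little-endian u16 fields; sort by the second.
def blob_key(b):
    _, key = struct.unpack("<HH", b)
    # Big-endian bytes memcmp-sort in the same order as the integers.
    return struct.pack(">H", key)

con = sqlite3.connect(":memory:")
con.create_function("blob_key", 1, blob_key, deterministic=True)
con.execute("CREATE TABLE t(b BLOB)")
con.executemany("INSERT INTO t VALUES (?)",
                [(struct.pack("<HH", 1, 500),),
                 (struct.pack("<HH", 2, 3),)])
rows = [struct.unpack("<HH", r[0]) for r in
        con.execute("SELECT b FROM t ORDER BY blob_key(b)")]
```

Marking the function deterministic (Python 3.8+) also allows it in an expression index, which addresses the indexing complaint.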

> and it doesn’t work well with indexes.

If indexing is important, that's a clue that the blob is not your
friend to start with.  It might be worthwhile to store the searchable
components in "redundant" columns that can be readily indexed.  By
analogy, if your blob is a PDF, it might be convenient to store the
author and title in columns for search & display purposes.  

If you write UDFs to pull out the pieces, you could invoke them in a
trigger to keep the "redundant" columns consistent.  
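A sketch of that trigger idea; the extractor UDF and the "author is the first NUL-terminated field" layout are stand-ins for real metadata parsing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical extractor UDF: the author is the first NUL-terminated
# field of the blob (a stand-in for real PDF metadata parsing).
con.create_function("blob_author", 1,
                    lambda b: b.split(b"\x00")[0].decode("ascii"))

con.execute("CREATE TABLE docs(author TEXT, body BLOB)")
con.execute("""
    CREATE TRIGGER docs_ai AFTER INSERT ON docs BEGIN
        UPDATE docs SET author = blob_author(NEW.body)
        WHERE rowid = NEW.rowid;
    END""")

con.execute("INSERT INTO docs(body) VALUES (?)", (b"Alfke\x00payload",))
author = con.execute("SELECT author FROM docs").fetchone()[0]
```

A matching AFTER UPDATE OF body trigger would keep the column consistent when blobs are rewritten.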

--jkl