Consequences of lexicographic sorting of keys in SQLite4?

classic Classic list List threaded Threaded
36 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Consequences of lexicographic sorting of keys in SQLite4?

Nico Williams
So, if I understand section 3.2 of the SQLite4 design page then it
will often be the case that lookup keys will not be stored in an order
that will be useful for optimizing common ORDER BY expressions.  Is
this correct?  If so, is this worth the trade-off for the single
key/value storage complexity?  Or is this wrong because the key
comparison function to be used will be aware of of data typing in the
encoding of the keys?

Nico
--
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Consequences of lexicographic sorting of keys in SQLite4?

Richard Hipp-3
On Fri, Jun 29, 2012 at 3:40 PM, Nico Williams <[hidden email]>wrote:

> So, if I understand section 3.2 of the SQLite4 design page then it
> will often be the case that lookup keys will not be stored in an order
> that will be useful for optimizing common ORDER BY expressions.  Is
> this correct?


Not correct.  The keys are encoded (see
http://www.sqlite.org/src4/doc/trunk/www/key_encoding.wiki) in a way that
causes a lexicographical ordering of the keys to correspond to what the
user wants out of ORDER BY.  So indices can still be used for fulfilling
ORDER BY.



>  If so, is this worth the trade-off for the single
> key/value storage complexity?  Or is this wrong because the key
> comparison function to be used will be aware of of data typing in the
> encoding of the keys?
>
> Nico
> --
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Consequences of lexicographic sorting of keys in SQLite4?

Nico Williams
On Fri, Jun 29, 2012 at 2:48 PM, Richard Hipp <[hidden email]> wrote:

> On Fri, Jun 29, 2012 at 3:40 PM, Nico Williams <[hidden email]>wrote:
>> So, if I understand section 3.2 of the SQLite4 design page then it
>> will often be the case that lookup keys will not be stored in an order
>> that will be useful for optimizing common ORDER BY expressions.  Is
>> this correct?
>
>
> Not correct.  The keys are encoded (see
> http://www.sqlite.org/src4/doc/trunk/www/key_encoding.wiki) in a way that
> causes a lexicographical ordering of the keys to correspond to what the
> user wants out of ORDER BY.  So indices can still be used for fulfilling
> ORDER BY.

Ah, I hadn't seen that.  That's very clever.

>>  If so, is this worth the trade-off for the single
>> key/value storage complexity?  Or is this wrong because the key
>> comparison function to be used will be aware of of data typing in the
>> encoding of the keys?
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Consequences of lexicographic sorting of keys in SQLite4?

Cory Nelson
In reply to this post by Richard Hipp-3
On Fri, Jun 29, 2012 at 2:48 PM, Richard Hipp <[hidden email]> wrote:

> On Fri, Jun 29, 2012 at 3:40 PM, Nico Williams <[hidden email]
> >wrote:
>
> > So, if I understand section 3.2 of the SQLite4 design page then it
> > will often be the case that lookup keys will not be stored in an order
> > that will be useful for optimizing common ORDER BY expressions.  Is
> > this correct?
>
>
> Not correct.  The keys are encoded (see
> http://www.sqlite.org/src4/doc/trunk/www/key_encoding.wiki) in a way that
> causes a lexicographical ordering of the keys to correspond to what the
> user wants out of ORDER BY.  So indices can still be used for fulfilling
> ORDER BY.
>

What is the rationale for the 7-bit BINARY encoding? The performance impact
will surely outweigh any convenience of being able to treat blobs as
0-terminated strings.

--
Cory Nelson
http://int64.org
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Consequences of lexicographic sorting of keys in SQLite4?

Nico Williams
On Fri, Jun 29, 2012 at 4:39 PM, Cory Nelson <[hidden email]> wrote:
> On Fri, Jun 29, 2012 at 2:48 PM, Richard Hipp <[hidden email]> wrote:
> What is the rationale for the 7-bit BINARY encoding? The performance impact
> will surely outweigh any convenience of being able to treat blobs as
> 0-terminated strings.

I tend to agree.  Varint length + value should be faster, though this
just needs to be measured, and it's likely that the 7-bit approach
will be faster for very short blobs.  Another possibility would be to
use 31-bit encoding ;)  or escape zero-bytes, and so on.

Nico
--
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Consequences of lexicographic sorting of keys in SQLite4?

Richard Hipp-3
On Fri, Jun 29, 2012 at 6:09 PM, Nico Williams <[hidden email]>wrote:

> On Fri, Jun 29, 2012 at 4:39 PM, Cory Nelson <[hidden email]> wrote:
> > On Fri, Jun 29, 2012 at 2:48 PM, Richard Hipp <[hidden email]> wrote:
> > What is the rationale for the 7-bit BINARY encoding? The performance
> impact
> > will surely outweigh any convenience of being able to treat blobs as
> > 0-terminated strings.
>
> I tend to agree.  Varint length + value should be faster,


varint+value does not sort BLOBs in lexicographical order.

Not having a distinct terminator for the BLOB means that two BLOBs where
one is a prefix of the other might not compare correctly.




> though this
> just needs to be measured, and it's likely that the 7-bit approach
> will be faster for very short blobs.  Another possibility would be to
> use 31-bit encoding ;)  or escape zero-bytes, and so on.
>
> Nico
> --
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Consequences of lexicographic sorting of keys in SQLite4?

Nico Williams
On Fri, Jun 29, 2012 at 5:24 PM, Richard Hipp <[hidden email]> wrote:
> varint+value does not sort BLOBs in lexicographical order.
>
> Not having a distinct terminator for the BLOB means that two BLOBs where
> one is a prefix of the other might not compare correctly.

Would 31-bit encoding help?
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Consequences of lexicographic sorting of keys in SQLite4?

Richard Hipp-3
On Fri, Jun 29, 2012 at 6:40 PM, Nico Williams <[hidden email]>wrote:

> On Fri, Jun 29, 2012 at 5:24 PM, Richard Hipp <[hidden email]> wrote:
> > varint+value does not sort BLOBs in lexicographical order.
> >
> > Not having a distinct terminator for the BLOB means that two BLOBs where
> > one is a prefix of the other might not compare correctly.
>
> Would 31-bit encoding help?
>

Maybe.

But you know:  How often do people use BLOBs as keys?  What other SQL
engines other than SQLite even allow BLOBs as keys?  Are we trying to
optimize something that is never actually used?

Note that the data
encoding<http://www.sqlite.org/src4/doc/trunk/www/data_encoding.wiki>for
BLOBs (used in the overwhelmingly common case of when the BLOB is not
the key) does have a simple size+content format.


> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Consequences of lexicographic sorting of keys in SQLite4?

Nico Williams
OK, I give :)
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Consequences of lexicographic sorting of keys in SQLite4?

Niall O'Reilly
In reply to this post by Richard Hipp-3

On 29 Jun 2012, at 23:58, Richard Hipp wrote:

> But you know:  How often do people use BLOBs as keys?  What other SQL
> engines other than SQLite even allow BLOBs as keys?  Are we trying to
> optimize something that is never actually used?

        For an IPAM application I have on my back burner, BLOB seems
        a natural way to express IPv[46] addresses, ranges, and prefixes.
        A bulkier alternative would be hexadecimal encoding as text.

        /Niall

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Consequences of lexicographic sorting of keys in SQLite4?

Dan Kennedy-4
On 07/02/2012 04:29 PM, Niall O'Reilly wrote:

>
> On 29 Jun 2012, at 23:58, Richard Hipp wrote:
>
>> But you know:  How often do people use BLOBs as keys?  What other SQL
>> engines other than SQLite even allow BLOBs as keys?  Are we trying to
>> optimize something that is never actually used?
>
> For an IPAM application I have on my back burner, BLOB seems
> a natural way to express IPv[46] addresses, ranges, and prefixes.
> A bulkier alternative would be hexadecimal encoding as text.

That would be a reasonable use. But the blob in this case will be what,
eight bytes (or 10 in its encoded form)? So the encoding and decoding
(it's actually not clear we will ever want to decode, but anyhow) isn't
going to cost much in the way of CPU. And making the keys memcmp()
compatible allows some other optimizations - prefix compression and so
on. Plus I think memcmp() will be generally faster than any other
type of comparison.

Creating and using an index on larger blobs might be different of
course.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Consequences of lexicographic sorting of keys in SQLite4?

Niall O'Reilly

On 2 Jul 2012, at 10:51, Dan Kennedy wrote:

> That would be a reasonable use. But the blob in this case will be what,
> eight bytes (or 10 in its encoded form)?

        10, 18, 34, or 66, depending on which of six classes [*] of object
        is involved, using the encoding I have in mind at the moment.
        Still small.

        * 2x address families, 3x kinds of object (address, prefix, range).

        /Niall

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Consequences of lexicographic sorting of keys in SQLite4?

Simon Slavin-3
In reply to this post by Niall O'Reilly

On 2 Jul 2012, at 10:29am, Niall O'Reilly <[hidden email]> wrote:

> On 29 Jun 2012, at 23:58, Richard Hipp wrote:
>
>> But you know:  How often do people use BLOBs as keys?  What other SQL
>> engines other than SQLite even allow BLOBs as keys?  Are we trying to
>> optimize something that is never actually used?
>
> For an IPAM application I have on my back burner, BLOB seems
> a natural way to express IPv[46] addresses, ranges, and prefixes.
> A bulkier alternative would be hexadecimal encoding as text.

Strikes me as premature optimisation.  Storing them as strings of decimal or hex (with, of course, leading zeros) would allow you to sort them meaningfully, take substrings meaningfully, and to understand the contents of your file when displayed using debugging tools.  If you do that, and the results turn out to be too slow for your user(s), /then/ revisit ideas of making things faster or more compact.

Worth remembering that BLOBs don't have a well-ordering function.  You can compare two BLOBs and tell whether they're the same (usually, but lossless encoding defeats this), but if they're not the same you can't put one 'before' the other.

This is because BLOBs are essentially black boxes.  You have no idea what the data represents.  If you know what it represented, you'd probably be storing it as text or a number.  Think of storing images as BLOBs.  How do you compare two images ?  Is one before another because it is smaller ?  Or because it contains darker pixels (lower brightness) ?  Or because the EXIF information says it was taken on an earlier date ?  Or because it's an earlier frame in the animation you're making ?

So if a function like building an index requires an ordering function, you can't use it on a BLOB.  Now, as it happens, BLOBs are stored as octets, and if the functionality is presented there's no harm in sorting them as if they're octet-streams.  But it doesn't really mean anything.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Blobs and ordering [was: Consequences of lexicographic sorting of keys in SQLite4?]

Niall O'Reilly
        Simon,

        Thanks for your considered comments.

On 2 Jul 2012, at 12:20, Simon Slavin wrote:

> Worth remembering that BLOBs don't have a well-ordering function.  You can compare two BLOBs and tell whether they're the same (usually, but lossless encoding defeats this), but if they're not the same you can't put one 'before' the other.

        OK, in the general case.

> This is because BLOBs are essentially black boxes.  You have no idea what the data represents.

        If I'm responsible for the data, I can take care that applying memcmp()
        to two BLOBs is meaningful.

>  If you know what it represented, you'd probably be storing it as text or a number.

        I'm not sure I can depend on having 128-bit unsigned integers available.

        Notational options make normalization necessary for text.  With BLOB, I can
        use the result from inet_pton(); with TEXT, I have to apply inet_ntop() to
        the result of inet_pton().  Old-school parsimony makes me disinclined to
        do this.  Perhaps I need to lighten up?

>  Think of storing images as BLOBs.  How do you compare two images ?

        I don't think the analogy applies.  
        Images belong to a different specialization of the same base class.

        Thanks again,

        Niall O'Reilly

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Consequences of lexicographic sorting of keys in SQLite4?

Nico Williams
In reply to this post by Niall O'Reilly
On Mon, Jul 2, 2012 at 4:29 AM, Niall O'Reilly <[hidden email]> wrote:

>
> On 29 Jun 2012, at 23:58, Richard Hipp wrote:
>
>> But you know:  How often do people use BLOBs as keys?  What other SQL
>> engines other than SQLite even allow BLOBs as keys?  Are we trying to
>> optimize something that is never actually used?
>
>         For an IPAM application I have on my back burner, BLOB seems
>         a natural way to express IPv[46] addresses, ranges, and prefixes.
>         A bulkier alternative would be hexadecimal encoding as text.

That reminds me: it'd be nice to have a bit string type, since the
correct way to sort IPv4 CIDR blocks is as bit strings.  This is also
a proper way to sort IPv6 blocks.  Alternatively, it'd be nice to have
native IP address types in SQLite4, as otherwise one has to jump
through hoops to handle IP addresses properly.

Nico
--
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

IP address support (was: Consequences of lexicographic sorting of keys in SQLite4?)

Niall O'Reilly

On 2 Jul 2012, at 16:13, Nico Williams wrote:

> That reminds me: it'd be nice to have a bit string type, since the
> correct way to sort IPv4 CIDR blocks is as bit strings.

        Nice, definitely!

> This is also
> a proper way to sort IPv6 blocks.  Alternatively, it'd be nice to have
> native IP address types in SQLite4, as otherwise one has to jump
> through hoops to handle IP addresses properly.

        Bit strings would be more general.
        Native IP would remove a sometimes-asserted motivation for preferring
        PostgreSQL.

        As I see it, ranges, as well as single addresses and CIDR prefixes,
        need to be supported, perhaps like the Perl Net::IP module does.

        With some care over the encoding, a natural ordering arises which
        places nested prefixes, ranges, and individual addresses in the
        "right" order.  This would eliminate as much as possible of the
        hoop-jumping.

        I'll try to put together some examples of as illustrations.

        /Niall

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: IP address support (was: Consequences of lexicographic sorting of keys in SQLite4?)

Nico Williams
The key is to come up with a bit string encoding in bytes that is
suitable for use in table keys -- they have to sort correctly when
sorted lexicographically.  The encoding should be reasonably
efficient; one byte per-bit, for example, would be too inefficient
(though in a pinch much better than no bit string support).  I'm
thinking of a 7 bits / byte encoding, with the remaining indicating
the last byte of the encoding, which byte would contain up to 4 bits
of the bit string plus 3 bits to encode the length of the bit string
mod 8.

Nico
--
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: IP address support (was: Consequences of lexicographic sorting of keys in SQLite4?)

Nico Williams
Ah, if you encode any bit string as a BLOB such that it ends in 3 bits
that encode the length of the string mod 8, and with 7 - length of
string mod 8 preceding zero-valued bits then you get a result that
should sort [lexicographically] correctly, no?

So bit string would be a trivial extension of BLOB, and possibly no
special support should be required in SQLite (3 or 4).

Nico
--
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: IP address support (was: Consequences of lexicographic sorting of keys in SQLite4?)

Nico Williams
So an IPv4 CIDR block like 10.2.93.128/25 would encode as x'0A025D81'
and 10.2.93.128/26 as x'0A025D82', and so on, with 10.2.93.128/32
encoded as x'0A025D8000' (that's 5 bytes).  That is, IPv4 addresses
would require one more byte than usual.

I'm not sure that we can justify the extra complexity for a native bit
string type, but fwiw the literal syntax for it should probably be the
same as blob, with a /<length mod 8> at the end.  Still, native bit
string support would be handy.

Nico
--
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: IP address support (was: Consequences of lexicographic sorting of keys in SQLite4?)

Niall O'Reilly

On 2 Jul 2012, at 17:52, Nico Williams wrote:

> So an IPv4 CIDR block like 10.2.93.128/25 would encode as x'0A025D81'
> and 10.2.93.128/26 as x'0A025D82', and so on, with 10.2.93.128/32
> encoded as x'0A025D8000' (that's 5 bytes).  That is, IPv4 addresses
> would require one more byte than usual.

        You're missing some cases which I would find indispensible.
        I have a trip tomorrow.  I may be able to use the plane time
        to think about your examples above and to put together some
        complementary ones of my own.

        /Niall

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
12