How to index data based on custom comparisons?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

How to index data based on custom comparisons?

Lifepillar
I am implementing an extension for manipulating IEEE754 decimal
numbers. Numbers are stored as blobs using a standard encoding.
Numbers that are mathematically equal may have different
representations, (e.g., 1.0 may have mantissa 10 and exponent -1
while 1.00 may have mantissa 100 and exponent -2).

Since I am going to perform point and range queries on decimal
columns, I'd like to have them indexed. So far, I have been able to
create an expression index based on a decstr() function that
converts a decimal into a string, which I can use in queries like
the following:

     select decstr(d) from T where decstr(d) = '1.2345';

(Btw, if I use `like`, as in `decstr(d) like '1.2%'`, the index is
not used. Does it depend on my data, or can't the optimizer use an
index with a pattern matching condition?)

Anyway, string-based comparisons are limited. But, (correct me if
I am wrong), if I index the blob column directly, comparisons are
based on memcpy(), which in my case is not what I want. Is it
possible to create an index that somehow uses a custom comparison
function instead? E.g., I have a deccmp(x,y) function that returns
-1 if x<y, 0 if x=y, and 1 if x>y. Can I define an index based on
that?

Thanks,
Life.


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: How to index data based on custom comparisons?

Richard Hipp-3
On 12/13/17, Lifepillar <[hidden email]> wrote:
>
> if I index the blob column directly, comparisons are
> based on memcpy(), which in my case is not what I want. Is it
> possible to create an index that somehow uses a custom comparison
> function instead?

No.  SQLite always uses memcmp() to compare BLOBs.  You can add a
collating function for strings, but not for BLOBs.


--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: How to index data based on custom comparisons?

Simon Slavin-3
In reply to this post by Lifepillar


On 13 Dec 2017, at 8:34pm, Lifepillar <[hidden email]> wrote:

> But, (correct me if
> I am wrong), if I index the blob column directly, comparisons are
> based on memcpy(), which in my case is not what I want. Is it
> possible to create an index that somehow uses a custom comparison
> function instead? E.g., I have a deccmp(x,y) function that returns
> -1 if x<y, 0 if x=y, and 1 if x>y. Can I define an index based on
> that?

As Dr H wrote, it can’t be done.  Either store a normalised (numeric) version of the number, or store both the BLOB and a normalised version.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: How to index data based on custom comparisons?

Keith Medcalf
In reply to this post by Lifepillar

On Wednesday, 13 December, 2017 13:35, Lifepillar <[hidden email]> wrote:

>I am implementing an extension for manipulating IEEE754 decimal
>numbers. Numbers are stored as blobs using a standard encoding.
>Numbers that are mathematically equal may have different
>representations, (e.g., 1.0 may have mantissa 10 and exponent -1
>while 1.00 may have mantissa 100 and exponent -2).

You have stated something that is impossible, or at least self-contradictory.  Unless, of course, you are talking about the "decimal" formats of IEEE754-2008 and not the standard (far more common) "binary" formats.

You cannot have an IEEE754 (binary) number stored in denormalized format *except* in the circumstance where the exponent indicates that it is a denormalized number.  There are only two valid exponents to indicate that the number is denormalized, and all denormalized numbers are only used to represent numbers between +/- (0 and epsilon).  All other numbers stored in IEEE754 floating point format are required, by the standard, to be normalized. (That is, where the MSB is 1 and that 1 is not stored as part of the significand).

So in order for the numbers to be IEEE754 floating point, the number "1.0" (no matter the number of trailing 0's you choose to display) must always be stored with a mantissa of 0.5 and an exponent of 1.  Although a mantissa of 0.25 with exponent 2 evaluates also to the number 1.0, in IEEE754 format it must always have a mantissa of 0.5 and exponent of 1.

Note the above two paragraphs only apply to binary IEE754-2008 numbers.  These are the only kind of "floating point" presently understood by SQLite3.

However, if you are talking about the "decimal" IEEE754 then you can indeed have different representations of the same "value".  Some "values" can have about 800 different representations of the same value.  (Note that the solution below would work even if the blobs were arbitrary precision IBM GDAS floating point numbers, or any other kind of floating point number, pretty much).


To answer your question however, I would recommend that you consider writing a function that return the "value" in a supported format.  My personal recommendation would be IEEE754-2008 binary64 (that is, the standard double precision floating point format supported by SQLite3).  You would tag this function as being CONSTANT/DETERMINISTIC.  

You could then create an index on the result.

CREATE INDEX decimal64blob_to_binary64 ON MyTable (ConvertDecimal64toBinary64(binary64blob_field));

and when you search the index ala:

SELECT * FROM MyTable where ConvertDecimal64toBinary64(binary64blob_field) between 47.0 and 47.1;

you will use the index (I believe).  And you need not specify conversion to string format nor deal with the vagaries of strings.  You just have to deal with the standard binary floating point limitations.

On the other hand however if you do NOT need binary64 at all, then there was a minor change discussed a while back by someone else where you can "change" the default floating-point number format from binary64 to decimal64 and then compile your own custom version of SQLite3 ...

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.




_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: How to index data based on custom comparisons?

Lifepillar
On 14/12/2017 00:02, Keith Medcalf wrote:

>
> On Wednesday, 13 December, 2017 13:35, Lifepillar <[hidden email]> wrote:
>
>> I am implementing an extension for manipulating IEEE754 decimal
>> numbers. Numbers are stored as blobs using a standard encoding.
>> Numbers that are mathematically equal may have different
>> representations, (e.g., 1.0 may have mantissa 10 and exponent -1
>> while 1.00 may have mantissa 100 and exponent -2).
>
> You have stated something that is impossible, or at least self-contradictory.  Unless, of course, you are talking about the "decimal" formats of IEEE754-2008 and not the standard (far more common) "binary" formats.

Yes, I am talking about decimal IEEE754-2008 format, not binary. I
thought this would be clear from my use of the word "decimal" and from
my example.

> On the other hand however if you do NOT need binary64 at all, then there was a minor change discussed a while back by someone else where you can "change" the default floating-point number format from binary64 to decimal64 and then compile your own custom version of SQLite3 ...

Thanks for the tip! I will search for that discussion, although I would
rather not modify SQlite3 source code if I can find another solution.

I am not familiar with virtual tables yet, but I see that they are used,
for example, to implement Rtree indexes. Would it be feasible to
implement my own index structure as a virtual table and use it to index
a blob column in a standard table (or even just in the virtual table
itself)? I mean, would the optimizer be able to take advantage of such
an index?

(Sorry if my questions sound naive, I'm still pretty new to SQLite.)

Life.

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: How to index data based on custom comparisons?

Richard Hipp-3
On 12/14/17, Lifepillar <[hidden email]> wrote:

> I am not familiar with virtual tables yet, but I see that they are used,
> for example, to implement Rtree indexes. Would it be feasible to
> implement my own index structure as a virtual table and use it to index
> a blob column in a standard table (or even just in the virtual table
> itself)?

That would be complicated.

A different idea.  Suppose you have two new UDFs:

ieee754dec(X):  Converts IEEE754-binary number X into IEEE754-decimal.
In other words it takes a "double" input and returns a "blob" output.

ieee754bin(Y):  Converts IEEE754-decimal blob Y and converts it into
IEEE754-binary.

Both routines are approximate because most IEEE754-binary values do
not have an exact equivalent IEEE754-decimal representation and vice
versa.  Your UDFs would need to find something very close.

Given these routines, you could then index your IEEE754-decimal
columns by doing an index on an expression using the new iee754bin()
function.
--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: How to index data based on custom comparisons?

Lifepillar
In reply to this post by Simon Slavin-3
On 13/12/2017 22:20, Simon Slavin wrote:

> On 13 Dec 2017, at 8:34pm, Lifepillar <[hidden email]> wrote:
>
>> But, (correct me if
>> I am wrong), if I index the blob column directly, comparisons are
>> based on memcpy(), which in my case is not what I want. Is it
>> possible to create an index that somehow uses a custom comparison
>> function instead? E.g., I have a deccmp(x,y) function that returns
>> -1 if x<y, 0 if x=y, and 1 if x>y. Can I define an index based on
>> that?
>
> As Dr H wrote, it can’t be done.  Either store a normalised (numeric) version of the number, or store both the BLOB and a normalised version.

Thanks for the suggestion. Storing a normalized version side by side
with the unnormalized number would work, but it would double the used
space.  Storing just the normalized number would lose information, though.

Life.

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: How to index data based on custom comparisons?

Lifepillar
In reply to this post by Richard Hipp-3
On 14/12/2017 13:14, Richard Hipp wrote:
> On 12/14/17, Lifepillar <[hidden email]> wrote:
>
>> I am not familiar with virtual tables yet, but I see that they are used,
>> for example, to implement Rtree indexes. Would it be feasible to
>> implement my own index structure as a virtual table and use it to index
>> a blob column in a standard table (or even just in the virtual table
>> itself)?
>
> That would be complicated.

So, it is possible :)

> A different idea.  Suppose you have two new UDFs:
>
> ieee754dec(X):  Converts IEEE754-binary number X into IEEE754-decimal.
> In other words it takes a "double" input and returns a "blob" output.
>
> ieee754bin(Y):  Converts IEEE754-decimal blob Y and converts it into
> IEEE754-binary.
>
> Both routines are approximate because most IEEE754-binary values do
> not have an exact equivalent IEEE754-decimal representation and vice
> versa.  Your UDFs would need to find something very close.
>
> Given these routines, you could then index your IEEE754-decimal
> columns by doing an index on an expression using the new iee754bin()
> function.

Thanks, that's another possibility to consider, although one typically
uses decimal values when exactness is needed; using your scheme
requires some care, I think, for example not to miss matches because of
approximations or to filter spurious matches away.

Somehow, I wish SQLite had something like PostgreSQL GiST indexes... ;)

Life.

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Re: How to index data based on custom comparisons?

Hick Gunter
Select <wanted_fields> from blob_index idx cross join data_table dt on (idx.rowid = dt.rowid) where <index_conditions>;

Assuming that the rowid of the blob_index is generated from and identical to the rowid of the data table

-----Ursprüngliche Nachricht-----
Von: sqlite-users [mailto:[hidden email]] Im Auftrag von Lifepillar
Gesendet: Donnerstag, 14. Dezember 2017 13:52
An: [hidden email]
Betreff: [EXTERNAL] Re: [sqlite] How to index data based on custom comparisons?

On 14/12/2017 13:14, Richard Hipp wrote:
> On 12/14/17, Lifepillar <[hidden email]> wrote:
>
>> I am not familiar with virtual tables yet, but I see that they are
>> used, for example, to implement Rtree indexes. Would it be feasible
>> to implement my own index structure as a virtual table and use it to
>> index a blob column in a standard table (or even just in the virtual
>> table itself)?
>
> That would be complicated.

So, it is possible :)

> A different idea.  Suppose you have two new UDFs:
>
> ieee754dec(X):  Converts IEEE754-binary number X into IEEE754-decimal.
> In other words it takes a "double" input and returns a "blob" output.
>
> ieee754bin(Y):  Converts IEEE754-decimal blob Y and converts it into
> IEEE754-binary.
>
> Both routines are approximate because most IEEE754-binary values do
> not have an exact equivalent IEEE754-decimal representation and vice
> versa.  Your UDFs would need to find something very close.
>
> Given these routines, you could then index your IEEE754-decimal
> columns by doing an index on an expression using the new iee754bin()
> function.

Thanks, that's another possibility to consider, although one typically uses decimal values when exactness is needed; using your scheme requires some care, I think, for example not to miss matches because of approximations or to filter spurious matches away.

Somehow, I wish SQLite had something like PostgreSQL GiST indexes... ;)

Life.

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___________________________________________
 Gunter Hick | Software Engineer | Scientific Games International GmbH | Klitschgasse 2-4, A-1130 Vienna | FN 157284 a, HG Wien, DVR: 0430013 | (O) +43 1 80100 - 0

May be privileged. May be confidential. Please delete if not the addressee.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users