Partial indexes on JSON properties?

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Partial indexes on JSON properties?

Jens Alfke-2
I’m experimenting with querying databases of JSON documents. These data-sets are schemaless and there’s no guarantee that they all have a common set of properties; in fact it’s common for them to have the equivalent of multiple ‘tables’ in the same data-set, i.e. groups of documents with distinct sets of common properties.

Unindexed queries just using SELECT with json_extract() are faster than I expected, and of course they become blazing fast when I create indexes on those json_extract() calls. Great stuff!

Here’s my issue/question: It’s common for a JSON property to exist in only a subset of the rows/documents. This seems like a perfect use case for partial indexes … except that the WHERE clause in a CREATE INDEX statement explicitly disallows function calls, so I can’t constrain the index to only the rows that contain the JSON property. Is this limitation something that might be lifted soon (now that functions can be identified as ‘pure’), or is it somehow a direct consequence of the way partial indexes work?

To give an example: Let’s say the data-set contains JSON documents describing both students and courses. There are 20,000 students but only 200 courses being taught. Courses have a “professor” property. I would like to query by professor, but only 1% of the rows contain that property, so unless I constrain the index it’s going to be 99% full of useless null values. I’d like to say:
        CREATE INDEX profs ON dataset (json_extract(doc, ‘$.professor’)) WHERE json_extract(doc, ‘$.professor’) IS NOT NULL
but SQLite rejects this because of the function call in the WHERE clause.

(Now, maybe I’m prematurely optimizing, and in reality all those null values don’t have much overhead, or are suppressed entirely; I don’t know.)

I can think of some workarounds for this, but they require changing the SQL schema, for instance adding a boolean ‘has_professor’ column that's accessible when indexing. This can be awkward because I’m writing general-purpose code that doesn’t know ahead of time what sort of JSON data-sets might be loaded into it, or what queries might be run. So it would have to alter the tables on the fly as the caller makes queries or requests indexing.

Any advice gratefully appreciated. (I’ve looked through the list archives to see if this has come up before, but I couldn’t find any threads.)

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Partial indexes on JSON properties?

Clemens Ladisch
Jens Alfke wrote:
> the WHERE clause in a CREATE INDEX statement explicitly disallows
> function calls, so I can’t constrain the index to only the rows that
> contain the JSON property. Is this limitation something that might be
> lifted soon (now that functions can be identified as ‘pure’), or is it
> somehow a direct consequence of the way partial indexes work?

Changing the function in any way (including not registering the
function) would essentially corrupt the index.

> CREATE INDEX profs ON dataset (json_extract(doc, ‘$.professor’)) WHERE json_extract(doc, ‘$.professor’) IS NOT NULL
>
> (Now, maybe I’m prematurely optimizing, and in reality all those null
> values don’t have much overhead

If only 10% of your rows have a professor, the first 90% of the index
entries will be NULL.  This is probably not a performance problem when
searching in a B-tree (you can test this with a table with 'real'
columns).  Whether the storage overhead matters is something you have
to decide yourself.


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Partial indexes on JSON properties?

Jens Alfke-2

> On Oct 2, 2016, at 6:20 AM, Clemens Ladisch <[hidden email]> wrote:
>
> Changing the function in any way (including not registering the
> function) would essentially corrupt the index.

Well, the same can be said of using a custom collation function, which has been supported since 3.0; or of using a function in the expression being indexed, which is supported since 3.9.

- “Changing the function” is covered by the SQLITE_DETERMINISTIC flag, which is a promise that the function is pure, i.e. always produces the same output given the same input. If that guarantee were to fail, you’d need to drop and re-create the index.
- “Not registering the function” is handled by SQLite — any query/command that would involve calling an unregistered function will instead return an error. (I already run into this in the current version of my library, which uses a custom collator, when I try to query databases with the `sqlite3` tool.)

So I don’t see this as a reason not to support (deterministic) functions in the WHERE clause.

> If only 10% of your rows have a professor, the first 90% of the index
> entries will be NULL.  This is probably not a performance problem when
> searching in a B-tree (you can test this with a table with 'real'
> columns).  Whether the storage overhead matters is something you have
> to decide yourself.

It’s 1%, not 10%, in my example; but in reality I can’t predict or control what types of data sets might be used by customers.

I don’t know the details of SQLite’s b-tree, but I’d guess that the root node would have one child pointer for ‘null’ values, with the rest being non-null, so there’s little overhead in lookup time. As for storage overhead, hopefully if the key is null and the table has a rowid, the node would be small, on the order of 10 bytes or so … ?

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Partial indexes on JSON properties?

Clemens Ladisch
Jens Alfke wrote:
> if the key is null and the table has a rowid, the node would be small, on the order of 10 bytes or so … ?

Typically less than 10 bytes.


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Partial indexes on JSON properties?

Richard Hipp-3
In reply to this post by Jens Alfke-2
On 10/1/16, Jens Alfke <[hidden email]> wrote:
> the WHERE clause in a CREATE INDEX statement
> explicitly disallows function calls.... Is this limitation something that
> might be lifted soon

Deterministic SQL functions are now allowed in partial index WHERE
clauses, as of a few minutes ago. The current "Prerelease Snapshot"
(https://www.sqlite.org/download.html) supports this capability.
Please try it out and report any problems.  Thanks.


--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Partial indexes on JSON properties?

Jens Alfke-2

> On Oct 3, 2016, at 11:29 AM, Richard Hipp <[hidden email]> wrote:
>
> Deterministic SQL functions are now allowed in partial index WHERE
> clauses, as of a few minutes ago. The current "Prerelease Snapshot"
> (https://www.sqlite.org/download.html <https://www.sqlite.org/download.html>) supports this capability.
> Please try it out and report any problems.  Thanks.

Thanks! I was next going to ask whether this is feasible to implement, and where I should file a feature request, but this is even better :)

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Partial indexes on JSON properties?

Keith Medcalf
In reply to this post by Richard Hipp-3

On Monday, 3 October, 2016 12:30, Richard Hipp wrote:
> On 10/1/16, Jens Alfke <[hidden email]> wrote:
> > the WHERE clause in a CREATE INDEX statement
> > explicitly disallows function calls.... Is this limitation something
> > that might be lifted soon
 
> Deterministic SQL functions are now allowed in partial index WHERE
> clauses, as of a few minutes ago. The current "Prerelease Snapshot"
> (https://www.sqlite.org/download.html) supports this capability.
> Please try it out and report any problems.  Thanks.

This raises another question.  Is there any way to mark a function in-between volatile and deterministic?  Currently if the deterministic flag is not set the optimizer assumes that the function is truly volatile (it is called every reference, even for duplicated arguments).  What about functions that are deterministic (enough) for a statement/transaction but are not deterministic enough to be used in an index?

An example would be a function which calls the Windows CheckTokenMembership function.  This function returns whether or current process token contains the provided SID (group/user).  This (and other like it) functions are deterministic enough for the query optimizer to treat as deterministic, but volatile enough that they should not be used to build anything persistent (such as in a conditional index).





_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Partial indexes on JSON properties?

Richard Hipp-3
On 10/4/16, Keith Medcalf <[hidden email]> wrote:

> This raises another question.  Is there any way to mark a function
> in-between volatile and deterministic?  Currently if the deterministic flag
> is not set the optimizer assumes that the function is truly volatile (it is
> called every reference, even for duplicated arguments).  What about
> functions that are deterministic (enough) for a statement/transaction but
> are not deterministic enough to be used in an index?
>
> An example would be a function which calls the Windows CheckTokenMembership
> function.  This function returns whether or current process token contains
> the provided SID (group/user).  This (and other like it) functions are
> deterministic enough for the query optimizer to treat as deterministic, but
> volatile enough that they should not be used to build anything persistent
> (such as in a conditional index).
>

Another example is sqlite_version() which is always the same for any
database connection, but might change when you upgrade your
sqlite3.dll, and so cannot be used in an index expression or in the
WHERE clause of a partial index.

Such functions are internally identified as "Slow Change" functions
using the internal flag SQLITE_FUNC_SLOCHNG.  But that flag cannot
currently be set using the public API.

--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users