JSON1: queries on object keys

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

JSON1: queries on object keys

wmertens
Hi amazing list,

what would be the best way to answer these, given `CREATE TABLE foo(id
TEXT, json JSON);` and json is always a json object:

   - all rows with a given key bar
      - SELECT * FROM foo WHERE json_extract(json, '$.bar') IS NOT NULL;
   - all rows where there are only any of the given keys a,b in the object
      - SELECT * FROM foo WHERE json_remove(json,'$.a','$.b') ='{}';
   - all rows where there are all the given keys a,b and no others in the
   object
      - SELECT * FROM foo WHERE json_remove(json,'$.a','$.b') ='{}' and
      json_extract(json, '$.a') IS NOT NULL and json_extract(json,
'$.b') IS NOT
      NULL;

these queries seem pretty onerous to me, I hope there are better ways…

Wout.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JSON1: queries on object keys

Dominique Devienne
On Tue, Mar 26, 2019 at 3:35 PM Wout Mertens <[hidden email]> wrote:

> Hi amazing list,
>
> what would be the best way to answer these, given `CREATE TABLE foo(id
> TEXT, json JSON);` and json is always a json object:
>
>    - all rows with a given key bar
>       - SELECT * FROM foo WHERE json_extract(json, '$.bar') IS NOT NULL;
>    - all rows where there are only any of the given keys a,b in the object
>       - SELECT * FROM foo WHERE json_remove(json,'$.a','$.b') ='{}';
>    - all rows where there are all the given keys a,b and no others in the
>    object
>       - SELECT * FROM foo WHERE json_remove(json,'$.a','$.b') ='{}' and
>       json_extract(json, '$.a') IS NOT NULL and json_extract(json,
> '$.b') IS NOT
>       NULL;
>
> these queries seem pretty onerous to me, I hope there are better ways…
>

Maybe using https://www.sqlite.org/json1.html#jtype might be faster?

I think it's possible that when you use several json_ functions, notably in
your WHERE
clause, the document is reparsed several times. I haven't looked at the
detail for a while, but I think
internally JSON1 can work on an intermediate binary representation that's
more efficient,
but only when call-chaining (depends on subtype APIs I seem to remember),
which doesn't help here.
Unfortunately it doesn't work with ephemeral tables AFAIK, when using a CTE
for example, to force the parsing
once, and then process the different json_ function off that more efficient
representation. --DD
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JSON1: queries on object keys

Warren Young
In reply to this post by wmertens
On Mar 26, 2019, at 8:35 AM, Wout Mertens <[hidden email]> wrote:

>
> what would be the best way to answer these, given `CREATE TABLE foo(id
> TEXT, json JSON);` and json is always a json object:
>
>   - all rows with a given key bar
>      - SELECT * FROM foo WHERE json_extract(json, '$.bar') IS NOT NULL;
>   - all rows where there are only any of the given keys a,b in the object
>      - SELECT * FROM foo WHERE json_remove(json,'$.a','$.b') ='{}';
>   - all rows where there are all the given keys a,b and no others in the
>   object
>      - SELECT * FROM foo WHERE json_remove(json,'$.a','$.b') ='{}' and
>      json_extract(json, '$.a') IS NOT NULL and json_extract(json,
> '$.b') IS NOT
>      NULL;

The biggest problem with these isn’t the use of JSON per se, it’s that each one is unindexed, so they’ll all be a full table scan.  That makes your use of SQLite here scarcely better than storing all the data in a single JSON object, stored in a simple text file, querying it with something like jq.

If we’re talking about hundreds or a few thousand records, the parsing cost probably doesn’t matter, but if we’re talking millions or billions of records, it’s a pretty expensive way to go.

You should be able to create indices on expressions like “json_extract(json, '$.bar')”, but you’ll want to test it to find out whether the space it takes is worth the speed increase.

You could also think about partially denormalizing the data: on record insertion, pre-parse the JSON and extract the key fields as SQLite fields.

Even better would be to simply choose not to store all of your data in JSON form: keep the key fields in normal SQLite fields.  Then you can do indexed lookups on the actual records you require and do JSON extraction only on those records for each query.

> these queries seem pretty onerous to me

SQLite’s JSON extension is pretty quick.  A recent test I did here did a full table scan with JSON parsing on each of ~10000 records in about 80ms.

On the one hand, that’s impressively fast, but on the other, it’s up in human-scale time, so I ended up caching the query results for a few minutes to avoid making this expensive query too often.

Keep in mind that there’s nothing you can’t encode in plain old SQLite, in principle.  JSON is a luxury, not an absolute requirement.  Transforming the data to be more relational than hierarchical could be a very worthwhile investment, if you have enough records or a low enough latency budget.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JSON1: queries on object keys

Jens Alfke-2


> On Mar 26, 2019, at 10:41 AM, Warren Young <[hidden email]> wrote:
>
> The biggest problem with these isn’t the use of JSON per se, it’s that each one is unindexed, so they’ll all be a full table scan.

Well, we don’t know that; the question was just about whether those are reasonable queries, not whether they can be indexed.

> You should be able to create indices on expressions like “json_extract(json, '$.bar')”, but you’ll want to test it to find out whether the space it takes is worth the speed increase.

The index takes exactly the same amount of space as it would if ‘bar’ were a regular SQL column.

> Even better would be to simply choose not to store all of your data in JSON form

It depends on the use case. The nice thing about storing data as JSON is that there’s total flexibility of the schema of that data; you can change the way the data is structured without having to alter the database. (Except to add new indexes as needed for performance.)

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JSON1: queries on object keys

wmertens
To answer all emails in this thread:


   - I don't see how json_type can help, I want to query the keys of objects
   - Schemaless data is really nice to work with, the wrapper I use does
   allow putting parts of the JSON object into real columns but changing the
   production db schema all the time isn't nice
   - I suppose I was hoping for some performance discussion of the queries,
   perhaps how to implement it using json_each?
   - I'm thinking it would be nice if the JSON1 extension had a function to
   extract object keys as an array. I suppose something like `SELECT
   json_group_array(key) FROM foo,json_each(j) GROUP BY foo.rowid ORDER BY
   key;` is silly…
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JSON1: queries on object keys

Warren Young
On Mar 28, 2019, at 4:15 AM, Wout Mertens <[hidden email]> wrote:
>
>   - I don't see how json_type can help

I don’t see “json_type” in this thread at all, other than this message.

>   - Schemaless data is really nice to work with

Sure, but it has a cost.  Unless you’re willing to give us a *lot* more information, you’ll have to decide if you’re willing and able to pay it, given your application constraints.

By “more information,” I mean to a level equivalent to “hire one of us as a consultant on your project.”  We’d need full schema info, number of rows, queries per second stats, time-to-answer budgets, representative sample data…

> the wrapper I use does
>   allow putting parts of the JSON object into real columns but changing the
>   production db schema all the time isn't nice

You only have to change the DB schema each time you discover something new you want to index.  If you don’t even know yet what you need to index, how can you expect us to tell you, especially given how thin the information you’ve provided is?

>   - I suppose I was hoping for some performance discussion of the queries,

I gave you performance information based on my data, in my schema, with my queries.  You’ve given us your queries but no data and a faux schema, so naturally no one’s dissected your queries’ performance.

Despite Jens’ objection, I’ll stand by my observation that since you don’t show any indices, we must assume that your queries are full-table scans, which in this case involves re-parsing each JSON object along the way.

>   perhaps how to implement it using json_each?

How would that solve any performance problem?  It’s still a full-table scan, lacking an index.

I guess this is coming from the common belief that it’s always faster to put the code in the database query, as opposed to doing it in the application code, but that’s only true when the DB has more information than you do so it can skip work, or because doing the processing at the DB level avoids one or more copies.  I’m not seeing that those apply here.

“Put it in the database” can also avoid a lot of IPC overhead when using a client-server DB, but that cost isn’t one that happens with plain SQLite.

>   - I'm thinking it would be nice if the JSON1 extension had a function to
>   extract object keys as an array.

If you don’t even know what keys you need to operate on until you see what’s available in each record, I’d say most of your processing should be at the application code level anyway.  And in that case, I’d tell you to just pull the JSON data as a string, parse it in your program, and iterate over it as necessary.

SQL is meant for declarative queries, where you say “I need thus-and-so data,” which you can specify precisely.  It sounds like you cannot specify that query precisely, so it should probably be done with application logic.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: JSON1: queries on object keys

wmertens
Hmmm right, valid points.

The queries I'm doing are on slices of data that are preselected using
indices, and then right now I'm post-filtering them in the application, and
I was just wondering if I could already do better filtering on the db side
before paying the serialization costs.

In fact, I'm not facing performance issues right now and I'm more idly
musing in order to better know the tools at my disposal.

I might also be lobbying for a JSON1 function that extracts keys from an
object, if that would make sense.

Wout.


On Thu, Mar 28, 2019 at 7:50 PM Warren Young <[hidden email]> wrote:

> On Mar 28, 2019, at 4:15 AM, Wout Mertens <[hidden email]> wrote:
> >
> >   - I don't see how json_type can help
>
> I don’t see “json_type” in this thread at all, other than this message.
>
> >   - Schemaless data is really nice to work with
>
> Sure, but it has a cost.  Unless you’re willing to give us a *lot* more
> information, you’ll have to decide if you’re willing and able to pay it,
> given your application constraints.
>
> By “more information,” I mean to a level equivalent to “hire one of us as
> a consultant on your project.”  We’d need full schema info, number of rows,
> queries per second stats, time-to-answer budgets, representative sample
> data…
>
> > the wrapper I use does
> >   allow putting parts of the JSON object into real columns but changing
> the
> >   production db schema all the time isn't nice
>
> You only have to change the DB schema each time you discover something new
> you want to index.  If you don’t even know yet what you need to index, how
> can you expect us to tell you, especially given how thin the information
> you’ve provided is?
>
> >   - I suppose I was hoping for some performance discussion of the
> queries,
>
> I gave you performance information based on my data, in my schema, with my
> queries.  You’ve given us your queries but no data and a faux schema, so
> naturally no one’s dissected your queries’ performance.
>
> Despite Jens’ objection, I’ll stand by my observation that since you don’t
> show any indices, we must assume that your queries are full-table scans,
> which in this case involves re-parsing each JSON object along the way.
>
> >   perhaps how to implement it using json_each?
>
> How would that solve any performance problem?  It’s still a full-table
> scan, lacking an index.
>
> I guess this is coming from the common belief that it’s always faster to
> put the code in the database query, as opposed to doing it in the
> application code, but that’s only true when the DB has more information
> than you do so it can skip work, or because doing the processing at the DB
> level avoids one or more copies.  I’m not seeing that those apply here.
>
> “Put it in the database” can also avoid a lot of IPC overhead when using a
> client-server DB, but that cost isn’t one that happens with plain SQLite.
>
> >   - I'm thinking it would be nice if the JSON1 extension had a function
> to
> >   extract object keys as an array.
>
> If you don’t even know what keys you need to operate on until you see
> what’s available in each record, I’d say most of your processing should be
> at the application code level anyway.  And in that case, I’d tell you to
> just pull the JSON data as a string, parse it in your program, and iterate
> over it as necessary.
>
> SQL is meant for declarative queries, where you say “I need thus-and-so
> data,” which you can specify precisely.  It sounds like you cannot specify
> that query precisely, so it should probably be done with application logic.
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users