Indexing multiple values per row

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Indexing multiple values per row

Jens Alfke-2
I'm wondering what the best way is to efficiently search for data values that can appear multiple times in a table row. SQLite indexes, even expression indexes, don't directly work for this because they obviously only index one value per row. Traditional relational-database design says to normalize the schema by storing the multiple values in separate rows, but what if the data to be indexed is JSON and you need to leave it in that form?

For example, let's say I have a table like
        patient_id: 12345
        temps: "[98.6, 99.1, 101.3, 100.0, 98.9]"
and I want to run queries on temperature data, like 'patients who've had a temperature above 101'. And I need better than O(n) performance.

In the past my project used map/reduce to support this, essentially implementing its own index system on top of SQLite tables. In this case it would create a table (patient_id integer primary key, temp number) and populate it by scanning the patient table. This can obviously be indexed easily, but updating the table before a query when the source table has changed is a pain in the butt.

I believe a SQL "materialized view" would do what I want, but SQLite doesn't support those; its views seem to be just syntactic sugar or macros around SELECT queries.

Other than that, my best idea so far is to simplify the map/reduce updating by adding triggers on the source table that will add & remove rows from the index table.

Is there any other clever way I could do this? (I've been using SQLite for 14 years and I still keep learning about more clever things it can do…)

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Indexing multiple values per row

Jay Kreibich

> On Mar 15, 2018, at 12:33 PM, Jens Alfke <[hidden email]> wrote:
>
> I'm wondering what the best way is to efficiently search for data values that can appear multiple times in a table row. SQLite indexes, even expression indexes, don't directly work for this because they obviously only index one value per row. Traditional relational-database design says to normalize the schema by storing the multiple values in separate rows, but what if the data to be indexed is JSON and you need to leave it in that form?
>
> For example, let's say I have a table like
> patient_id: 12345
> temps: "[98.6, 99.1, 101.3, 100.0, 98.9]"
> and I want to run queries on temperature data, like 'patients who've had a temperature above 101'. And I need better than O(n) performance.

Recognize the fact that if you’re storing data in a JSON string, to the database that is just one single value: a string.  The database has no knowledge and understanding of that value beyond the fact it is a string.  Asking the database to index or do something with the arbitrarily constructed sub-values you’ve created is outside the scope of what the database can do, because you’re storing data in a format outside the scope of the database.

If you do want to use database functions and queries to deal with this kind of thing, store the data in a way the database understands it and can use it… for example, a table that includes “patient_id, timestamp, temp” with “temp” being a SINGLE numeric value.  The patient_id column can be a foreign key back the full patient record.  Such a design is easier to insert, update, and just about everything else.


> In the past my project used map/reduce to support this, essentially implementing its own index system on top of SQLite tables. In this case it would create a table (patient_id integer primary key, temp number) and populate it by scanning the patient table. This can obviously be indexed easily, but updating the table before a query when the source table has changed is a pain in the butt.

Then get rid of the array and just store the values that way for everything.  Don’t have a “source table.”  It’s a very non-relational way to store data anyways.

  -j


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Indexing multiple values per row

Jens Alfke-2


> On Mar 15, 2018, at 11:27 AM, Jay Kreibich <[hidden email]> wrote:
>
> Recognize the fact that if you’re storing data in a JSON string, to the database that is just one single value: a string.

I am very well aware of that. What _I_ (and my co-workers) are implementing is a higher-level database layer* that _does_ store its data as JSON.

>  The database has no knowledge and understanding of that value beyond the fact it is a string.  Asking the database to index or do something with the arbitrarily constructed sub-values you’ve created is outside the scope of what the database can do, because you’re storing data in a format outside the scope of the database.

You may not be aware of SQLite's JSON1 extension? It does enable exactly this — you can easily index a property of a JSON value, for example, by creating an index on a json_extract() expression that retrieves a property from the JSON column.

> If you do want to use database functions and queries to deal with this kind of thing, store the data in a way the database understands it and can use it… for example, a table that includes “patient_id, timestamp, temp” with “temp” being a SINGLE numeric value.

Couchbase's users/customers do not store their data this way; they store it as JSON. Making this data queryable on mobile devices via SQLite is my job :) It’s working quite well, but it currently lacks the ability to do fast (indexed) searches over the contents of arrays and objects. Doing so will require some kind of transformation like what you’re talking about, but the JSON is still the source of truth, so the transformed tables need to mirror that.

> Don’t have a “source table.”  It’s a very non-relational way to store data anyways.

You say that like it’s a bad thing … Non-relational databases** are rather a big deal currently. Implementing one atop a relational database does sound a bit odd, but it actually makes a lot of sense for our use case.

—Jens

* https://www.couchbase.com/products/mobile
** https://en.wikipedia.org/wiki/NoSQL
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Indexing multiple values per row

Hick Gunter
In reply to this post by Jens Alfke-2
So in your application you don't care about when a temperature was measured? Creating a table temps (patient_id, timestamp, temperature) with an index on (timestamp,temperature) would yield a fast way to access patients with elevated temperatures within a time frame.

Other than that, using triggers is probably the easiest way to go.

-----Ursprüngliche Nachricht-----
Von: sqlite-users [mailto:[hidden email]] Im Auftrag von Jens Alfke
Gesendet: Donnerstag, 15. März 2018 18:34
An: SQLite mailing list <[hidden email]>
Betreff: [EXTERNAL] [sqlite] Indexing multiple values per row

I'm wondering what the best way is to efficiently search for data values that can appear multiple times in a table row. SQLite indexes, even expression indexes, don't directly work for this because they obviously only index one value per row. Traditional relational-database design says to normalize the schema by storing the multiple values in separate rows, but what if the data to be indexed is JSON and you need to leave it in that form?

For example, let's say I have a table like
        patient_id: 12345
        temps: "[98.6, 99.1, 101.3, 100.0, 98.9]"
and I want to run queries on temperature data, like 'patients who've had a temperature above 101'. And I need better than O(n) performance.

In the past my project used map/reduce to support this, essentially implementing its own index system on top of SQLite tables. In this case it would create a table (patient_id integer primary key, temp number) and populate it by scanning the patient table. This can obviously be indexed easily, but updating the table before a query when the source table has changed is a pain in the butt.

I believe a SQL "materialized view" would do what I want, but SQLite doesn't support those; its views seem to be just syntactic sugar or macros around SELECT queries.

Other than that, my best idea so far is to simplify the map/reduce updating by adding triggers on the source table that will add & remove rows from the index table.

Is there any other clever way I could do this? (I've been using SQLite for 14 years and I still keep learning about more clever things it can do…)

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


___________________________________________
 Gunter Hick | Software Engineer | Scientific Games International GmbH | Klitschgasse 2-4, A-1130 Vienna | FN 157284 a, HG Wien, DVR: 0430013 | (O) +43 1 80100 - 0

May be privileged. May be confidential. Please delete if not the addressee.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Indexing multiple values per row

wmertens
I have the same problem but luckily O(n) performance is fast enough for me
right now.

It seems to me that the only real option is to maintain a derived table,
the table could even be R*TREE to allow range queries.

Triggers seem to be the accepted way to derive tables, but I'm a little
afraid of them not always firing (sqlite versions, configuration glitches,
whatever reason), which would cause the table to be incorrect.

On Fri, Mar 16, 2018 at 9:24 AM Hick Gunter <[hidden email]> wrote:

> So in your application you don't care about when a temperature was
> measured? Creating a table temps (patient_id, timestamp, temperature) with
> an index on (timestamp,temperature) would yield a fast way to access
> patients with elevated temperatures within a time frame.
>
> Other than that, using triggers is probably the easiest way to go.
>
> -----Ursprüngliche Nachricht-----
> Von: sqlite-users [mailto:[hidden email]]
> Im Auftrag von Jens Alfke
> Gesendet: Donnerstag, 15. März 2018 18:34
> An: SQLite mailing list <[hidden email]>
> Betreff: [EXTERNAL] [sqlite] Indexing multiple values per row
>
> I'm wondering what the best way is to efficiently search for data values
> that can appear multiple times in a table row. SQLite indexes, even
> expression indexes, don't directly work for this because they obviously
> only index one value per row. Traditional relational-database design says
> to normalize the schema by storing the multiple values in separate rows,
> but what if the data to be indexed is JSON and you need to leave it in that
> form?
>
> For example, let's say I have a table like
>         patient_id: 12345
>         temps: "[98.6, 99.1, 101.3, 100.0, 98.9]"
> and I want to run queries on temperature data, like 'patients who've had a
> temperature above 101'. And I need better than O(n) performance.
>
> In the past my project used map/reduce to support this, essentially
> implementing its own index system on top of SQLite tables. In this case it
> would create a table (patient_id integer primary key, temp number) and
> populate it by scanning the patient table. This can obviously be indexed
> easily, but updating the table before a query when the source table has
> changed is a pain in the butt.
>
> I believe a SQL "materialized view" would do what I want, but SQLite
> doesn't support those; its views seem to be just syntactic sugar or macros
> around SELECT queries.
>
> Other than that, my best idea so far is to simplify the map/reduce
> updating by adding triggers on the source table that will add & remove rows
> from the index table.
>
> Is there any other clever way I could do this? (I've been using SQLite for
> 14 years and I still keep learning about more clever things it can do…)
>
> —Jens
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
>
> ___________________________________________
>  Gunter Hick | Software Engineer | Scientific Games International GmbH | Klitschgasse
> 2 <https://maps.google.com/?q=Klitschgasse+2&entry=gmail&source=g>-4,
> A-1130 Vienna | FN 157284 a, HG Wien, DVR: 0430013 | (O) +43 1 80100
> <+43%201%2080100> - 0
>
> May be privileged. May be confidential. Please delete if not the addressee.
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Indexing multiple values per row

Simon Slavin-3
On 16 Mar 2018, at 8:38am, Wout Mertens <[hidden email]> wrote:

> It seems to me that the only real option is to maintain a derived table,
> the table could even be R*TREE to allow range queries.

You can always do range queries on tables in SQLite.  Not sure what you mean by the above.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Indexing multiple values per row

wmertens
True, I mean in case there were multiple dimensions. I should write down
hidden thinking processes more.

On Fri, Mar 16, 2018 at 9:58 AM Simon Slavin <[hidden email]> wrote:

> On 16 Mar 2018, at 8:38am, Wout Mertens <[hidden email]> wrote:
>
> > It seems to me that the only real option is to maintain a derived table,
> > the table could even be R*TREE to allow range queries.
>
> You can always do range queries on tables in SQLite.  Not sure what you
> mean by the above.
>
> Simon.
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Indexing multiple values per row

Jens Alfke-2
In reply to this post by Hick Gunter


> On Mar 16, 2018, at 1:24 AM, Hick Gunter <[hidden email]> wrote:
>
> So in your application you don't care about when a temperature was measured? Creating a table temps (patient_id, timestamp, temperature) with an index on (timestamp,temperature) would yield a fast way to access patients with elevated temperatures within a time frame.

That was just an example. I work on a general-purpose database layered on SQLite, whose data/document format is JSON, and which allows arbitrary queries of that JSON. The question is how to efficiently index values that occur inside arrays (or objects), i.e. where there are multiple values to be indexed per row.

Yes, the solution involves creating a table as you described, but the table is derived from and dependent on the original JSON data, and needs to be kept up to date as that data changes. Triggers are the best way I know of to do that, but I'm open to other ideas.

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [EXTERNAL] Indexing multiple values per row

Richard Hipp-3
On 3/16/18, Jens Alfke <[hidden email]> wrote:

> Yes, the solution involves creating a table as you described, but the table
> is derived from and dependent on the original JSON data, and needs to be
> kept up to date as that data changes. Triggers are the best way I know of to
> do that, but I'm open to other ideas.

Perhaps use FTS4 or FTS5 with a custom tokenizer that understands JSON?
--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users