Memoization in sqlite json1 functions


Memoization in sqlite json1 functions

Domingo Alvarez Duarte
Hello Richard !

I noticed that sqlite does not use any memoization in the json1 functions.

For example, jsonExtractFunc and others parse the json string every time
they are called, even when the json string is the same.

minimal example: "select json_extract(json, '$.name') name,
json_extract(json, '$.address') address from some_table;"

Would it be possible to have some kind of memoization as a general option
for any sqlite function?
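The duplicated parse in the example above can be reproduced with any recent SQLite build that includes the json1 functions; a minimal sketch using Python's bundled sqlite3 (assuming its SQLite was compiled with JSON support, as modern builds are):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (json TEXT)")
conn.execute(
    """INSERT INTO some_table VALUES ('{"name": "alice", "address": "42 Main St"}')"""
)

# Each json_extract() call below re-parses the same JSON text for the row.
row = conn.execute(
    "SELECT json_extract(json, '$.name')    AS name, "
    "       json_extract(json, '$.address') AS address "
    "FROM some_table"
).fetchone()
print(row)  # ('alice', '42 Main St')
```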

Cheers !


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: Memoization in sqlite json1 functions

Deon Brewis
That would be nice.

I've resorted to a few horrible hacks like this:

SELECT parse(data), extract("name"), extract("address"), release(data) FROM some_table;

It works, but it relies on left-to-right evaluation of arguments (which SQLite does today, but which I seriously doubt is guaranteed), as well as on global/thread-local variables.

Though I can still live with that one - it works.
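A hedged sketch of what such a hack might look like: parse()/extract()/release() here are hypothetical user-defined functions (not part of SQLite), with a module-level global standing in for the thread-local state. As noted, it depends on left-to-right evaluation of the result columns, which is not a documented guarantee:

```python
import json
import sqlite3

# Hypothetical UDFs in the spirit of the hack above: parse() stashes the
# decoded document in a global, extract() reads from it, release() clears it.
_current_doc = None

def parse_udf(text):
    global _current_doc
    _current_doc = json.loads(text)  # one parse per row

def extract_udf(key):
    return _current_doc.get(key) if _current_doc is not None else None

def release_udf(text):
    global _current_doc
    _current_doc = None

conn = sqlite3.connect(":memory:")
conn.create_function("parse", 1, parse_udf)
conn.create_function("extract", 1, extract_udf)
conn.create_function("release", 1, release_udf)

conn.execute("CREATE TABLE some_table (data TEXT)")
conn.execute(
    """INSERT INTO some_table VALUES ('{"name": "alice", "address": "42 Main St"}')"""
)

# Relies on the columns being evaluated left to right.
row = conn.execute(
    "SELECT parse(data), extract('name'), extract('address'), release(data) "
    "FROM some_table"
).fetchone()
print(row[1], row[2])
```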

What I would like to see - is if you have an indexed expression like so:
create index some_index on some_table( json_extract(json, '$.name') );

And you run:
select json_extract(json, '$.name') from some_table indexed by some_index;

That it returns the resultant value that is already stored in the index, rather than re-executing the expression.
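The index itself can be set up today; whether the planner reuses the stored expression value rather than recomputing it depends on the SQLite version (the release notes describe such an optimization for indexed expressions), so inspecting the query plan is the way to find out what your build actually does. A sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (json TEXT)")
conn.execute("""INSERT INTO some_table VALUES ('{"name": "alice"}')""")

# Expression indexes require a deterministic expression; json_extract qualifies.
conn.execute(
    "CREATE INDEX some_index ON some_table( json_extract(json, '$.name') )"
)

# Check whether the planner satisfies the query from the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT json_extract(json, '$.name') FROM some_table INDEXED BY some_index"
).fetchall()
print(plan)

val = conn.execute(
    "SELECT json_extract(json, '$.name') FROM some_table INDEXED BY some_index"
).fetchone()[0]
```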
 
- Deon


Re: Memoization in sqlite json1 functions

Richard Hipp-3
In reply to this post by Domingo Alvarez Duarte
On 3/22/17, Domingo Alvarez Duarte <[hidden email]> wrote:

> Hello Richard !
>
> I noticed that sqlite do not use any memoization in json1 functions.
>
> For example jsonExtractFunc and others parse the json string every time
> it's called even when the json string is the same.
>
> minimal example : "select json_extract(json, '$.name') name,
> json_extract(json, '$.address') name from some_table;"
>
> Could be possible to have some kind of memoization as a general option
> for any sqlite functions ?'

In paragraph 3.0 of https://www.sqlite.org/json1.html:  "All json1
functions currently throw an error if any of their arguments are BLOBs
because BLOBs are reserved for a future enhancement in which BLOBs
will store the binary encoding for JSON."

But let me ask this:  Have you actually measured a performance
problem?  Or are you just assuming that because the implementation
parses the JSON anew each time it sees it, it must therefore be
inefficient?

--
D. Richard Hipp
[hidden email]

Re: Memoization in sqlite json1 functions

Domingo Alvarez Duarte
Hello Richard !

Yes, I do see "because the implementation parses the JSON anew each
time", and this is a recurring pattern in some sqlite functions. It would
be nice if we could have a "session/query/row" storage space for query
information that can be reused - for example, the json functions could
reuse an already-parsed json field several times. Another example: we
could have session/query/row variables.

We could have something like "sqlite3_set_auxdata" but with
row/query/session granularity, and as a bonus it would be nice to have
session variables like https://dev.mysql.com/doc/refman/5.7/en/user-variables.html .
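sqlite3_set_auxdata itself is a C-level API, but the effect described here can be approximated from a wrapper language: a custom extract function that keys a one-entry cache on the JSON text, so repeated extracts over the same document parse it once per row. A sketch, assuming a made-up my_extract UDF (not part of json1):

```python
import json
import sqlite3

_cache_key = None   # last JSON text seen
_cache_doc = None   # its parsed form
parse_count = 0     # how many real parses happened

def my_extract(text, key):
    """json_extract-like UDF that memoizes the most recent parse."""
    global _cache_key, _cache_doc, parse_count
    if text != _cache_key:
        _cache_doc = json.loads(text)
        _cache_key = text
        parse_count += 1
    return _cache_doc.get(key)

conn = sqlite3.connect(":memory:")
conn.create_function("my_extract", 2, my_extract)
conn.execute("CREATE TABLE some_table (json TEXT)")
conn.execute(
    """INSERT INTO some_table VALUES ('{"name": "alice", "address": "42 Main St"}')"""
)

# Two extracts over the same document; only the first one parses.
row = conn.execute(
    "SELECT my_extract(json, 'name'), my_extract(json, 'address') FROM some_table"
).fetchone()
print(row, parse_count)  # ('alice', '42 Main St') 1
```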

Cheers !


Re: Memoization in sqlite json1 functions

Scott Hess
What is the goal, though?  Your app knows your data and performance needs,
so if you find yourself running the same query to read off the same result
set over and over, change your app to do the right thing.

If it's somehow more convenient to have SQLite do it, populate a temporary
table and pull the data from that, which doesn't require any new API at all.
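The temp-table approach sketched out (table and column names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE some_table (json TEXT)")
conn.execute(
    """INSERT INTO some_table VALUES ('{"name": "alice", "address": "42 Main St"}')"""
)

# Extract once into a temp table, then query plain columns as often as
# needed; the JSON is parsed a bounded number of times and no new API
# is required.
conn.execute("""
    CREATE TEMP TABLE extracted AS
    SELECT json_extract(json, '$.name')    AS name,
           json_extract(json, '$.address') AS address
    FROM some_table
""")
row = conn.execute("SELECT name, address FROM extracted").fetchone()
print(row)  # ('alice', '42 Main St')
```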

-scott



Re: Memoization in sqlite json1 functions

Jens Alfke-2
In reply to this post by Richard Hipp-3

> On Mar 23, 2017, at 4:30 AM, Richard Hipp <[hidden email]> wrote:
>
> BLOBs are reserved for a future enhancement in which BLOBs will store the binary encoding for JSON.

I’ve been down this road. There are a number of JSON-compatible binary encoding formats, but most of them don’t save much time, because (a) they’re not _that_ much faster to parse, (b) reading JSON tends to be dominated by allocating an object tree, not by the actual parsing, and (c) usually you have to parse the entire data even if you only want to use one piece of it [as in a query].

I ended up designing and implementing a new format, called Fleece*. Its advantage is that it doesn’t require parsing or even memory allocation. The internal structure is already an object tree, except that it uses compressed relative offsets instead of pointers. This means that internal pointers into Fleece data can be used directly as the data objects.

In my current project** we’re storing Fleece in SQLite instead of JSON, with a modified version of the json1 extension to make it accessible in queries. It works very well. The Fleece equivalent of json_extract( ) just does some pointer manipulation to find the root object in the blob, then further lookups to jump to each nested object in the path.
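A toy illustration of the offset idea (this is not Fleece's actual encoding, just a sketch of the principle): values serialized with a trailing index of fixed-width offsets, so a reader can jump straight to one field without parsing or allocating anything else:

```python
import struct

# Toy flat encoding: [value bytes...][(offset, length) per field...][field count]
def encode(fields):
    body = b""
    index = b""
    for value in fields:
        data = value.encode("utf-8")
        index += struct.pack("<II", len(body), len(data))
        body += data
    return body + index + struct.pack("<I", len(fields))

def get_field(buf, i):
    # Jump straight to field i via the trailing index; no full parse needed.
    (count,) = struct.unpack_from("<I", buf, len(buf) - 4)
    index_start = len(buf) - 4 - 8 * count
    off, length = struct.unpack_from("<II", buf, index_start + 8 * i)
    return buf[off:off + length].decode("utf-8")

buf = encode(["alice", "42 Main St"])
print(get_field(buf, 1))  # 42 Main St
```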

—Jens

* https://github.com/couchbaselabs/fleece <https://github.com/couchbaselabs/fleece>
** https://github.com/couchbase/couchbase-lite-core

Re: Memoization in sqlite json1 functions

Domingo Alvarez Duarte
Hello Jens !

Nice to know this project, I'll look at it.

Cheers !



Re: Memoization in sqlite json1 functions

Deon Brewis
In reply to this post by Jens Alfke-2
It has more to do with how you parse JSON - if you want to build it into an object tree, sure, then you're obviously dead in the water with any kind of JSON.

If, however, you can use a forward-only push or pull parser like SAX or StAX, it's a different story. I'm using a StAX-like pull parser for a binary json-ish internal format we have, and reading & parsing through it is on par with the performance of reading equivalent SQLITE columns directly (apart from the expression indexed covered value lookup scenario... grumble... grumble...).

So binary-json like formats can perform well - you just can't use an object tree to parse them.

That obviously implies that if you do random access into a structure you have to keep reparsing it (which is where memoization would be nice). However, CPU caches are better at reading continuous data streams in forward-only fashion than they are at chasing pointers, so forward-only pull parsers, even when you have to repeat the entire parse, are often faster than the math behind it suggests. (In the way that scanning an unsorted vector in O(n) is oftentimes faster than searching in O(log n) through a map.)

Besides, in 99% of cases my users take the outcome from a JSON parse and just store the results into a C++ data structure anyway. In that case the intermediary object tree is just a throwaway, and you may as well have built the C++ structure up using a pull or push parser. It's very little extra work, and it's way... way... faster.

- Deon


Re: Memoization in sqlite json1 functions

Jens Alfke-2

> On Mar 23, 2017, at 3:17 PM, Deon Brewis <[hidden email]> wrote:
>
> If you however can use a forward-only push or pull parser like a SAX or StAX parse, it's a different story. I'm using a StAX-like pull parser for a binary json-ish internal format we have, and reading & parsing through it is on par with the performance of reading equivalent SQLITE columns directly

I agree that’s a lot faster, but you’re still looking at O(n) lookup time in an array or dictionary. And the proportionality constant gets worse the bigger the document is, since jumping to the next item involves parsing through all of the nested items in that collection.

> That obviously implies if you do random-access into a structure you have to keep reparsing it (which is where Memoization would be nice). However, CPU caches are better at reading continues data streams in forward-only fashion than they are with pointers, so forward-only pull parsers, even when you have to repeat the entire parse, are often faster than the math behind it suggests.

It could be; my knowledge of optimization gets tenuous when it comes to down-to-the-metal areas like CPU caching. But for large data, you run the risk of blowing out the cache traversing it. And if the data is memory-mapped, it becomes hugely faster to skip right to the relevant page instead of faulting in every page ahead of it.

> Besides, in 99% of cases my users take the outcome from a json parse and just store the results into a C++ data structure anyway. In that case the intermediary object tree is just a throwaway and you may as well have built the C++ structure up using a pull or push parser.

In Fleece I put a lot of effort into making the C++ API nice to use, so that I don’t have to have any other data structure. That's worked well so far.

—Jens


Re: Memoization in sqlite json1 functions

Deon Brewis
> It could be; my knowledge of optimization gets tenuous when it comes to down-to-the-metal areas like CPU caching. But for large data, you run the risk of blowing out the cache traversing it. And if the data is memory-mapped, it becomes hugely faster to skip right to the relevant page instead of faulting in every page ahead of it.

Yeah ok, if you take I/O hits then things like memory pre-fetching make zero difference. We're more in the business of "You take a page fault" == "You buy more memory". Different level of performance requirements. (And glad that SQLITE works well for both of us.)


> In Fleece I put a lot of effort into making the C++ API nice to use, so that I don’t have to have any other data structure. That's worked well so far.

Strong typing?


- Deon


Re: Memoization in sqlite json1 functions

Jens Alfke-2

> On Mar 24, 2017, at 4:48 PM, Deon Brewis <[hidden email]> wrote:
>
> Yeah ok, if you take I/O hits then things like memory pre-fetching makes zero difference. We're more in the business of "You take a page fault" == "You buy more memory". Different level of performance requirements. (And glad that SQLITE works well for both of us).

Page faults aren’t necessarily due to swapping. If you read a memory-mapped file, the first access to any page will fault to disk. Since SQLite supports memory-mapping, I’d assume this can occur during a query — the column data returned by sqlite may point into mmap’ed pages. (Is that correct?)

In that situation, if a blob value is significantly larger than one page, the difference between scanning everything up to byte offset x, vs. reading a few bytes at the beginning and then jumping directly to x, can be significant.

>> In Fleece I put a lot of effort into making the C++ API nice to use, so that I don’t have to have any other data structure. That's worked well so far.
>
> Strong typing?

That only happens at higher levels in our stack. At the level I mostly work on, this is a document-oriented, schemaless data store. Up above there’s the option to use a data-modeling layer that binds document properties to platform object properties with strong typing (though we don’t support C++.) But even those bindings can be considerably more efficient than the typical JSON object tree — you’re not allocating string objects for dictionary keys, nor number objects for numeric/boolean values. It’s also generally faster to populate instance variables of objects (usually at fixed offsets) vs. adding key/value pairs to a hash table.

[I think we’re getting off topic, but I’ll be happy to continue offline.]

—Jens

Re: Memoization in sqlite json1 functions

Deon Brewis
> Page faults aren’t necessarily due to swapping. If you read a memory-mapped file, the first access to any page will fault to disk. Since SQLite supports memory-mapping, I’d assume this can occur during a query — the column data returned by sqlite may point into mmap’ed pages. (Is that correct?)

> In that situation, if a blob value is significantly larger than one page, the difference between scanning everything up to byte offset x, vs. reading a few bytes at the beginning and then jumping directly to x, can be significant.

I don't think SQLITE will return a pointer directly into an mmap page offset, and it definitely can't in the case of a column spilled into an overflow page - the data will be discontiguous.

Even something like the blob APIs don't help: SQLITE overflow pages link to each other at the beginning of each page, so you can't e.g. load the 5th page without also loading pages 1 through 4. Since SQLITE pages are generally smaller than disk pages, that means faulting in all the disk pages on the way.

Anyway, by faulting I meant 'paging back out due to memory pressure'. Sorry, should have been more clear - knew that one was going to bite me as soon as I hit send...

How do you take a thread offline here?

- Deon
