more efficient JSON encoding: idle musing

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

more efficient JSON encoding: idle musing

wmertens
Hi,

I use SQLite as a MaybeSQL store, mixing fixed columns with schemaless JSON
columns. It's really great.

In JavaScript, objects are key-value collections with unique keys, where
the order of the keys is important. Most JSVMs store them as a pointer to a
layout and then the values. The layout lists the keys in order (and
possibly the types, to get byte-perfect layouts). If you delete a key from
an object, it generates a new layout.

I was wondering if the JSON extension could not do the same thing: for each
table, keep a hidden stash of object layouts, and store the values as
sqlite primitives. (you'd be able to disable this, in case the layouts
rarely repeat)

This way:

   - it's more space efficient, since they keys are only stored once per
   layout
   - loading is faster, because the values are stored as their primitive
   type
   - querying can go faster

Queries can go faster, because a query like `where json_extract(json,
'$.foo') = 'bar'` can first check the layouts to see which ones apply,
allowing to skip other layouts, and then quickly find the value to test
thanks to the binary encoding

You could also allow an optimization that makes key order unimportant,
reducing the number of layouts.

So, smaller and faster. Thoughts?

Wout.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: more efficient JSON encoding: idle musing

Warren Young
On Feb 21, 2020, at 5:20 AM, Wout Mertens <[hidden email]> wrote:
>
> In JavaScript, objects are key-value collections with unique keys, where
> the order of the keys is important.

ECMAScript §13.7.5.15 (2019 edition) says, "The mechanics and order of enumerating the properties is not specified but must conform to the rules specified below.”

You can read the rules if you like, but it doesn’t say that every JavaScript give the same ordering:

    https://www.ecma-international.org/ecma-262/10.0/index.html#sec-enumerate-object-properties

The JSON spec then says, “The JSON syntax...does not assign any significance to the ordering of name/value pairs."

    http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf

> Most JSVMs

I’m going to take a wild guess that “Most” here means “a whole bunch of different browsers and server-side JS stacks all using V8.”

> Queries can go faster, because a query like `where json_extract(json,
> '$.foo') = 'bar'` can first check the layouts to see which ones apply,

SQLite’s JSON1 extension is a storage and query mechanism, not a run-time object system. I se that things like json_remove() exist, but my assumption is that 99.manynines% of the time, objects are stored, retrieved, and queried without being modified at the SQLite level, if at all.

Therefore, 99.manynines% of the time, there is only one “layout.”

To the extent that that is true, I don’t see how the rest of your proposal matters.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: more efficient JSON encoding: idle musing

wmertens
On Fri, Feb 21, 2020 at 2:37 PM Warren Young <[hidden email]> wrote:
>
> On Feb 21, 2020, at 5:20 AM, Wout Mertens <[hidden email]> wrote:
> > Queries can go faster, because a query like `where json_extract(json,
> > '$.foo') = 'bar'` can first check the layouts to see which ones apply,
>
> SQLite’s JSON1 extension is a storage and query mechanism, not a run-time object system. I se that things like json_remove() exist, but my assumption is that 99.manynines% of the time, objects are stored, retrieved, and queried without being modified at the SQLite level, if at all.
>
> Therefore, 99.manynines% of the time, there is only one “layout.”

Hmm, that is not what I meant. The idea is that upon storing the JSON
data, the JSON1 extension parses it, extracts the layouts recursively,
stores them when they are not known yet, and then only stores the
values in the binary format with the layout identifiers.

So yes, somewhat more work than storing and retrieving a plain string.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: more efficient JSON encoding: idle musing

Richard Hipp-3
On 2/21/20, Wout Mertens <[hidden email]> wrote:
> The idea is that upon storing the JSON
> data, the JSON1 extension parses it, extracts the layouts recursively,
> stores them when they are not known yet, and then only stores the
> values in the binary format with the layout identifiers.

I experimented with a number of similar ideas for storing JSON when I
was first designing the JSON components for SQLite.  I was never able
to find anything that was as fast or as compact as just storing the
original JSON text.  But I could have overlooked something.  If you
have example code for a mechanism that is more space efficient and/or
faster, please share it with us.

--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: more efficient JSON encoding: idle musing

wmertens
On Fri, Feb 21, 2020 at 3:03 PM Richard Hipp <[hidden email]> wrote:
> If you
> have example code for a mechanism that is more space efficient and/or
> faster, please share it with us.

I'll see if I can prototype something in JS - I'd be keeping the
layouts in a helper table, and I wouldn't be storing the values in
binary but it'd be a starting point.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: more efficient JSON encoding: idle musing

Carl Edquist
In reply to this post by Richard Hipp-3
> If you have example code for a mechanism that is more space efficient
> and/or faster, please share it with us.

"Bencode" is approximately the same space-wise as JSON, but
encoding/decoding is potentially faster since it doesn't have to do any
escaping for strings:

  https://en.wikipedia.org/wiki/Bencode


On Fri, 21 Feb 2020, Richard Hipp wrote:

> On 2/21/20, Wout Mertens <[hidden email]> wrote:
>> The idea is that upon storing the JSON
>> data, the JSON1 extension parses it, extracts the layouts recursively,
>> stores them when they are not known yet, and then only stores the
>> values in the binary format with the layout identifiers.
>
> I experimented with a number of similar ideas for storing JSON when I
> was first designing the JSON components for SQLite.  I was never able
> to find anything that was as fast or as compact as just storing the
> original JSON text.  But I could have overlooked something.  If you
> have example code for a mechanism that is more space efficient and/or
> faster, please share it with us.
>
> --
> D. Richard Hipp
> [hidden email]
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: more efficient JSON encoding: idle musing

Jens Alfke-2
In reply to this post by wmertens


> On Feb 21, 2020, at 4:20 AM, Wout Mertens <[hidden email]> wrote:
>
> In JavaScript, objects are key-value collections with unique keys, where the order of the keys is important.

JSON is not JavaScript. The order of keys is NOT significant in JSON, and many, many JSON implementations parse JSON objects into data structures* that do not preserve key order. Clients using such implementations cannot determine the original order of keys, nor can they re-serialize the object with the same ordering.

I have run into multiple issues over time with systems that assume JSON key ordering is significant and preserved (CouchDB and Scuttlebutt come to mind), which then later have to redesign protocols or data schema because of interoperability problems this causes.

—Jens

* The platform's standard "dictionary" / "map" class is typically implemented as a hash table or binary tree, as in Java, .NET, Objective-C, Swift, Go, Lua, C++. (I'm unsure about Python and Ruby.) A few platforms besides JS keep an auxiliary key array to preserve ordering. Erlang uses a linked-list, which is ordered.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: more efficient JSON encoding: idle musing

Jens Alfke-2
In reply to this post by wmertens


> On Feb 21, 2020, at 4:20 AM, Wout Mertens <[hidden email]> wrote:
>
> I was wondering if the JSON extension could not do the same thing: for each
> table, keep a hidden stash of object layouts, and store the values as
> sqlite primitives. (you'd be able to disable this, in case the layouts
> rarely repeat)

We do something a bit like this in Couchbase Lite[1]. We don't directly store JSON, we store a binary data format we designed called Fleece, which has the same semantics but is faster to traverse and more compact.

https://github.com/couchbaselabs/fleece

In Fleece a JSON object is represented as an array of fixed-width {key, value} pairs.
- The keys are typically 16-bit integers that map into a global (per-database) table of common key strings.
- The values are either inline if they fit (small ints, boolean, null) or else offsets to where the real data is.
The array is sorted by key, so lookup is a binary search, usually on 16-bit ints.

Numbers are stored in a compact variable-width encoding. Strings are unescaped UTF-8, and multiple occurrences of the same string are only stored once.

(I've recently discovered that this is similar to FlexBuffers. Fleece has been around since 2015, so I don't know if the designers of FlexBuffers looked at it…)

I've thought about the 'layouts' concept but haven't tried implementing it yet. (Most JavaScript runtimes use something like it, btw.) It would only save about 4*n bytes per record, where n is the average number of keys in a layout, and a small number of clock cycles per key lookup.

> On Feb 21, 2020, at 6:03 AM, Richard Hipp <[hidden email]> wrote:
>
> I experimented with a number of similar ideas for storing JSON when I
> was first designing the JSON components for SQLite.  I was never able
> to find anything that was as fast or as compact as just storing the
> original JSON text.

I had similar experiences with BSON, Bencode, etc. In my use case I found that the major expense wasn't parsing, rather allocating an object tree. So I designed the Fleece format to be fully useable in-place without requiring parsing. A secondary factor is that traversing objects and arrays is expensive because the items are variable width, so you have to parse through all the (potentially nested) items before the one you're looking for. That's why Fleece objects and arrays are fixed-width, using offsets to point to variable-size values.

—Jens

[1] https://www.couchbase.com/products/mobile
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: more efficient JSON encoding: idle musing

Martin Raiber
In reply to this post by Richard Hipp-3
On 21.02.2020 15:03 Richard Hipp wrote:

> On 2/21/20, Wout Mertens <[hidden email]> wrote:
>> The idea is that upon storing the JSON
>> data, the JSON1 extension parses it, extracts the layouts recursively,
>> stores them when they are not known yet, and then only stores the
>> values in the binary format with the layout identifiers.
> I experimented with a number of similar ideas for storing JSON when I
> was first designing the JSON components for SQLite.  I was never able
> to find anything that was as fast or as compact as just storing the
> original JSON text.  But I could have overlooked something.  If you
> have example code for a mechanism that is more space efficient and/or
> faster, please share it with us.

If you want to be as space efficient as possible, look into succinct
data structures. The balanced parenthesis representation of a tree can
be stored in a bit vector (each node 2 bits) and there are (succinct)
index structures to query that efficiently.

See e.g. https://core.ac.uk/download/pdf/81941172.pdf and
https://github.com/simongog/sdsl-lite/blob/master/include/sdsl/bp_support_g.hpp
as an implementation. Making this work with JSON would be a lot of work,
though, I guess.

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: more efficient JSON encoding: idle musing

J Decker
In reply to this post by Richard Hipp-3
On Fri, Feb 21, 2020 at 6:03 AM Richard Hipp <[hidden email]> wrote:

> On 2/21/20, Wout Mertens <[hidden email]> wrote:
> > The idea is that upon storing the JSON
> > data, the JSON1 extension parses it, extracts the layouts recursively,
> > stores them when they are not known yet, and then only stores the
> > values in the binary format with the layout identifiers.
>
> I experimented with a number of similar ideas for storing JSON when I
> was first designing the JSON components for SQLite.  I was never able
> to find anything that was as fast or as compact as just storing the
> original JSON text.  But I could have overlooked something.  If you
> have example code for a mechanism that is more space efficient and/or
> faster, please share it with us.
>

text is as long as text is, and numbers, for small ranges, are also
compressed to 2 bytes (one for a separator, or opener, and 1 for the value)
gets you 0-9 (0-64 if you base64 encode it)... looking at just the data
part of JSON.  You end up with a lot of overhead from the repeated field
name definition.

I created a format
https://github.com/d3x0r/jsox#jsox--javascript-object-exchange-format that
is compatible with existing JSON, but adds the ability to specify 'class'
definitions.  There's a specification of the grammar in bnf format, and
pictures... It tracks the current parsing state, 0, initial being called
'unknown'.  If a string is found in an unknown state, followed by an
object, then that defines a class type... 'record{id,title,author,data}'
then after than, a second occurrence in the unknown state, or within an
array or object context, '[record{1342,"book","person",1}]'  would use the
existing list of names in order with the values, and build an object that
was { id:1342,title:"book",author:"person",data:1 }.   'record' could be
shortened to any single unicode character, otherwise the saving isn't so
great.
The definition of 'string' is sort of loose in JSOX, as long as there isn't
a format control character ( whitespace, ':', '{', '}', '[', ']' ) you
don't need quotes around a sequence of characters to make a string;
excepting of course starting with characters that look like a number,
and/or match a keyword...
The triggering of the mode is '{' after a string, or while collecting a
string

I also extended the number format of JSON to allow specifying ISO-8601
times as numbers.... (just have to special case in addition to '.'; ':' 'T'
'Z' '-' (inline and not just at start)).

other than that; if space is really a concern, maybe a zip layer?

J


> --
> D. Richard Hipp
> [hidden email]
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: more efficient JSON encoding: idle musing

Jens Alfke-2


> On Feb 25, 2020, at 6:12 AM, J Decker <[hidden email]> wrote:
>
> other than that; if space is really a concern, maybe a zip layer?

In my experience, the concern is more about speed than size. Given the raw string/blob data from a SQLite column, and a specific property name/path, how fast can you find its value, convert it to a format SQLite's query engine understand, and return it?

The JSON1 extension's parser seems pretty darn fast, but given the nature of JSON, it has to do a bunch of scanning that's O(n) with the length of the data. It's generally impossible to find a value in the middle of a JSON string without examining every single byte that came before it.

The way to go faster is to use formats that let you jump through the structure faster. For example, tokenizing dictionary keys and encoding an object's keys as a fixed-width sorted list lets you look up a key in O(log n) time, and the constant factor is very small because you're comparing integers not strings.

I don't know if extracting 'classes' will help much in a query. The data will be smaller, and it makes it extremely fast to look up values if you know the class ahead of time, but in a query you don't know the class. Compared to what I described above, there's an extra step where you have to look up the class description from the object.

I also worry about use cases where the number of 'classes' becomes unwieldy, because the schema might be using a huge set of keys. For example, something where a customer order contains an object that maps SKUs to quantities, like {"A73563M1": 3, "A73522M0": 7, …}. And there are tens of thousands of SKUs.

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: more efficient JSON encoding: idle musing

Paul van Helden
In reply to this post by Richard Hipp-3
>
>
> I experimented with a number of similar ideas for storing JSON when I
> was first designing the JSON components for SQLite.  I was never able
> to find anything that was as fast or as compact as just storing the
> original JSON text.
>

 I've also done a lot of experiments and was surprised at how little a
binary encoding saves in space. Also tried with lookups for keys, but the
lookup values quickly become close to the size of the keys (if not larger)
if keys are mostly shortish.

I'd be happy with a JSON5-like ability to have the quotes on keys optional
if they contain no spaces and no special characters. Seems to reduce the
data size quite significantly.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users