SIMD based JSON parsing for speeding up JSON extension

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

SIMD based JSON parsing for speeding up JSON extension

Robert M. Münch
Hi, see: https://github.com/lemire/simdjson

Can parse GB/s of JSON input. This might be a good candidate to use in the extension.

--

Robert M. Münch

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

signature.asc (891 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: SIMD based JSON parsing for speeding up JSON extension

Richard Hipp-3
On 2/25/19, Robert M. Münch <[hidden email]> wrote:
> Hi, see: https://github.com/lemire/simdjson
>
> Can parse GB/s of JSON input. This might be a good candidate to use in the
> extension.

Thanks for the link.

I downloaded one of the sample input files "gsoc-2018.json" and then
ran the following test case using the SQLite command-line shell:

.timer on
SELECT length(readfile('/home/drh/tmp/gsoc-2018.json'));
WITH RECURSIVE c(x) AS (VALUES(1) UNION ALL SELECT x+1 FROM c WHERE x<1000)
SELECT sum(json_valid(readfile('/home/drh/tmp/gsoc-2018.json'))) FROM c;

The script above first measures the length of the input json file
(3327831 bytes) then it parses the file 1000 times.  The second second
statement ran in 1.101 seconds (real time) on my 4-year-old linux
workstation, for a performance of just over 3GB/sec, which is slightly
faster than reported simdjson performance of 2.9GB/sec.

--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: SIMD based JSON parsing for speeding up JSON extension

Richard Hipp-3
On 2/25/19, Richard Hipp <[hidden email]> wrote:
> performance of just over 3GB/sec, which is slightly
> faster than reported simdjson performance of 2.9GB/sec.

Further analysis shows that SQLite was caching its parse tree, which
was distorting the measurement.  The following script adds a different
string of spaces to the end of each instance of gsoc-2019.json that is
parsed, thereby invalidating the cache.

.timer on
CREATE TEMP TABLE [$Parameters](key TEXT PRIMARY KEY,value) WITHOUT ROWID;
INSERT INTO [$Parameters](key,value)
 VALUES('$json',readfile('/home/drh/tmp/gsoc-2018.json'));
SELECT length($json);
WITH RECURSIVE c(x) AS (VALUES(1) UNION ALL SELECT x+1 FROM c WHERE x<1000)
SELECT count(json_valid($json||printf('%*c',x,' '))) FROM c;

In this case, SQLite parses JSON at 1.1GB/sec.  That is slower than
simdjson, but it is still pretty fast.  And there are other reasons to
prefer the current SQLite implementation:

(1) The SQLite code is public domain.  Simdjson is not.  We do not
want a license on SQLite that says something like "Public Domain
unless you use JSON functions, in which case the license is Apache."

(2) SQLite is written in portable C code.  It runs everywhere.
Simdjson is written in C++ and makes use of SIMD extensions that are
not universally available.

(3) Simdjson is optimized for large JSON blobs.  SQLite is optimized
for the common database case of small JSON blobs.

--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: SIMD based JSON parsing for speeding up JSON extension

wmertens
It's only a fair comparison if the simdjson code runs on the same system.
It might reach 10GB/s or 200MB/s…

Another possible concern is whether the SQLite JSON code is 100% compliant
(I don't know if this is the case). There are some hairy edge cases in JSON
(Unicode handling) that might slow down processing if they need to be
handled according to spec. Then the question becomes if that is important.

Just playing devil's advocate, those reasons given look very solid.

Wout.


On Mon, Feb 25, 2019 at 9:40 PM Richard Hipp <[hidden email]> wrote:

> On 2/25/19, Richard Hipp <[hidden email]> wrote:
> > performance of just over 3GB/sec, which is slightly
> > faster than reported simdjson performance of 2.9GB/sec.
>
> Further analysis shows that SQLite was caching its parse tree, which
> was distorting the measurement.  The following script adds a different
> string of spaces to the end of each instance of gsoc-2019.json that is
> parsed, thereby invalidating the cache.
>
> .timer on
> CREATE TEMP TABLE [$Parameters](key TEXT PRIMARY KEY,value) WITHOUT ROWID;
> INSERT INTO [$Parameters](key,value)
>  VALUES('$json',readfile('/home/drh/tmp/gsoc-2018.json'));
> SELECT length($json);
> WITH RECURSIVE c(x) AS (VALUES(1) UNION ALL SELECT x+1 FROM c WHERE x<1000)
> SELECT count(json_valid($json||printf('%*c',x,' '))) FROM c;
>
> In this case, SQLite parses JSON at 1.1GB/sec.  That is slower than
> simdjson, but it is still pretty fast.  And there are other reasons to
> prefer the current SQLite implementation:
>
> (1) The SQLite code is public domain.  Simdjson is not.  We do not
> want a license on SQLite that says something like "Public Domain
> unless you use JSON functions, in which case the license is Apache."
>
> (2) SQLite is written in portable C code.  It runs everywhere.
> Simdjson is written in C++ and makes use of SIMD extensions that are
> not universally available.
>
> (3) Simdjson is optimized for large JSON blobs.  SQLite is optimized
> for the common database case of small JSON blobs.
>
> --
> D. Richard Hipp
> [hidden email]
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: SIMD based JSON parsing for speeding up JSON extension

Andreas Kupries
> It's only a fair comparison if the simdjson code runs on the same system.
> It might reach 10GB/s or 200MB/s…
>
> Another possible concern is whether the SQLite JSON code is 100% compliant

AFAIK no known json parser is 100% compliant.

> (I don't know if this is the case). There are some hairy edge cases in JSON
> (Unicode handling) that might slow down processing if they need to be
> handled according to spec. Then the question becomes if that is important.

http://seriot.ch/parsing_json.php

is a very in-depth analysis of the json spec, its edge cases, and of
many json parsers with respect to these. The associated torture test
suite used by and written for that article can be found at

        https://github.com/nst/JSONTestSuite

> Just playing devil's advocate, those reasons given look very solid.
>
> Wout.

--
See you,
        Andreas Kupries <[hidden email]>
                        <http://core.tcl.tk/akupries/>
        Developer @ SUSE (MicroFocus Canada LLC)
                  <[hidden email]>

EuroTcl 2019, Jun 29-30, Nuernberg/DE, http://eurotcl.eu/
-------------------------------------------------------------------------------





_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users