Re: Feature request: more robust handling of invalid UTF-16 data


Re: Feature request: more robust handling of invalid UTF-16 data

Dennis Snell
I’d like to raise this issue again and give my support for what Maks Verver recommended in https://www.mail-archive.com/sqlite-users@.../msg110107.html


Independently, I came to this bug while working on an issue in Simplenote’s Android app where our data was being corrupted when saved to sqlite inside the Android SDK. We received some invalid UTF-16 sequences, and instead of rejecting them or decoding them properly, sqlite further mangles them and introduces more corruption.


Example:
We have a JSON document like this which we store in a table.


    {"content": "\ud83c\udd70\ud83c(null)\udd71","tags":[]}


The JSON is well-formed but the sequence of UTF-16 code units is invalid. We have fixed our side of the equation so that we no longer create this content, but we still receive the invalid sequence from time to time from older client libraries.
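

For reference, here is that "content" value spelled out as UTF-16 code units (a sketch added for illustration, not part of the stored data), marking which subsequences are well-formed:


    /* The "content" value as UTF-16 code units (illustration only). */
    static const unsigned short content_utf16[] = {
      0xD83C, 0xDD70,                      /* well-formed surrogate pair -> U+1F170 */
      0xD83C,                              /* unpaired high (lead) surrogate        */
      0x0028, 0x006E, 0x0075, 0x006C,      /* "(nul" -- well-formed ASCII units     */
      0x006C, 0x0029,                      /* "l)"                                  */
      0xDD71                               /* unpaired low (trail) surrogate        */
    };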


When sqlite reads this data, two types of further corruption occur: reading beyond a code unit subsequence, and conflating high and low surrogates.


Reading beyond a code unit subsequence:


When the `TERM` check was introduced[1] and updated[2] it appears to have been designed for the case where a string ends mid-surrogate, but it does not attempt to address unpaired surrogates in the middle of an input text. In our case the `READ_UTF16BE` macro accepts the second `\ud83c` code unit and then consumes the following `\u0028`, which is the separate and well-formed `"("`. In turn this produces the more corrupted value `\ud83c\udc28`, code point U+1F028, plus `"null)"` without the leading `"("`.


Conflating high and low surrogates:


The `READ_UTF16__` macros both attempt to start processing surrogate pairs based on a `0xD800 <= c < 0xE000` range test on the input code unit. Because of this they will pick up on unpaired low surrogates, consume the next code unit, and then create a more corrupted Unicode string.
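

For context, the surrogate handling in the shipped macros looks roughly like this (my reconstruction of the LE variant, for illustration only and not a verbatim copy of src/utf.c; the BE variant differs only in byte order). The only guard before consuming two more bytes is that range test plus the `TERM` end-of-input check; nothing confirms that the first code unit is a high surrogate or that the following one is a low surrogate:


    #define READ_UTF16LE(zIn, TERM, c){                                        \
        c = (*zIn++);                                                          \
        c += ((*zIn++)<<8);                                                    \
        if( c>=0xD800 && c<0xE000 && TERM ){                                   \
            int c2 = (*zIn++);                                                 \
            c2 += ((*zIn++)<<8);                                               \
            c = (c2&0x03FF) + ((c&0x003F)<<10) + (((c&0x03C0)+0x0040)<<10);    \
        }                                                                      \
    }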


In our case, once we reach the `\udd71` the macro consumes the following quote, which in the JSON document closes the string, and puts them together as `\udd71\u0022`, producing the spurious code point U+6C422. Moreover, because it consumed the string-closing quote it also corrupted the entire JSON document; the new output resembles the following:


    {"content": "\ud83c\udd70\ud83c\udc28ull)\ud971\udc22,"tags":[]}


That is, we write this invalid Unicode sequence but valid JSON document into sqlite and read back an invalid Unicode sequence _and_ invalid JSON (see the missing quote before "tags").
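

To make the two corrupted values concrete, here is a small standalone program (added for illustration; not part of the original report) that applies the same pairing arithmetic the macros use and reproduces both U+1F028 and U+6C422:


    #include <stdio.h>

    /* The surrogate-pairing arithmetic used by the READ_UTF16 macros. */
    static unsigned pairSurrogates(unsigned c, unsigned c2){
      return (c2&0x03FF) + ((c&0x003F)<<10) + (((c&0x03C0)+0x0040)<<10);
    }

    int main(void){
      /* Unpaired high surrogate 0xD83C followed by '(' (0x0028). */
      printf("U+%X\n", pairSurrogates(0xD83C, 0x0028));  /* prints U+1F028 */
      /* Unpaired low surrogate 0xDD71 followed by '"' (0x0022). */
      printf("U+%X\n", pairSurrogates(0xDD71, 0x0022));  /* prints U+6C422 */
      return 0;
    }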


Supporting Unicode spec:


The Unicode specification[3], in sections 3.2 and 3.9, speaks to this situation and provides a specific comparable example:


    When a process interprets a code unit sequence which purports to be in a Unicode
    character encoding form, it shall treat ill-formed code unit sequences as an error
    condition and shall not interpret such sequences as characters.


    Furthermore, such a process must not treat any adjacent well-formed code unit
    sequences as being part of those ill-formed code unit sequences.


    For example, with the input UTF-8 code unit sequence <C2 41 42>, such a UTF-8
    conversion process must not return <U+FFFD> or <U+FFFD, U+0042>, because
    either of those outputs would be the result of interpreting a well-formed subsequence
    as being part of the ill-formed subsequence. The expected return value for such a
    process would instead be <U+FFFD, U+0041, U+0042>.


Supporting Maks’ suggestion to use the replacement character on error, section 23.8[4] provides this guidance:


    It [U+FFFD] can be substituted for any “unknown” character in another encoding that
    cannot be mapped in terms of known Unicode characters. It can also be used as one
    means of indicating a conversion error, when encountering an ill-formed sequence in
    a conversion between Unicode encoding forms.


Patching:


The `READ_UTF16__` macros thus should not only do what Maks proposed, which is to verify that the code unit following a surrogate half is also a surrogate half, but also verify that we don’t start interpreting a surrogate sequence when encountering an unpaired low surrogate. I propose this change instead:


    #define READ_UTF16LE(zIn, TERM, c){                                        \
        c = (*zIn++);                                                          \
        c += ((*zIn++)<<8);                                                    \
        if( c>=0xDC00 && c<=0xE000 && TERM ) {                                 \
            c = 0xFFFD;                                                        \
        } else if( c>=0xD800 && TERM ){                                        \
            int c2 = (zIn[0] | (zIn[1] << 8));                                 \
            if ( c2>=0xDC00 && c2<0xE000) {                                    \
                zIn += 2;                                                      \
                c = (c2&0x03FF) + ((c&0x003F)<<10) + (((c&0x03C0)+0x0040)<<10);\
            } else {                                                           \
                c = 0xFFFD;                                                    \
            }                                                                  \
        }                                                                      \
    }



This will solve both the problem of reading past an ill-formed surrogate sequence and that of interpreting an ill-formed surrogate sequence. I love sqlite and I really appreciate how the code is laid out, which made it so easy to find this macro in the source and identify the problem.


Dennis Snell
Automattic, Inc.


[1]: https://sqlite.org/src/info/19064d7cea

[2]: https://sqlite.org/src/info/a96b4e8c01d167d3

[3]: https://www.unicode.org/versions/Unicode12.0.0/ch03.pdf

[4]: https://www.unicode.org/versions/Unicode12.0.0/ch23.pdf







Re: Feature request: more robust handling of invalid UTF-16 data

Detlef Golze
I want to second that. This leads to the situation that data is accepted by the database but there is no way to read that data back, or more precisely, I get the wrong (i.e. different) data back. I didn't check the suggested patch, but I don't believe it will work in all cases. I'd rather prefer rejecting such strings or implicitly converting them to a BLOB, which at least provides a way to get the data back.

Thanks,
Detlef.


Re: Feature request: more robust handling of invalid UTF-16 data

Richard Hipp-3
In reply to this post by Dennis Snell
On 1/13/20, Dennis Snell <[hidden email]> wrote:
> We have a JSON document like this which we store in a table.
>
>     {"content": "\ud83c\udd70\ud83c(null)\udd71","tags":[]}
>
>
> The JSON is well-formed but the sequence of UTF-16 code points is invalid.
>
> When sqlite reads this data two types of further corruption

I'm having trouble reproducing this.  The following test script (one
of many) illustrates:

CREATE TABLE t1(j TEXT);
INSERT INTO t1(j) VALUES
    ('{"content": "\ud83c\udd70\ud83c(null)\udd71","tags":[]}');
SELECT length(json_extract(j,'$.content')) FROM t1;
WITH RECURSIVE c(x) AS (VALUES(1) UNION ALL SELECT x+1 FROM c WHERE x<9)
SELECT x, printf('%x',unicode(substr(json_extract(j,'$.content'),x)))
  FROM t1, c;

The column T1.J is loaded up with your original JSON with the invalid
code points. Then I run json_extract() to pull out the invalid string,
but SQLite says that the length is 9, which I think is the correct
answer.  The second SELECT with the CTE in it loops over each
character and prints out the HEX value for that character.  Here is
what I see:

1|1f170
2|fffd
3|28
4|6e
5|75
6|6c
7|6c
8|29
9|fffd

So the initial surrogate pair was rendered correctly as 0x1f170.  The
\ud83c without a following low surrogate was converted into 0xfffd
(which is the right thing to do, is it not?).  Then the 6 ASCII
characters follow.  Finally, the last isolated low surrogate is
(correctly?) converted into 0xfffd.

What behavior were you expecting?

Is there something that I can be doing differently to make it misbehave?

--
D. Richard Hipp
[hidden email]

Re: Feature request: more robust handling of invalid UTF-16 data

Richard Hipp-3
On 1/14/20, Richard Hipp <[hidden email]> wrote:
> I'm having trouble reproducing this.

I went back to version 3.30.1 and I was able to reproduce it.  So I
bisected and found the following:

https://sqlite.org/src/timeline?c=51027f08c0478f1b

--
D. Richard Hipp
[hidden email]

Re: Feature request: more robust handling of invalid UTF-16 data

Maks Verver
*Richard:* the issue with the JSON extension seems unrelated to the issue
that I reported originally, which relates to the SQLite C API
(specifically, the sqlite3_bind_text16() and sqlite3_bind_text16()
functions). My issue is still not fixed.

I've expanded my original sample code to make it easier to run the test and
reproduce the problems:
https://gist.github.com/maksverver/c3d5da8a0a9f2ec1c2a225209f290e13

Let me know if you need more help understanding/reproducing the problem.

*Dennis:* thanks for raising the issue again, and for digging through the
Unicode standard to confirm the most reasonable behavior.

I think your proposed patch is not quite correct. I think I spot two
problems. One:

      if( c>=0xDC00 && c<=0xE000 && TERM ) {

.. here, you should drop the `&& TERM`, otherwise you'll fail to replace an
unpaired low surrogate when it occurs at the end of the string. Also, you should
check c<0xE000 because 0xE000 is a valid character by itself. Two:

    } else if( c>=0xD800 && TERM ){

Here, you should also check c<0xDC00 (or c<0xE000), otherwise you'll
misinterpret valid characters with code points 0xE000 and above as part of
a surrogate pair.
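
For concreteness, here is Dennis's proposed macro with both of those changes
applied (an illustrative sketch only):

    #define READ_UTF16LE(zIn, TERM, c){                                        \
        c = (*zIn++);                                                          \
        c += ((*zIn++)<<8);                                                    \
        if( c>=0xDC00 && c<0xE000 ) {                                          \
            c = 0xFFFD;            /* unpaired low surrogate */                \
        } else if( c>=0xD800 && c<0xDC00 && TERM ){                            \
            int c2 = (zIn[0] | (zIn[1] << 8));                                 \
            if ( c2>=0xDC00 && c2<0xE000) {                                    \
                zIn += 2;                                                      \
                c = (c2&0x03FF) + ((c&0x003F)<<10) + (((c&0x03C0)+0x0040)<<10);\
            } else {                                                           \
                c = 0xFFFD;        /* lone high surrogate */                   \
            }                                                                  \
        }                                                                      \
    }

Note that a lone high surrogate at the very end of the input (where TERM is
false) still passes through unchanged in this sketch.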

I believe my original patch handles these and all other cases correctly.

  - Maks.