UTF8 and NUL


UTF8 and NUL

J Decker
NUL is a valid UTF-8 character,
but FF is never valid (it would be like a 36-bit length specification),
and practically anything from F8 up is an invalid UTF-8 byte.
Other than the BOM:
https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
EF BB BF (decimal 239 187 191)

// payload bits: EF -> 0xF | BB -> 0x3B | BF -> 0x3F
// (0xF << 12) | (0x3B << 6) | 0x3F = 0xFEFF
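
The bit arithmetic in the note above can be checked mechanically. This is only a minimal sketch in Python and is not part of the original post. The helper name `decode_three_byte` is made up for illustration, and it assumes a well-formed 3-byte sequence.

```python
def decode_three_byte(b0: int, b1: int, b2: int) -> int:
    """Decode one 3-byte UTF-8 sequence into a code point.

    Assumes well-formed input: b0 looks like 1110xxxx and
    b1, b2 look like 10xxxxxx. No validation is done here.
    """
    # Lead byte carries 4 payload bits, each continuation byte carries 6.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


bom = decode_three_byte(0xEF, 0xBB, 0xBF)
print(hex(bom))                           # 0xfeff
print("\ufeff".encode("utf-8").hex(" "))  # ef bb bf
```

Going the other way, Python's built-in encoder produces the same three bytes for U+FEFF, which matches the EF BB BF listed above.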


Many Windows <https://en.wikipedia.org/wiki/Microsoft_Windows> programs
(including Windows Notepad <https://en.wikipedia.org/wiki/Notepad_(Windows)>)
add the bytes 0xEF, 0xBB, 0xBF at the start of any document saved as UTF-8.
Th

(Not that a BOM is even required, because UTF-8 is already a sequence of ordered bytes.)
----------
But anyway, FF could be used as a string terminator instead of 00.  It is
never legal in any UTF-8 sequence.
(F8,F9,FA,FB,FC,FD,FE,FF)
F8 would be a 5-byte encoding, but that is more code points than Unicode
has allocated.  It could potentially be useful to permit a little extra
space in sequences, so I would avoid F8 (F9, FA, FB) and stick to FC-FF for
possible control characters.
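
The claims above can be tried against a strict decoder. This is a hedged sketch, not anything SQLite does: `is_valid_utf8` and `sequence_length` are hypothetical helper names. The length table follows the original UTF-8 design (RFC 2279), in which F8-FB led 5-byte sequences and FC-FD led 6-byte sequences. The current definition (RFC 3629) stops at 4 bytes, so bytes F5-FF never appear in valid UTF-8 today.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def sequence_length(lead: int) -> int:
    """Sequence length implied by a lead byte in the original
    (RFC 2279) scheme; 0 means the byte cannot start a sequence."""
    if lead < 0x80:
        return 1  # ASCII, including NUL
    if lead < 0xC0:
        return 0  # continuation byte (10xxxxxx)
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    if lead < 0xF8:
        return 4
    if lead < 0xFC:
        return 5  # F8-FB: no longer allowed by RFC 3629
    if lead < 0xFE:
        return 6  # FC-FD: no longer allowed by RFC 3629
    return 0      # FE and FF were never assigned a meaning


print(is_valid_utf8(b"abc\x00def"))  # True: NUL is an ordinary character
print(is_valid_utf8(b"abc\xff"))     # False: FF is rejected
print([sequence_length(b) for b in (0x00, 0xF8, 0xFC, 0xFF)])  # [1, 5, 6, 0]
```

Under the older scheme only FE and FF were completely unused, which is worth keeping in mind when picking bytes for out-of-band markers.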
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: UTF8 and NUL

J Decker
https://en.wikipedia.org/wiki/List_of_Unicode_characters#Control_codes
Even the control codes within Unicode aren't FF.

U+009C (156) String Terminator (ST)
The literal bytes \xC2\x9C are the string terminator. I had been thinking
that codes like APC and ST were higher than that, more in the range 0xF8-0xFF.



On Thu, Jan 25, 2018 at 7:57 PM, J Decker <[hidden email]> wrote:

> NUL is a valid UTF-8 character,
> but FF is never valid (it would be like a 36-bit length specification),
> and practically anything from F8 up is an invalid UTF-8 byte.
> Other than the BOM:
> https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
> EF BB BF (decimal 239 187 191)
>
> // payload bits: EF -> 0xF | BB -> 0x3B | BF -> 0x3F
> // (0xF << 12) | (0x3B << 6) | 0x3F = 0xFEFF
>
>
> Many Windows <https://en.wikipedia.org/wiki/Microsoft_Windows> programs
> (including Windows Notepad
> <https://en.wikipedia.org/wiki/Notepad_(Windows)>) add the bytes 0xEF,
> 0xBB, 0xBF at the start of any document saved as UTF-8. Th
>
> (Not that a BOM is even required, because UTF-8 is already a sequence of
> ordered bytes.)
> ----------
> But anyway, FF could be used as a string terminator instead of 00.  It is
> never legal in any UTF-8 sequence.
> (F8,F9,FA,FB,FC,FD,FE,FF)
> F8 would be a 5-byte encoding, but that is more code points than Unicode
> has allocated.  It could potentially be useful to permit a little extra
> space in sequences, so I would avoid F8 (F9, FA, FB) and stick to FC-FF for
> possible control characters.
>

Re: UTF8 and NUL

Clemens Ladisch
J Decker wrote:
> U+009C 156 String Terminator ST

"ST is used as the closing delimiter of a control string opened by
APPLICATION PROGRAM COMMAND (APC), DEVICE CONTROL STRING (DCS),
OPERATING SYSTEM COMMAND (OSC), PRIVACY MESSAGE (PM), or START OF
STRING (SOS)."


Regards,
Clemens

Re: UTF8 and NUL

Peter da Silva
In reply to this post by J Decker
What is the goal of this discussion? Changing the string terminator SQLite uses? I think it's almost 50 years too late for that, but I'm sure that if Unicode and UTF8 had been a thing in 1970 then C would have selected FF as the string terminator.


Re: UTF8 and NUL

Gary R. Schmidt-2
On 27/01/2018 00:55, Peter Da Silva wrote:
> What is the goal of this discussion? Changing the string terminator SQLite uses? I think it's almost 50 years too late for that, but I'm sure that if Unicode and UTF8 had been a thing in 1970 then C would have selected FF as the string terminator.

But how would you differentiate EOF???  (Let me guess, 0.  :-) )

        Cheers,
                Gary B-)

Re: UTF8 and NUL

Peter da Silva
On 1/26/18, 8:24 AM, "sqlite-users on behalf of Gary R. Schmidt" <[hidden email] on behalf of [hidden email]> wrote:
> But how would you differentiate EOF???  (Let me guess, 0.  :-) )
   
End of file is not part of the contents of the file or a string. It's metadata.


Re: UTF8 and NUL

J Decker
In reply to this post by Peter da Silva
On Fri, Jan 26, 2018 at 5:55 AM, Peter Da Silva <
[hidden email]> wrote:

> What is the goal of this discussion? Changing the string terminator SQLite
> uses? I think it's almost 50 years too late for that, but I'm sure that if
> Unicode and UTF8 had been a thing in 1970 then C would have selected FF as
> the string terminator.
>

There's so much resistance to handling NUL in the command-line tools, the
tests, and the engine itself that I figured there must be a reason. Maybe the
authentication/encryption that the SQLite developers have added stores
metadata after the field content. Such content could still be kept, and
isolated from users, with an alternative string terminator. Since that
character is never returned to the user, it doesn't matter what SQLite uses
internally (other than that it previously used something else). It is
probably a change significant enough for 3.x to 4.x, though.



Re: UTF8 and NUL

Keith Medcalf
In reply to this post by Peter da Silva

Actually, EOF (0xFF) *is* part of a text file, and is the byte in an ASCII byte-stream that indicates end-of-file.  In the "old days" the bytes following the last-byte in a stream and the end of a storage block (sector/cluster/track/cylinder, what have you) were padded with 0xFF so you knew you were past the end-of-the-file when you were reading it.

It is just that "more modern" operating systems can set the file length more accurately than in the past, and "stream" processors now recognize "running out of data" as EOF.  Just because it is now thus does not mean it was always so. (And, of course, just because a "stream" has no more data to return does not necessarily mean that it is at end-of-file, merely that there is no more data to return *at the moment*; perhaps the card reader is jammed or the paper tape broke :) ).

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.

>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>[hidden email]] On Behalf Of Peter Da Silva
>Sent: Friday, 26 January, 2018 07:30
>To: SQLite mailing list
>Subject: Re: [sqlite] UTF8 and NUL
>
>On 1/26/18, 8:24 AM, "sqlite-users on behalf of Gary R. Schmidt"
><[hidden email] on behalf of
>[hidden email]> wrote:
>> But how would you differentiate EOF???  (Let me guess, 0.  :-) )
>
>End of file is not part of the contents of the file or a string. It's
>metadata.
>

Re: UTF8 and NUL

Peter da Silva
On 1/26/18, 12:12 PM, "sqlite-users on behalf of Keith Medcalf" <[hidden email] on behalf of [hidden email]> wrote:
> Actually, EOF (0xFF) *is* part of a text file, and is the byte in an ASCII byte-stream that indicates end-of-file.  In the "old days" the bytes following the last-byte in a stream and the end of a storage block (sector/cluster/track/cylinder, what have you) were padded with 0xFF so you knew you were past the end-of-the-file when you were reading it.

Oh, I remember the messes that existed before stream files became the norm. But messes they were, and there's no more reason to support them in a Unicode file than there is to support FIELDDATA format.

And if you're going to talk about the block file and paper tape era, don't forget that FF also meant a deleted character and should be skipped without being counted or accounted for.


Re: UTF8 and NUL

J Decker
On Fri, Jan 26, 2018 at 10:22 AM, Peter Da Silva <
[hidden email]> wrote:

> On 1/26/18, 12:12 PM, "sqlite-users on behalf of Keith Medcalf" <
> [hidden email] on behalf of
> [hidden email]> wrote:
> > Actually, EOF (0xFF) *is* part of a text file, and is the byte in an
> ASCII byte-stream that indicates end-of-file.  In the "old days" the bytes
> following the last-byte in a stream and the end of a storage block
> (sector/cluster/track/cylinder, what have you) were padded with 0xFF so
> you knew you were past the end-of-the-file when you were reading it.
>
> Oh, I remember the messes that existed before stream files became the
> norm. But messes they were, and there's no more reason to support them in a
> Unicode file than there is to support FIELDDATA format.
>
> And if you're going to talk about the block file and paper tape era, don't
> forget that FF also meant a deleted character and should be skipped without
> being counted or accounted for.
>
>

Ctrl-Z was the end-of-file character for text files in DOS (it wrote character 26, not FF).
EOF is returned as -1, not 0xFF (although as a signed char the two look really similar).
The character U+00FF is encoded as 0xC3 0xBF, not 0xFF.
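
Both points are easy to reproduce. The sketch below is only an illustration in Python, not C: it shows how U+00FF is encoded, and why a single byte 0xFF read as a signed 8-bit value looks like -1 while the same byte read as unsigned is 255. The variable names are made up for this example.

```python
# U+00FF (LATIN SMALL LETTER Y WITH DIAERESIS) takes two bytes in UTF-8.
encoded = "\u00ff".encode("utf-8")
print(encoded)  # b'\xc3\xbf'

# The raw byte 0xFF, interpreted two ways.
as_unsigned = int.from_bytes(b"\xff", "big", signed=False)
as_signed = int.from_bytes(b"\xff", "big", signed=True)
print(as_unsigned, as_signed)  # 255 -1

# A lone 0xFF byte is not valid UTF-8 at all.
try:
    b"\xff".decode("utf-8")
    lone_ff_decodes = True
except UnicodeDecodeError:
    lone_ff_decodes = False
print(lone_ff_decodes)  # False
```

This is the reason C code is expected to keep the result of fgetc() in an int: the byte value 255 and the out-of-band EOF value stay distinct only while the wider type is used.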




Re: UTF8 and NUL

Peter da Silva
On 1/26/18, 12:31 PM, "sqlite-users on behalf of J Decker" <[hidden email] on behalf of [hidden email]> wrote:
> ctrl-z was end of file text character in DOS (wrote char 26; not FF)

DOS wasn't an operating system.
 


Re: UTF8 and NUL

Tim Streater-3
In reply to this post by Keith Medcalf
On 26 Jan 2018, at 18:12, Keith Medcalf <[hidden email]> wrote:

> Actually, EOF (0xFF) *is* part of a text file, and is the byte in an ASCII
> byte-stream that indicates end-of-file.

First I've heard of that. Which systems did that then? EOF is normally indicated by the file system, not by file data.


--
Cheers  --  Tim

Re: UTF8 and NUL

J Decker
On Fri, Jan 26, 2018 at 10:35 AM, Tim Streater <[hidden email]> wrote:

> On 26 Jan 2018, at 18:12, Keith Medcalf <[hidden email]> wrote:
>
> > Actually, EOF (0xFF) *is* part of a text file, and is the byte in an
> ASCII
> > byte-stream that indicates end-of-file.
>
> First I've heard of that. Which systems did that then? EOF is normally
> indicated by the file system, not by file data.
>

The 't' part of fopen( "xxx", "rt" ); reads the bytes and does things with
them.  The EOF would get returned by fgetc(), but not the character.

>
> --
> Cheers  --  Tim

Re: UTF8 and NUL

Peter da Silva
On 1/26/18, 12:40 PM, "sqlite-users on behalf of J Decker" <[hidden email] on behalf of [hidden email]> wrote:
>  reads the bytes and does things with them.  the EOF would get returned with fgetc() but not the character.

fgetc() returns an int, not a byte. That EOF is -1, not 0xFF.




Re: UTF8 and NUL

J Decker
On Fri, Jan 26, 2018 at 10:44 AM, Peter Da Silva <
[hidden email]> wrote:

> On 1/26/18, 12:40 PM, "sqlite-users on behalf of J Decker" <
> [hidden email] on behalf of [hidden email]>
> wrote:
> >  reads the bytes and does things with them.  the EOF would get returned
> with fgetc() but not the character.
>
> Fgetc returns an int, not a byte. That EOF is -1, not 0xFF.
>

It doesn't return 26 (0x1A) either.


Re: UTF8 and NUL

Peter da Silva
On 1/26/18, 1:37 PM, "sqlite-users on behalf of J Decker" <[hidden email] on behalf of [hidden email]> wrote:
>    doesn't get 26 either. 0x1a

26 isn't EOF, it's SUB (substitute). It was used to represent untranslatable characters when converting (for example) EBCDIC to ASCII.


Re: UTF8 and NUL

John McKown
On Fri, Jan 26, 2018 at 1:41 PM, Peter Da Silva <[hidden email]> wrote:

> On 1/26/18, 1:37 PM, "sqlite-users on behalf of J Decker" <
> [hidden email] on behalf of [hidden email]>
> wrote:
> >    doesn't get 26 either. 0x1a
>
> 26 isn't EOF, it's SUB (substitute). It was used to represent
> untranslatable characters when converting (for example) EBCDIC to ASCII.
>

In the distant past (CP/M-80), the filesystem metadata did not include
the actual _length_ of the data for a text data file. The I/O was done in
sectors. The CP/M-80 system, by convention, used 0x1A (26) as a "logical
EOF" indication, and the C routines would detect it and report EOF. MS-DOS
basically did the same thing, for compatibility reasons. I am not sure, but
I think that Windows still does this. A quick test with the command "type
x.txt", where "x.txt" contained "abc~def" (where ~ is standing in for 0x1A),
resulted in my seeing "abc". But "notepad x.txt" shows "abc def". So I
guess it depends on how old the Windows app is.
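
The "logical EOF" convention described above can be sketched in a few lines. This is a hypothetical illustration in Python, not how any particular C runtime implements it. The function name `logical_text` is made up, and it assumes the whole file is already in memory as bytes.

```python
CTRL_Z = 0x1A  # ASCII SUB, used by CP/M and MS-DOS text mode as a logical EOF


def logical_text(data: bytes) -> bytes:
    """Return the bytes before the first Ctrl-Z, or everything if none is present.

    This mimics a text-mode reader that treats 0x1A as end-of-file;
    a binary-mode reader would return the data unchanged.
    """
    end = data.find(bytes([CTRL_Z]))
    return data if end == -1 else data[:end]


raw = b"abc\x1adef"
print(logical_text(raw))       # b'abc'  (what a text-mode reader would show)
print(raw)                     # b'abc\x1adef'  (what is really in the file)
print(logical_text(b"plain"))  # b'plain'
```

The two outputs for the same file line up with the "type" versus "notepad" difference in the test above: one reader honors the marker, the other goes by the file length recorded by the filesystem.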


--
I have a theory that it's impossible to prove anything, but I can't prove
it.

Maranatha! <><
John McKown

Re: UTF8 and NUL

Peter da Silva
On 1/26/18, 2:11 PM, "sqlite-users on behalf of John McKown" <[hidden email] on behalf of [hidden email]> wrote:
> In the distant past (CP/M-80), the filesystem meta data did not include the actual _length_ of the data for a text data file.

Since DOS wasn't an OS, then CP/M certainly wasn't.
 


Re: UTF8 and NUL

J. King-3
On 2018-01-26 15:13:46, "Peter Da Silva" <[hidden email]>
wrote:

>On 1/26/18, 2:11 PM, "sqlite-users on behalf of John McKown"
><[hidden email] on behalf of
>[hidden email]> wrote:
>>In the distant past (CP/M-80), the filesystem meta data did not
>>include the actual _length_ of the data for a text data file.
>
>Since DOS wasn't an OS, then CP/M certainly wasn't.

Do you have a point in making either statement? If you do, I'm really
not seeing it.

--
J. King


Re: UTF8 and NUL

Peter da Silva
On 1/26/18, 2:34 PM, "sqlite-users on behalf of J. King" <[hidden email] on behalf of [hidden email]> wrote:
> Do you have a point in making either statement? If you do, I'm really not seeing it.

The point is that apart from CP/M and derivatives like DOS, this kind of behavior is strictly a leftover from the '60s. And CP/M only had this restriction because it was tremendously resource-constrained. It's not a precedent for treating some magic character as an end-of-file marker when virtually every operating system released since 1970 (apart from a couple that derived from this historical anomaly) has had files with byte-precise size metadata.
