UTF8-BOM not disregarded in CSV import


UTF8-BOM not disregarded in CSV import

Mahmoud Al-Qudsi
Hello all,

Let me start off with my apologies if this is a documented issue; I did search the fossil tickets but did not find anything for “BOM”.

As of SQLite 3.19.3, under `.mode csv` and with `.import ……`, SQLite3 includes a BOM (UTF-8) as part of the first column of the first record.

IMHO, this is of particular importance since the latest versions of MS Excel default to “UTF-8 CSV” which includes a BOM. Would anyone be opposed to a patch to SQLite that disregarded a BOM when found during a csv import operation?

Thank you kindly,

Mahmoud Al-Qudsi
NeoSmart Technologies

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
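The symptom is easy to reproduce outside of SQLite. A minimal Python sketch (the column names and values here are made up for illustration) showing how a UTF-8 BOM ends up glued to the first field, and how a BOM-aware codec such as Python's `utf-8-sig` discards it:

```python
import csv
import io

# Simulate a CSV file saved by a tool that prepends a UTF-8 BOM (EF BB BF).
raw = b"\xef\xbb\xbfname,qty\nwidget,3\n"

# Plain UTF-8 decoding keeps U+FEFF glued to the first header field.
naive = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# The utf-8-sig codec discards a single leading BOM, if present.
clean = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))

print(repr(naive[0]))  # '\ufeffname'
print(repr(clean[0]))  # 'name'
```

This is exactly the behaviour the `.import` command exhibits: nothing is corrupted, the three signature bytes are simply decoded as a legitimate U+FEFF character at the start of the first value.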

Re: UTF8-BOM not disregarded in CSV import

Clemens Ladisch
Mahmoud Al-Qudsi wrote:
> with `.import ……`, SQLite3 includes a BOM (UTF-8) as part of the first
> column of the first record.

The Unicode Standard 9.0 says in section 3.10:
| When represented in UTF-8, the byte order mark turns into the byte
| sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream
| is neither required nor recommended by the Unicode Standard,

so you should not use it.

Treating this character as a zero width no-break space, and keeping it,
is a correct interpretation of the file.

> IMHO, this is of particular importance since the latest versions of MS
> Excel default to “UTF-8 CSV” which includes a BOM.

That behaviour is wrong:
| When converting between different encoding schemes, extreme care must
| be taken in handling any initial byte order marks. For example, if one
| converted a UTF-16 byte serialization with an initial byte order mark
| to a UTF-8 byte serialization, thereby converting the byte order mark
| to <EF BB BF> in the UTF-8 form, the <EF BB BF> would now be ambiguous
| as to its status as a byte order mark (from its source) or as an
| initial zero width no-break space. If the UTF-8 byte serialization
| were then converted to UTF-16BE and the initial <EF BB BF> were
| converted to <FE FF>, the interpretation of the U+FEFF character would
| have been modified by the conversion. This would be nonconformant
| behavior according to conformance clause C7, because the change
| between byte serializations would have resulted in modification of the
| interpretation of the text. This is one reason why the use of the
| initial byte sequence <EF BB BF> as a signature on UTF-8 byte
| sequences is not recommended by the Unicode Standard.

And Google Docs also thinks it would be a good idea to act against
this recommendation:
<https://productforums.google.com/forum/#!topic/docs/p_jCTwzuIqk>

> Would anyone be opposed to a patch to SQLite that disregarded a BOM
> when found during a csv import operation?

Well, being wrong doesn't mean that Microsoft or Google will change
their behaviour ...


Regards,
Clemens

Re: UTF8-BOM not disregarded in CSV import

Mahmoud Al-Qudsi
I think you and I are on the same page here, Clemens? I abhor the BOM, but the question is whether or not SQLite will cater to the fact that the bigger names in the industry appear hell-bent on shoving it in users’ documents by default.

Given that ‘.import’ and ‘.mode csv’ are “user mode” commands, perhaps leeway can be shown in breaking with standards for the sake of compatibility and sanity?

Mahmoud


Re: UTF8-BOM not disregarded in CSV import

Cezary H. Noweta
Hello,

On 2017-06-23 22:12, Mahmoud Al-Qudsi wrote:
> I think you and I are on the same page here, Clemens? I abhor the
> BOM, but the question is whether or not SQLite will cater to the fact
> that the bigger names in the industry appear hell-bent on shoving it
> in users’ documents by default.

> Given that ‘.import’ and ‘.mode csv’ are “user mode” commands,
> perhaps leeway can be shown in breaking with standards for the sake
> of compatibility and sanity?

IMHO, this is not a good place to show leeway. The Unicode Standard has
enough bad things in it already; there is no need to turn one of
Unicode's good things into a bad one.

Should SQLite disregard one <EF BB BF> sequence, or all <EF BB BF>
sequences, or at most 2, 3, or 10 of them at the beginning of a file?
Such a stream can be produced by a sequence of conversions done by a mix
of conforming and ``breaking the standard for the sake of
compatibility'' converters.

To be clear: I understand your point very well (``let's ignore an
optional BOM at the beginning''), but I want to show that there is no
limit to such thinking. Why only one optional BOM? You have not pointed
out compatibility with what. The next step is to ignore N BOMs for the
sake of compatibility with breaking the standard for the sake of
compatibility with breaking the standard for the sake of... lim =
\infty. I cannot see any sanity here.

The standard says: ``Only the UTF-16/32 encoding forms (not even
UTF-16/32LE/BE) can contain a BOM''. Let's conform to this.

Certainly, there are no objections to extending the import functionality
so that it ignores an initial U+FEFF. However, to conform to the
standard, an import in its basic form should allow ZWNBSP as the first
character.

-- best regards

Cezary H. Noweta

Re: UTF8-BOM not disregarded in CSV import

Eric Grange
Alas, there is no end in sight to the pain from the Unicode decision not
to make the BOM compulsory for UTF-8.

Making it optional or unnecessary basically made every single text file
ambiguous, requiring non-trivial heuristics and implicit conventions
instead, and resulting in character corruptions that are neither
acceptable to nor understood by users.
Making it compulsory would have meant fixing pre-Unicode *nix
command-line utilities and C string code; much pain, sure, but in
retrospect it would have been a much smarter choice, as everything could
have been settled in a matter of years.

But now, more than 20 years later, UTF-8 storage is still a mess, with no
end in sight :/



Re: UTF8-BOM not disregarded in CSV import

J Decker
In reply to this post by Cezary H. Noweta
On Sun, Jun 25, 2017 at 12:16 PM, Cezary H. Noweta <[hidden email]> wrote:

> The standard says: ``Only UTF-16/32 (even not UTF-16/32LE/BE) encoding
> forms can contain BOM''. Let's conform to this.

I concur with that.

Since UTF-8 is only bytes, what would a BOM even change? It certainly
makes sense for longer code units composed of multiple bytes (16- and
32-bit values), but not when everything is a single byte to begin with.
It's not a Bit Order Mark.




Re: UTF8-BOM not disregarded in CSV import

Rowan Worth-2
In reply to this post by Eric Grange
On 26 June 2017 at 15:09, Eric Grange <[hidden email]> wrote:

> Alas, there is no end in sight to the pain for the Unicode decision to not
> make the BOM compulsory for UTF-8.
>

UTF-8 is byte oriented. The very concept of byte order is nonsense in
this context, as there are no multi-byte storage primitives to worry
about.

> Making it optional or non-necessary basically made every single text file
> ambiguous

Easily solved by never including a superfluous BOM in UTF-8 text.


> But now, more than 20 years later, UTF-8 storage is still a mess, with no
> end in sight :/
>

s/a mess/the one bastion of sanity in a sea of codepage madness/

-Rowan

Re: UTF8-BOM not disregarded in CSV import

Scott Robison-2
On Jun 26, 2017 1:47 AM, "Rowan Worth" <[hidden email]> wrote:

> Easily solved by never including a superfluous BOM in UTF-8 text.

Some people talk about dialing a phone or referring to a remote control as
a clicker, even though most of us don't use pulse based dialing or remote
controls that actually click.

The reality is that interchange of text requires some means to communicate
the encoding, in band or out of band. ZWNBSP (now BOM) was selected as a
handy in band way to distinguish LE from BE fixed size multi-byte text. One
could just as easily call that stupid and demand everyone use network byte
order.

Byte Order Mark isn't perfectly descriptive when used with UTF-8. Neither
is dialing a cell phone. Language evolves.

Maybe people would prefer calling it TEI (Text Encoding Identifier). Then
we could get back to discussion of whether or not stripping U+FEFF from the
beginning of text streams is a good idea. I'm not advocating one way or
another, but if a system strips U+FEFF from a text stream after using it to
determine the encoding, surely it is reasonable to expect that for all
supported encodings. If it doesn't do that for one, it shouldn't do it for
any.

Does SQLite3 support UTF-16 CSV files with BOM/TEI? If not, then UTF-8 need
not. If so, perhaps it should.

As for using a signature at the beginning of UTF-8 text, it certainly can
be useful to distinguish Unicode from code pages & other incompatible
encodings.

That being said, it's not difficult to strip TEI from a file before passing
it to SQLite3 (or any other tool for that matter).
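Stripping it beforehand is indeed trivial. A hedged sketch (the function name is mine, not any SQLite API) that removes a single leading UTF-8 BOM from a file in place before handing it to the shell's `.import`:

```python
def strip_utf8_bom(path: str) -> None:
    """Remove one leading UTF-8 BOM (EF BB BF) from the file, if present."""
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(b"\xef\xbb\xbf"):
        with open(path, "wb") as f:
            f.write(data[3:])  # rewrite without the 3-byte signature
```

Applied twice it is a no-op, so it only ever strips the one signature and never eats a genuine leading U+FEFF that a previous run already exposed.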

Re: UTF8-BOM not disregarded in CSV import

Scott Robison-2
In reply to this post by Cezary H. Noweta
On Jun 25, 2017 1:16 PM, "Cezary H. Noweta" <[hidden email]> wrote:


> Certainly, there are no objections to extend an import's functionality
> in such a way that it ignores the initial 0xFEFF. However, an import
> should allow ZWNBSP as the first character, in its basic form, to be
> conforming to the standard.


If we're going to conform to the standard, U+FEFF's use as ZWNBSP has
been deprecated since Unicode 3.2 in 2002; U+2060 WORD JOINER replaces
it. U+FEFF is now "reserved" for differentiating encodings at the
beginning of a stream of text. It may not be required or recommended,
but it's not forbidden either.

Re: UTF8-BOM not disregarded in CSV import

Rowan Worth-2
In reply to this post by Scott Robison-2
On 26 June 2017 at 16:55, Scott Robison <[hidden email]> wrote:

> Byte Order Mark isn't perfectly descriptive when used with UTF-8. Neither
> is dialing a cell phone. Language evolves.
>

It's not descriptive in the slightest because UTF-8's byte order is
*specified by the encoding*.

> I'm not advocating one way or
> another, but if a system strips U+FEFF from a text stream after using it to
> determine the encoding, surely it is reasonable to expect that for all
> supported encodings.

?? Are you going to strip 0xFE 0xFF from the front of my iso8859-1 encoded
stream and drop my beautiful smiley? þÿ
Different encodings demand different treatment. BOM is an artifact of
16/32-bit unicode encodings and can kindly keep its nose out of [the
relatively elegant] UTF-8.

-Rowan

Re: UTF8-BOM not disregarded in CSV import

Eric Grange
>Easily solved by never including a superflous BOM in UTF-8 text

And that easy option has worked beautifully for 20 years... not.

Yes, BOM is a misnomer, and yes, it "wastes" 3 bytes, but in the real
world "text files" come in a variety of encodings.
No BOM means you have to fire up a whole suite of heuristics or present
the user with choices he/she will not understand.

After 20 years, the choice is between doing the best in an imperfect world,
or perpetuating the issue and blaming others.



Re: UTF8-BOM not disregarded in CSV import

Richard Damon
In reply to this post by Eric Grange
On 6/26/17 3:09 AM, Eric Grange wrote:

> Alas, there is no end in sight to the pain for the Unicode decision to not
> make the BOM compulsory for UTF-8.
>
Perhaps the real issue wasn't making the BOM optional, but giving it TWO
uses, by defining the code point both as a zero-width no-break space
character and as the byte order mark. If its ONLY purpose had been to
allow optionally marking a file as Unicode-encoded, and in which flavor,
it wouldn't have been an issue: all input routines could freely drop it
after noting the file's encoding. Since it does have another meaning,
things become messy, and we are stuck trying to decide which wrong thing
we should do.

--
Richard Damon


Re: UTF8-BOM not disregarded in CSV import

Peter da Silva
In reply to this post by Eric Grange
On 6/26/17, 2:09 AM, "sqlite-users on behalf of Eric Grange" <[hidden email] on behalf of [hidden email]> wrote:
> Alas, there is no end in sight to the pain for the Unicode decision to not make the BOM compulsory for UTF-8.

It’s not actually providing any “byte order” information. It’s only used for round-tripping conversion from other formats that actually require one. Therefore it is not required.

Perhaps it should have been called “UTF-8 mark” instead? Then it could have been arguably recommended.

Regardless, it is what it is.

As for distinguishing UTF-8 from something like 8859.x or CP1255, if the string is all-7-bit it’s ASCII which can be safely treated as UTF-8. If it’s not, then

1. It wouldn’t have had a UTF-8 flag anyway, and
2. odds are very good it’s going to contain at least one byte that’s not valid UTF-8. Then you’re falling back to guessing which 8859.x variation to try.

My call is, just use UTF-8 everywhere and if you have some program that’s producing 8859.x or something else from the last century... fix it. It’s not the UTF-8 storage that’s the mess, it’s the non-UTF-8 storage.
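The fallback logic sketched in the two numbered points above can be written down roughly as follows (an illustration of the heuristic, not anyone's shipping code): pure 7-bit data passes as ASCII, anything else is validated as UTF-8, and only on failure do you fall back to guessing among the legacy code pages:

```python
def classify(data: bytes) -> str:
    """Rough encoding triage in the spirit described above."""
    if all(b < 0x80 for b in data):
        return "ascii"  # 7-bit ASCII is valid UTF-8 as-is
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        return "legacy"  # now you're guessing among 8859.x / codepages
```

The point of the "odds are very good" remark is the last branch: almost any non-trivial 8859.x text contains some byte sequence that is invalid UTF-8, so strict validation catches it.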


Re: UTF8-BOM not disregarded in CSV import

jose isaias cabrera-3
In reply to this post by Eric Grange


I have made a decision to always include the BOM in all my text files,
whether they are UTF-8, UTF-16 or UTF-32, little- or big-endian, and I
think all of us should too.  Just because the "Unicode gurus" didn't
think so does not mean they are right.  I once had a doctor give me the
wrong diagnosis: there were just too many symptoms that looked alike, so
they chose one and went with it.  The same thing happened here; the
Unicode gurus never thought about the problems they would be causing
today.  Some applications do not place a BOM on UTF-8 or UTF-16 files,
and then you have to go and find out which one it is in order to decode
the file correctly.  This can all be prevented by having a BOM.  Yes, I
know I am saying what everybody else is, but what I am also saying is:
let us all use the BOM, and have every application we write welcome the
BOM.  One last thing: every application uses internal file information
to tell whether a file can be read by the application, whether the
application version supports that version of the file, and so on.
UTF-8, UTF-16 and UTF-32, little- or big-endian, should have a BOM.
Thanks.

josé


Re: UTF8-BOM not disregarded in CSV import

Peter da Silva
In reply to this post by Peter da Silva
Just occurred to me: another problem with the BOM is that some people who are *not* writing UTF-8 cargo-cult the BOM in anyway, so you may have to scan the whole file to see whether it's really UTF-8.

You’re better off just assuming UTF-8 everywhere, generating an error (and backing out the operation where possible) when you get a failure, and attacking the broken sources.

OTOH, defensive programming says drop all the BOMs on input anyway.


Re: UTF8-BOM not disregarded in CSV import

Simon Slavin-3
In reply to this post by Eric Grange
Folks, I’m sorry to interrupt but I’ve just woken up to 11 posts in this thread and I see a lot of inaccurate 'facts' posted here.  Rather than pick up on statements in individual posts (which would unfairly pick on some people as being less accurate than others) I’d like to post facts straight from Unicode.org and let you reassess some of the things written earlier.

Position of BOM
---------------

A Byte Order Mark is valid only at the beginning of a data stream.  You never need to scan a file for it.  If you find the sequence of characters for a BOM in the middle of a datastream, it’s not a BOM and you should handle it as if those were Unicode characters in the current encoding (for example ZERO WIDTH NON-BREAKING SPACE).  There is no unicode sequence which means "Encoding is changing.  The next sequence is the new BOM."

If you look at the first few bytes of a file and can’t identify one of the BOMs, there isn’t (a valid) one for that data stream and you can assume the default which is UTF-8.  This is done to allow the use of ASCII text in a datastream which was designed for Unicode.  If you do not implement it, your software will fail for inputs limited by small chipsets or old APIs which can handle only ASCII.

What BOMs indicate
------------------

BOMs indicate both which type of UTF is in use and the byte order.  In other words, you can not only tell UTF-16LE from UTF-16BE, but also UTF-32LE from UTF-16LE.  To identify the encoding, check the beginning of the datastream for these five sequences, starting from the first one listed:

00 00 FE FF  UTF-32, big-endian
FF FE 00 00  UTF-32, little-endian
FE FF        UTF-16, big-endian
FF FE        UTF-16, little-endian
EF BB BF     UTF-8

As you can see, Having a datastream start with FE FF does not definitely tell you that it’s a UTF-16 datastream.  Be careful.  Also be careful of software/protocols/APIs which assume that 00 bytes indicate the end of a datastream.

As you can see, although the BOMs for 16 and 32 bit formats are the same size as those formats, this is not true of the BOM for UTF-8.  Be careful.
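The "starting from the first one listed" ordering matters in code as well. A small sketch of BOM sniffing following the table above (defaulting to UTF-8 when no signature matches, per the earlier paragraph):

```python
# (signature, encoding) pairs in the order given above: longer
# signatures first, so FF FE 00 00 (UTF-32LE) is not mistaken for
# the FF FE (UTF-16LE) prefix it starts with.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
]

def sniff_encoding(data: bytes) -> str:
    for bom, enc in BOMS:
        if data.startswith(bom):
            return enc
    return "utf-8"  # no BOM found: assume the UTF-8 default
```

Checking the sequences in any other order would misidentify a UTF-32LE stream as UTF-16LE, since both begin with FF FE.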

How to handle BOMs in software/protocols/APIs
---------------------------------------------

Establish whether each field can handle all kinds of Unicode and understands BOMs, or whether the field understands only one kind of Unicode.  If the latter, state this in the documentation, including which kind of Unicode it understands.

There is no convention for "This software understands both UTF-16BE and UTF-16LE but nothing else.".  If it handles any BOMs, it should handle all five.  However, it can handle them by identifying, for example, UTF-32BE and returning an error indicating that it can’t handle any encodings which aren’t 16 bit.

Try to be consistent across all fields in your protocol/API.

References:

<http://unicode.org/faq/utf_bom.html>

Re: UTF8-BOM not disregarded in CSV import

Scott Robison-2
In reply to this post by Rowan Worth-2
On Jun 26, 2017 4:05 AM, "Rowan Worth" <[hidden email]> wrote:

> On 26 June 2017 at 16:55, Scott Robison <[hidden email]> wrote:
>
>> Byte Order Mark isn't perfectly descriptive when used with UTF-8. Neither
>> is dialing a cell phone. Language evolves.
>
> It's not descriptive in the slightest because UTF-8's byte order is
> *specified by the encoding*.

I fear you may not have read my entire email or at least have missed my
point.

>> I'm not advocating one way or
>> another, but if a system strips U+FEFF from a text stream after using it
>> to determine the encoding, surely it is reasonable to expect that for all
>> supported encodings.
>
> ?? Are you going to strip 0xFE 0xFF from the front of my iso8859-1 encoded
> stream and drop my beautiful smiley? þÿ
> Different encodings demand different treatment. BOM is an artifact of
> 16/32-bit unicode encodings and can kindly keep its nose out of [the
> relatively elegant] UTF-8.


One, I'm not going to do anything. Two, clearly I'm talking about the three-
byte UTF-8 sequence that decodes to U+FEFF. Three, you are correct about
different encodings. I was trying to move the discussion past the idea of
byte order, when what we're really talking about is encoding detection.
ZWNBSP was used for encoding detection because it had a convenient property
that allowed differentiation between multiple encodings and could be safely
ignored. The fact that the Unicode folks renamed it BOM instead of TEI or
BEM or whatever doesn't mean it can't be used with other Unicode
transformations. It is neither required, recommended, nor forbidden with
UTF-8; it's up to the systems exchanging data to decide how to deal with it.
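Scott's point, that the three bytes in question are simply U+FEFF serialized under UTF-8, is easy to verify. Python is used here only as a convenient way to show the byte sequences; it is not part of the original discussion.

```python
# The single code point U+FEFF produces a different, recognizable byte
# signature under each Unicode transformation format, which is exactly
# what makes it usable for encoding detection.
assert "\ufeff".encode("utf-8") == b"\xef\xbb\xbf"
assert "\ufeff".encode("utf-16-be") == b"\xfe\xff"
assert "\ufeff".encode("utf-16-le") == b"\xff\xfe"
assert "\ufeff".encode("utf-32-be") == b"\x00\x00\xfe\xff"
assert "\ufeff".encode("utf-32-le") == b"\xff\xfe\x00\x00"
```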

Re: UTF8-BOM not disregarded in CSV import

Peter da Silva
In reply to this post by Simon Slavin-3
I didn’t mean to imply you had to scan the whole content for a BOM, but rather for illegal characters in the absence of a BOM.

On 6/26/17, 10:02 AM, "sqlite-users on behalf of Simon Slavin" <[hidden email] on behalf of [hidden email]> wrote:

    Folks, I’m sorry to interrupt but I’ve just woken up to 11 posts in this thread and I see a lot of inaccurate 'facts' posted here.  Rather than pick up on statements in individual posts (which would unfairly pick on some people as being less accurate than others) I’d like to post facts straight from Unicode.org and let you reassess some of the things written earlier.
   
    Position of BOM
    ---------------
   
    A Byte Order Mark is valid only at the beginning of a data stream.  You never need to scan a file for it.  If you find the sequence of characters for a BOM in the middle of a datastream, it’s not a BOM and you should handle it as if those were Unicode characters in the current encoding (for example ZERO WIDTH NON-BREAKING SPACE).  There is no unicode sequence which means "Encoding is changing.  The next sequence is the new BOM."
   
    If you look at the first few bytes of a file and can’t identify one of the BOMs, there isn’t (a valid) one for that data stream and you can assume the default which is UTF-8.  This is done to allow the use of ASCII text in a datastream which was designed for Unicode.  If you do not implement it, your software will fail for inputs limited by small chipsets or old APIs which can handle only ASCII.
   
    What BOMs indicate
    ------------------
   
    BOMs indicate both which type of UTF is in use as well as the byte order.  In other words you can not only tell UTF-16LE from UTF-16BE, but you can also tell UTF-32LE from UTF-16LE.  To identify the encoding, check the beginning of the datastream for these five sequences, starting from the first one listed:
   
    00 00 FE FF UTF-32, big-endian
    FF FE 00 00 UTF-32, little-endian
    FE FF       UTF-16, big-endian
    FF FE       UTF-16, little-endian
    EF BB BF    UTF-8
   
    As you can see, having a datastream start with FF FE does not definitely tell you that it’s a UTF-16 datastream (it may be the start of the UTF-32 little-endian mark FF FE 00 00).  Be careful.  Also be careful of software/protocols/APIs which assume that 00 bytes indicate the end of a datastream.
   
    As you can see, although the BOMs for the 16 and 32 bit formats are the same size as those formats’ code units, this is not true of the BOM for UTF-8, which is three bytes long.  Be careful.
   
    How to handle BOMs in software/protocols/APIs
    ---------------------------------------------
   
    Establish whether each field can handle all kinds of Unicode and understands BOMs, or whether the field understands only one kind of Unicode.  If the latter, state this in the documentation, including which kind of Unicode it understands.
   
    There is no convention for "This software understands both UTF-16BE and UTF-16LE but nothing else."  If it handles any BOMs, it should handle all five.  However, it can handle them by identifying, for example, UTF-32BE and returning an error indicating that it can’t handle any encodings which aren’t 16-bit.
   
    Try to be consistent across all fields in your protocol/API.
   
    References:
   
    <http://unicode.org/faq/utf_bom.html>
   


Re: UTF8-BOM not disregarded in CSV import

Scott Robison-2
In reply to this post by Simon Slavin-3
On Jun 26, 2017 9:02 AM, "Simon Slavin" <[hidden email]> wrote:

There is no convention for "This software understands both UTF-16BE and
UTF-16LE but nothing else."  If it handles any BOMs, it should handle all
five.  However, it can handle them by identifying, for example, UTF-32BE
and returning an error indicating that it can’t handle any encodings which
aren’t 16-bit.

Try to be consistent across all fields in your protocol/API.

References:

<http://unicode.org/faq/utf_bom.html>


+1

FAQ quote:

Q: When a BOM is used, is it only in 16-bit Unicode text?

A: No, a BOM can be used as a signature no matter how the Unicode text is
transformed: UTF-16, UTF-8, or UTF-32.

Re: UTF8-BOM not disregarded in CSV import

Cezary H. Noweta
In reply to this post by jose isaias cabrera-3
On 2017-06-26 15:01, jose isaias cabrera wrote:

> I have made a decision to always include the BOM in all my text files
> whether they are UTF8, UTF16 or UTF32, little or big endian. I think
> all of us should also.

I'm sorry if I introduced ambiguity, but I was describing the behavior
of SQLite and the SQLite shell -- nothing more. I'm not campaigning to
kill all UTF-8 BOMs in the universe.

> Just because the "Unicode Gurus" didn't think so, does not mean they
> are right. I had a doctor give me the wrong diagnosis. There were
> just too many symptoms that looked alike, and they chose one and went
> with it. The same thing happened with the Unicode Gurus: they never
> thought about the problems they would be causing today. Some
> applications do not place a BOM on UTF8 or UTF16 files, and then you
> have to go and find which one it is, and decode the file correctly.

The problem you described was not introduced or created by the
``Unicode Gurus''. AFAIR, finding the correct encoding/codepage of
files with an unknown origin was already a problem in the olden days,
far before Unicode. UTF-8 is far more easily recognizable than the others.
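That recognizability can be demonstrated with a rough heuristic. This sketch is my own illustration (not anything SQLite does): a stream that decodes cleanly as strict UTF-8 is very likely UTF-8, whereas almost any byte stream "decodes" under a single-byte code page.

```python
# Rough heuristic sketch: UTF-8's multi-byte structure is strict, so
# most non-UTF-8 data fails a strict decode. A single-byte code page
# such as Latin-1 accepts every byte, so it offers no such signal.
def looks_like_utf8(data):
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

assert not looks_like_utf8(b"\xfe\xff")          # FE FF is invalid UTF-8
assert looks_like_utf8("naïve".encode("utf-8"))  # valid multi-byte UTF-8
```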

> This can all be prevented by having a BOM.

This would have helped only if the world contained nothing but UTF-8
and some single-byte code page.

> Yes, I know I am saying everything everybody else is, but what I am
> also saying is to let us all use the BOM, and also have every
> application we write welcome the BOM.

What if I want to place the code point U+FEFF at the beginning of a
UTF-8 file? Would I need a second EF BB BF to serve as the BOM? OK --
but the standard says ``there is no BOM''. This is what the standard is
for. I agree with you -- where a character set is unmarked, a UTF-8 BOM
is useful as an encoding signature. However, where SQLite accepts only
UTF-8, I expect that a leading EF BB BF will be interpreted as a code
point -- not as a BOM -- unless SQLite says explicitly: ``My
interpretation of EF BB BF is BOM''.
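The two interpretations can be seen side by side in, for example, Python's codecs (an illustration only; SQLite itself is written in C): the plain `utf-8` codec treats a leading EF BB BF as the code point U+FEFF, while `utf-8-sig` treats it as a signature and strips it.

```python
# One byte sequence, two readings: code point versus signature.
data = b"\xef\xbb\xbfname,age"

assert data.decode("utf-8") == "\ufeffname,age"  # BOM kept as U+FEFF
assert data.decode("utf-8-sig") == "name,age"    # BOM stripped
```

The `utf-8-sig` reading is what the original poster asked the SQLite shell's `.import` to adopt; the plain `utf-8` reading is the behavior Cezary defends.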

I am not going to say whether zero or all of my UTF-8 files have a BOM,
nor whether to use, kill, or welcome all UTF-8 BOMs -- it does not
matter. However, in the case of SQLite, Clemens' arguments are very
compelling -- I hope the SQLite shell's behavior will not change, for
the sake of standard conformance, and thus predictability and determinism.

-- best regards

Cezary H. Noweta