# UTF8-BOM not disregarded in CSV import

24 messages

## UTF8-BOM not disregarded in CSV import

Hello all,

Let me start off with my apologies if this is a documented issue; I did search the fossil tickets but did not find anything for “BOM”.

As of SQLite 3.19.3, under .mode csv and with .import ……, SQLite3 includes a BOM (UTF-8) as part of the first column of the first record. IMHO, this is of particular importance since the latest versions of MS Excel default to “UTF-8 CSV”, which includes a BOM.

Would anyone be opposed to a patch to SQLite that disregarded a BOM when found during a CSV import operation?

Thank you kindly,

Mahmoud Al-Qudsi
NeoSmart Technologies
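For concreteness, the kind of change being proposed is small. The sketch below is purely illustrative: it is not SQLite's actual shell.c code, and the helper name is made up. It shows how a CSV reader working from a seekable stdio stream could consume a leading UTF-8 signature before parsing the first record:

```c
#include <stdio.h>

/* Hypothetical helper: if the stream begins with the UTF-8 signature
** EF BB BF, consume it; otherwise rewind so no bytes are lost.
** Assumes a seekable FILE* opened in binary mode. */
static void skip_utf8_bom(FILE *in){
  unsigned char bom[3];
  size_t n = fread(bom, 1, 3, in);
  if( n==3 && bom[0]==0xEF && bom[1]==0xBB && bom[2]==0xBF ){
    return;                 /* BOM consumed; parsing starts after it */
  }
  fseek(in, 0, SEEK_SET);   /* not a BOM: rewind and keep every byte */
}
```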

## Re: UTF8-BOM not disregarded in CSV import

Mahmoud Al-Qudsi wrote:
> with .import ……, SQLite3 includes a BOM (UTF-8) as part of the first
> column of the first record.

The Unicode Standard 9.0 says in section 3.10:

| When represented in UTF-8, the byte order mark turns into the byte
| sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream
| is neither required nor recommended by the Unicode Standard,

so you should not use it. Treating this character as a zero width no-break space, and keeping it, is a correct interpretation of the file.

> IMHO, this is of particular importance since the latest versions of MS
> Excel default to “UTF-8 CSV” which includes a BOM.

That's wrong:

| When converting between different encoding schemes, extreme care must
| be taken in handling any initial byte order marks. For example, if one
| converted a UTF-16 byte serialization with an initial byte order mark
| to a UTF-8 byte serialization, thereby converting the byte order mark
| to <EF BB BF> in the UTF-8 form, the <EF BB BF> would now be ambiguous
| as to its status as a byte order mark (from its source) or as an
| initial zero width no-break space. If the UTF-8 byte serialization
| were then converted to UTF-16BE and the initial <EF BB BF> were
| converted to <FE FF>, the interpretation of the U+FEFF character would
| have been modified by the conversion. This would be nonconformant
| behavior according to conformance clause C7, because the change
| between byte serializations would have resulted in modification of the
| interpretation of the text. This is one reason why the use of the
| initial byte sequence <EF BB BF> as a signature on UTF-8 byte
| sequences is not recommended by the Unicode Standard.

And Google Docs also thinks it would be a good idea to act against this recommendation.

> Would anyone be opposed to a patch to SQLite that disregarded a BOM
> when found during a csv import operation?

Well, being wrong doesn't mean that Microsoft or Google will change their behaviour ...

Regards,
Clemens

## Re: UTF8-BOM not disregarded in CSV import

I think you and I are on the same page here, Clemens? I abhor the BOM, but the question is whether or not SQLite will cater to the fact that the bigger names in the industry appear hell-bent on shoving it in users’ documents by default.

Given that ‘.import’ and ‘.mode csv’ are “user mode” commands, perhaps leeway can be shown in breaking with standards for the sake of compatibility and sanity?

Mahmoud

## Re: UTF8-BOM not disregarded in CSV import

Hello,

On 2017-06-23 22:12, Mahmoud Al-Qudsi wrote:
> I think you and I are on the same page here, Clemens? I abhor the
> BOM, but the question is whether or not SQLite will cater to the fact
> that the bigger names in the industry appear hell-bent on shoving it
> in users’ documents by default.
> Given that ‘.import’ and ‘.mode csv’ are “user mode” commands,
> perhaps leeway can be shown in breaking with standards for the sake
> of compatibility and sanity?

IMHO, this is not a good way to show leeway. The Unicode Standard has enough bad things in itself. It is not necessary to transform a good thing in Unicode into a bad one.

Should SQLite disregard one <EF BB BF> sequence, or all <EF BB BF> sequences, or at most 2, 3, 10 of them at the beginning of a file? Such a stream can be produced by a chain of conversions done by a mix of conforming converters and "breaking the standard for the sake of compatibility" converters.

To be clear: I understand your point very well ("let's ignore an optional BOM at the beginning"), but I want to show that there is no limit to such thinking. Why only one optional BOM? You have not pointed out compatibility with what. The next step is to ignore N BOMs for the sake of compatibility with breaking the standard for the sake of compatibility with breaking the standard for the sake of... lim = \infty. I cannot see any sanity here.

The standard says: "Only the UTF-16/32 (not even the UTF-16/32LE/BE) encoding forms can contain a BOM". Let's conform to this.

Certainly, there are no objections to extending an import's functionality in such a way that it ignores an initial 0xFEFF. However, an import should, in its basic form, allow ZWNBSP as the first character, to be conforming to the standard.

-- best regards

Cezary H. Noweta

## Re: UTF8-BOM not disregarded in CSV import

Alas, there is no end in sight to the pain from the Unicode decision not to make the BOM compulsory for UTF-8.

Making it optional or unnecessary basically made every single text file ambiguous, with non-trivial heuristics and implicit conventions required instead, resulting in character corruptions that are neither acceptable to nor understood by users.

Making it compulsory would have left pre-Unicode *nix command-line utilities and C string code in need of fixing, much pain, sure, but in retrospect this would have been a much smarter choice, as everything could have been settled in a matter of years.

But now, more than 20 years later, UTF-8 storage is still a mess, with no end in sight :/

## Re: UTF8-BOM not disregarded in CSV import

In reply to this post by Cezary H. Noweta

On Sun, Jun 25, 2017 at 12:16 PM, Cezary H. Noweta <[hidden email]> wrote:
> The standard says: "Only the UTF-16/32 (not even the UTF-16/32LE/BE)
> encoding forms can contain a BOM". Let's conform to this.

I concur with that. Since UTF-8 is only bytes, what would a BOM even change? It certainly makes sense for longer values composed of multiple bytes (16- or 32-bit values), but not when everything is only a byte to begin with. It's not a BitOrderMark.

## Re: UTF8-BOM not disregarded in CSV import

In reply to this post by Eric Grange

On 26 June 2017 at 15:09, Eric Grange <[hidden email]> wrote:
> Alas, there is no end in sight to the pain from the Unicode decision not
> to make the BOM compulsory for UTF-8.

UTF-8 is byte oriented. The very concept of byte order is nonsense in this context, as there are no multi-byte storage primitives to worry about.

> Making it optional or unnecessary basically made every single text file
> ambiguous

Easily solved by never including a superfluous BOM in UTF-8 text.

> But now, more than 20 years later, UTF-8 storage is still a mess, with no
> end in sight :/

s/a mess/the one bastion of sanity in a sea of codepage madness/

-Rowan

## Re: UTF8-BOM not disregarded in CSV import

On Jun 26, 2017 1:47 AM, "Rowan Worth" <[hidden email]> wrote:
> UTF-8 is byte oriented. The very concept of byte order is nonsense in
> this context, as there are no multi-byte storage primitives to worry about.
>
> Easily solved by never including a superfluous BOM in UTF-8 text.

Some people talk about dialing a phone or refer to a remote control as a clicker, even though most of us don't use pulse-based dialing or remote controls that actually click. The reality is that interchange of text requires some means to communicate the encoding, in band or out of band. ZWNBSP (now BOM) was selected as a handy in-band way to distinguish LE from BE fixed-size multi-byte text. One could just as easily call that stupid and demand everyone use network byte order.

Byte Order Mark isn't perfectly descriptive when used with UTF-8. Neither is dialing a cell phone. Language evolves. Maybe people would prefer calling it TEI (Text Encoding Identifier). Then we could get back to the discussion of whether or not stripping U+FEFF from the beginning of text streams is a good idea.

I'm not advocating one way or another, but if a system strips U+FEFF from a text stream after using it to determine the encoding, surely it is reasonable to expect that for all supported encodings. If it doesn't do that for one, it shouldn't do it for any. Does SQLite3 support UTF-16 CSV files with BOM/TEI? If not, then UTF-8 need not. If so, perhaps it should.

As for using a signature at the beginning of UTF-8 text, it certainly can be useful to distinguish Unicode from code pages and other incompatible encodings. That being said, it's not difficult to strip the TEI from a file before passing it to SQLite3 (or any other tool for that matter).
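A minimal sketch of that last point, entirely outside SQLite (the program name and approach are only an example): a filter that copies stdin to stdout, dropping a single leading EF BB BF if present, so the cleaned file can then be fed to .import.

```c
#include <stdio.h>

/* strip-bom.c: copy stdin to stdout, dropping one leading UTF-8 BOM.
** Hypothetical usage: ./strip-bom < excel.csv > clean.csv            */
int main(void){
  int b0 = getchar();
  if( b0==0xEF ){
    int b1 = getchar();
    if( b1==0xBB ){
      int b2 = getchar();
      if( b2==0xBF ){
        /* leading EF BB BF found: emit nothing for these three bytes */
      }else{
        putchar(b0); putchar(b1); if( b2!=EOF ) putchar(b2);
      }
    }else{
      putchar(b0); if( b1!=EOF ) putchar(b1);
    }
  }else if( b0!=EOF ){
    putchar(b0);
  }
  for(int c; (c=getchar())!=EOF; ){
    putchar(c);  /* copy the remainder of the stream unchanged */
  }
  return 0;
}
```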

## Re: UTF8-BOM not disregarded in CSV import

In reply to this post by Cezary H. Noweta

On Jun 25, 2017 1:16 PM, "Cezary H. Noweta" <[hidden email]> wrote:
> Certainly, there are no objections to extending an import's functionality
> in such a way that it ignores an initial 0xFEFF. However, an import
> should, in its basic form, allow ZWNBSP as the first character, to be
> conforming to the standard.

If we're going to conform to the standard, U+FEFF has been deprecated as ZWNBSP since Unicode 3.2 in 2002. U+2060 is the Word Joiner now. U+FEFF is now "reserved" for differentiation of encodings at the beginning of a stream of text. It may not be required or recommended, but it's not forbidden either.

## Re: UTF8-BOM not disregarded in CSV import

In reply to this post by Scott Robison-2

On 26 June 2017 at 16:55, Scott Robison <[hidden email]> wrote:
> Byte Order Mark isn't perfectly descriptive when used with UTF-8. Neither
> is dialing a cell phone. Language evolves.

It's not descriptive in the slightest, because UTF-8's byte order is *specified by the encoding*.

> I'm not advocating one way or another, but if a system strips U+FEFF
> from a text stream after using it to determine the encoding, surely it
> is reasonable to expect that for all supported encodings.

?? Are you going to strip 0xFE 0xFF from the front of my iso8859-1 encoded stream and drop my beautiful smiley? þÿ

Different encodings demand different treatment. BOM is an artifact of 16/32-bit unicode encodings and can kindly keep its nose out of [the relatively elegant] UTF-8.

-Rowan

## Re: UTF8-BOM not disregarded in CSV import

> Easily solved by never including a superfluous BOM in UTF-8 text

And that easy option has worked beautifully for 20 years... not.

Yes, BOM is a misnomer, and yes, it "wastes" 3 bytes, but in the real world "text files" have a variety of encodings. No BOM = you have to fire up a whole suite of heuristics or present the user with choices he/she will not understand.

After 20 years, the choice is between doing the best we can in an imperfect world, or perpetuating the issue and blaming others.

## Re: UTF8-BOM not disregarded in CSV import

In reply to this post by Eric Grange

On 6/26/17 3:09 AM, Eric Grange wrote:
> Alas, there is no end in sight to the pain from the Unicode decision not
> to make the BOM compulsory for UTF-8.

Perhaps the real issue wasn't in making the BOM optional, but in giving it TWO uses, by defining the symbol both as a Zero-Width Non-Breaking Space character and as the Byte Order Mark.

If its ONLY purpose had been to allow for the optional marking of a file as being encoded in Unicode, and with which flavor, then it wouldn't have been an issue: all input routines could freely drop it after noting the encoding of the file. Since it does have another meaning, things become messy, and we are stuck trying to decide which wrong thing we should do.

-- Richard Damon

## Re: UTF8-BOM not disregarded in CSV import

In reply to this post by Eric Grange

On 6/26/17, 2:09 AM, "sqlite-users on behalf of Eric Grange" <[hidden email] on behalf of [hidden email]> wrote:
> Alas, there is no end in sight to the pain from the Unicode decision not
> to make the BOM compulsory for UTF-8.

It’s not actually providing any “byte order” information. It’s only used for round-tripping conversion from other formats that actually require one. Therefore it is not required. Perhaps it should have been called the “UTF-8 mark” instead? Then it could have been arguably recommended. Regardless, it is what it is.

As for distinguishing UTF-8 from something like 8859.x or CP1255: if the string is all 7-bit, it’s ASCII, which can be safely treated as UTF-8. If it’s not, then 1. it wouldn’t have had a UTF-8 flag anyway, and 2. odds are very good it’s going to contain at least one byte that’s not valid UTF-8. Then you’re falling back to guessing which 8859.x variation to try.

My call is, just use UTF-8 everywhere, and if you have some program that’s producing 8859.x or something else from the last century... fix it. It’s not the UTF-8 storage that’s the mess, it’s the non-UTF-8 storage.
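As an illustration of that heuristic (nothing here is an SQLite API; the function name and enum are invented for the example), a rough classifier might look like the sketch below. It checks only lead/continuation byte patterns and deliberately skips the finer rules about overlong forms and surrogates:

```c
#include <stddef.h>

enum TextKind { TEXT_ASCII, TEXT_UTF8, TEXT_OTHER };

/* Classify a buffer as 7-bit ASCII, plausible UTF-8, or something else
** (likely a legacy 8-bit code page). */
enum TextKind classify(const unsigned char *p, size_t n){
  int sawHighBit = 0;
  for(size_t i=0; i<n; ){
    unsigned char c = p[i];
    if( c<0x80 ){ i++; continue; }          /* plain ASCII byte */
    sawHighBit = 1;
    size_t len;                             /* expected sequence length */
    if( (c&0xE0)==0xC0 && c>=0xC2 )      len = 2;  /* C2..DF */
    else if( (c&0xF0)==0xE0 )            len = 3;  /* E0..EF */
    else if( (c&0xF8)==0xF0 && c<=0xF4 ) len = 4;  /* F0..F4 */
    else return TEXT_OTHER;                        /* invalid lead byte */
    if( i+len>n ) return TEXT_OTHER;               /* truncated sequence */
    for(size_t k=1; k<len; k++){
      if( (p[i+k]&0xC0)!=0x80 ) return TEXT_OTHER; /* bad continuation */
    }
    i += len;
  }
  return sawHighBit ? TEXT_UTF8 : TEXT_ASCII;
}
```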

## Re: UTF8-BOM not disregarded in CSV import

In reply to this post by Eric Grange

I have made a decision to always include the BOM in all my text files, whether they are UTF8, UTF16 or UTF32, little or big endian. I think all of us should also. Just because the "Unicode Gurus" didn't think so does not mean they are right. I had a doctor give me a wrong diagnosis once: there were just too many symptoms that looked alike, and they chose one and went with it. The same thing happened with the Unicode Gurus: they never thought about the problems they would be causing today. Some applications do not place a BOM on UTF8 or UTF16 files, and then you have to go and figure out which one it is in order to decode the file correctly. This can all be prevented by having a BOM.

Yes, I know I am saying everything everybody else is, but what I am also saying is: let us all use the BOM, and let every application we write welcome the BOM. One last thing: every application uses internal file information to tell whether a file can be read by the application, whether the application version supports that version of the file, etc. UTF8, UTF16 and UTF32, little or big endian, should have a BOM. Thanks.

josé

## Re: UTF8-BOM not disregarded in CSV import

In reply to this post by Peter da Silva

Just occurred to me: another problem with the BOM is that some people who are *not* writing UTF-8 are cargo-culting the BOM in anyway. So you may have to scan the whole file to see if it’s really UTF-8 anyway.

You’re better off just assuming UTF-8 everywhere, generating an error (and backing out the operation where possible) when you get a failure, and attacking the broken sources. OTOH, defensive programming says drop all the BOMs on input anyway.

## Re: UTF8-BOM not disregarded in CSV import

In reply to this post by Eric Grange

Folks, I’m sorry to interrupt, but I’ve just woken up to 11 posts in this thread and I see a lot of inaccurate 'facts' posted here. Rather than pick up on statements in individual posts (which would unfairly pick on some people as being less accurate than others), I’d like to post facts straight from Unicode.org and let you reassess some of the things written earlier.

Position of BOM
---------------

A Byte Order Mark is valid only at the beginning of a data stream. You never need to scan a file for it. If you find the sequence of characters for a BOM in the middle of a datastream, it’s not a BOM and you should handle it as if those were Unicode characters in the current encoding (for example ZERO WIDTH NON-BREAKING SPACE). There is no Unicode sequence which means "Encoding is changing. The next sequence is the new BOM."

If you look at the first few bytes of a file and can’t identify one of the BOMs, there isn’t (a valid) one for that data stream and you can assume the default, which is UTF-8. This is done to allow the use of ASCII text in a datastream which was designed for Unicode. If you do not implement it, your software will fail for inputs limited by small chipsets or old APIs which can handle only ASCII.

What BOMs indicate
------------------

BOMs indicate both which type of UTF is in use as well as the byte order. In other words, you can not only tell UTF-16LE from UTF-16BE, but you can also tell UTF-32LE from UTF-16LE. To identify the encoding, check the beginning of the datastream for these five sequences, starting from the first one listed:

00 00 FE FF   UTF-32, big-endian
FF FE 00 00   UTF-32, little-endian
FE FF         UTF-16, big-endian
FF FE         UTF-16, little-endian
EF BB BF      UTF-8

As you can see, having a datastream start with FE FF does not definitely tell you that it’s a UTF-16 datastream. Be careful. Also be careful of software/protocols/APIs which assume that 00 bytes indicate the end of a datastream.

As you can see, although the BOMs for the 16- and 32-bit formats are the same size as those formats, this is not true of the BOM for UTF-8. Be careful.

How to handle BOMs in software/protocols/APIs
---------------------------------------------

Establish whether each field can handle all kinds of Unicode and understands BOMs, or whether the field understands only one kind of Unicode. If the latter, state this in the documentation, including which kind of Unicode it understands.

There is no convention for "This software understands both UTF-16BE and UTF-16LE but nothing else.". If it handles any BOMs, it should handle all five. However, it can handle them by identifying, for example, UTF-32BE and returning an error indicating that it can’t handle any encodings which aren’t 16 bit.

Try to be consistent across all fields in your protocol/API.

References:
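To make the detection order concrete, here is a small illustrative sketch in C of the check described above. The longer signatures are tested first so that FF FE 00 00 is not misread as a UTF-16LE BOM; the function name, return convention, and the no-BOM default are all invented for the example:

```c
#include <stddef.h>
#include <string.h>

/* Return the encoding name indicated by a leading BOM, and set *bomLen
** to the number of signature bytes to skip (0 if there is no BOM). */
const char *detect_bom(const unsigned char *p, size_t n, size_t *bomLen){
  static const struct { const char *name; const char *sig; size_t len; } bom[] = {
    { "UTF-32BE", "\x00\x00\xFE\xFF", 4 },
    { "UTF-32LE", "\xFF\xFE\x00\x00", 4 },
    { "UTF-16BE", "\xFE\xFF",         2 },
    { "UTF-16LE", "\xFF\xFE",         2 },
    { "UTF-8",    "\xEF\xBB\xBF",     3 },
  };
  for(size_t i=0; i<sizeof(bom)/sizeof(bom[0]); i++){
    if( n>=bom[i].len && memcmp(p, bom[i].sig, bom[i].len)==0 ){
      *bomLen = bom[i].len;
      return bom[i].name;
    }
  }
  *bomLen = 0;
  return "UTF-8";   /* no BOM: the default suggested above */
}
```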

## Re: UTF8-BOM not disregarded in CSV import

In reply to this post by Rowan Worth-2

On Jun 26, 2017 4:05 AM, "Rowan Worth" <[hidden email]> wrote:
> It's not descriptive in the slightest, because UTF-8's byte order is
> *specified by the encoding*.

I fear you may not have read my entire email, or at least have missed my point.

> ?? Are you going to strip 0xFE 0xFF from the front of my iso8859-1
> encoded stream and drop my beautiful smiley? þÿ
> Different encodings demand different treatment. BOM is an artifact of
> 16/32-bit unicode encodings and can kindly keep its nose out of [the
> relatively elegant] UTF-8.

One, I'm not going to do anything. Two, clearly I'm talking about the three-byte UTF-8 sequence that decodes to U+FEFF. Three, you are correct about different encodings. I was trying to move the discussion past the idea of byte order when what we're really talking about is encoding detection.

ZWNBSP was used for encoding detection because it had a convenient property that allowed differentiation between multiple encodings and could be safely ignored. The fact that the Unicode folks renamed it BOM instead of TEI or BEM or whatever doesn't mean it can't be used with other Unicode transformations. It is neither required, recommended, nor forbidden with UTF-8; it's up to systems exchanging data to decide how to deal with it.

## Re: UTF8-BOM not disregarded in CSV import

In reply to this post by Simon Slavin-3

I didn’t mean to imply you had to scan the whole content for a BOM, but rather for illegal characters in the absence of a BOM.

On 6/26/17, 10:02 AM, "sqlite-users on behalf of Simon Slavin" <[hidden email] on behalf of [hidden email]> wrote:
> A Byte Order Mark is valid only at the beginning of a data stream. You
> never need to scan a file for it.

## Re: UTF8-BOM not disregarded in CSV import

In reply to this post by Simon Slavin-3

On Jun 26, 2017 9:02 AM, "Simon Slavin" <[hidden email]> wrote:
> There is no convention for "This software understands both UTF-16BE and
> UTF-16LE but nothing else.". If it handles any BOMs, it should handle all
> five. However, it can handle them by identifying, for example, UTF-32BE
> and returning an error indicating that it can’t handle any encodings
> which aren’t 16 bit.
>
> Try to be consistent across all fields in your protocol/API.

+1

FAQ quote:

Q: When a BOM is used, is it only in 16-bit Unicode text?

A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32.
## Re: UTF8-BOM not disregarded in CSV import

In reply to this post by jose isaias cabrera-3

On 2017-06-26 15:01, jose isaias cabrera wrote:
> I have made a decision to always include the BOM in all my text files,
> whether they are UTF8, UTF16 or UTF32, little or big endian. I think
> all of us should also.

I'm sorry if I introduced ambiguity, but I was describing SQLite's and the SQLite shell's behavior only. I'm not entreating anyone to kill all UTF-8 BOMs in the universe.

> Just because the "Unicode Gurus" didn't think so does not mean they
> are right. [...] Some applications do not place a BOM on UTF8 or UTF16
> files, and then you have to go and figure out which one it is in order
> to decode the file correctly.

The problem you describe was not introduced or created by the "Unicode Gurus". AFAIR, finding the correct encoding/codepage of files with an unknown origin was a problem in the olden days, far before Unicode. UTF-8 is far more easily recognizable than the others.

> This can all be prevented by having a BOM.

That would have helped only if there were just UTF-8 and some single-byte code page.

> Yes, I know I am saying everything everybody else is, but what I am also
> saying is: let us all use the BOM, and let every application we write
> welcome the BOM.

What if I want to place 0xFEFF at the beginning of my UTF-8 text? Do I then add a second EF BB BF as the BOM? OK, but the standard says there is no BOM. This is what the standard is for.

I agree with you: where a character set is unmarked, a UTF-8 BOM is useful as an encoding signature. However, where SQLite accepts only UTF-8, I expect that placing EF BB BF at the beginning will be interpreted as a codepoint, not a BOM, until SQLite says explicitly: "My interpretation of EF BB BF is a BOM".

I am not going to say whether zero or all of my UTF-8 files have a BOM, nor whether to use, kill, or welcome all UTF-8 BOMs; it does not matter. However, in the case of SQLite, Clemens' arguments are very strong, and I hope the SQLite shell's behavior will not change, for the sake of conformance to the standard, and thus predictability and determinism.

-- best regards

Cezary H. Noweta