Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Rowan Worth-2
On 26 June 2017 at 19:03, Eric Grange <[hidden email]> wrote:

> No BOM = you have to fire a whole suite of heuristics or present the user
> with choices he/she will not understand.
>

The need for heuristics to determine text encoding/codepage exists regardless
of whether a BOM is used, since the problem predates Unicode altogether. Let's
consider the scenarios - as Simon enumerated, we're roughly interested in
cases where the data stream begins with one of these byte sequences:

(1) 0x00 0x00 0xFE 0xFF
(2) 0xFF 0xFE 0x00 0x00
(3) 0xFE 0xFF
(4) 0xFF 0xFE
(5) 0xEF 0xBB 0xBF
(6) anything else (and datastream is ASCII or UTF-8)
(7) anything else (and datastream is some random codepage)

In case 7 we have little choice but to invoke heuristics or defer to the
user, yes? For the first 5 cases we can immediately deduce some facts:

(1) -> almost certainly UTF-32BE, although if NUL characters may be present
8-bit codepages are still a candidate
(2) -> almost certainly UTF-32LE, although if NUL characters may be present
UTF-16LE and 8-bit codepages are still candidates
(3) -> likely UTF-16BE, but could be some 8-bit codepage
(4) -> likely UTF-16LE, but could be some 8-bit codepage
(5) -> almost certainly UTF-8, but could be some 8-bit codepage

I observe that BOM never provides perfect confidence regarding the
encoding, although in practice I expect it would only fail on data
specifically designed to fool it.

I also suggest that the checks ought to be performed in the order listed,
to avoid categorising UTF-32LE text as UTF-16LE, and because the first 4
cases rule out [valid] UTF-8 data.
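
To make that ordering concrete, here is a minimal sketch (illustrative C only,
not SQLite code) of a sniffer that tests the five BOM cases in exactly that
order and reports how many bytes the caller should skip:

#include <stddef.h>

typedef enum {
    ENC_UNKNOWN,   /* cases 6/7: no BOM, fall through to other heuristics */
    ENC_UTF32BE, ENC_UTF32LE,
    ENC_UTF16BE, ENC_UTF16LE,
    ENC_UTF8
} Encoding;

/* Returns the encoding suggested by a leading BOM and stores the number of
 * BOM bytes to skip in *bomLen.  The caveats from the list above still apply:
 * an 8-bit codepage file could begin with the same bytes by coincidence. */
Encoding sniff_bom(const unsigned char *p, size_t n, size_t *bomLen) {
    *bomLen = 0;
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) { *bomLen = 4; return ENC_UTF32BE; }
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) { *bomLen = 4; return ENC_UTF32LE; }
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) { *bomLen = 2; return ENC_UTF16BE; }
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) { *bomLen = 2; return ENC_UTF16LE; }
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) { *bomLen = 3; return ENC_UTF8; }
    return ENC_UNKNOWN;   /* no BOM: assume UTF-8/ASCII or run further heuristics */
}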


Now let's say we make the assumption that all text is UTF-8 until proven
otherwise. In case 6 we get lucky and everything works; in case 7 we find
invalid byte sequences and fall back to heuristics or to the user to identify
the encoding.
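
As a rough sketch of that "assume UTF-8 until proven otherwise" check
(illustrative C, nobody's production code), a validator that walks the buffer
and gives up on the first malformed sequence:

#include <stddef.h>

/* Returns 1 if the whole buffer is well-formed UTF-8 (plain ASCII included),
 * 0 as soon as an invalid sequence is found. */
int looks_like_utf8(const unsigned char *p, size_t n) {
    size_t i = 0;
    while (i < n) {
        unsigned char c = p[i];
        size_t len;
        unsigned long cp;
        if (c < 0x80) { i++; continue; }                 /* ASCII byte */
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return 0;                                   /* stray continuation byte or 0xF8..0xFF */
        if (i + len > n) return 0;                       /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            if ((p[i + k] & 0xC0) != 0x80) return 0;     /* missing continuation byte */
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        /* Reject overlong encodings, UTF-16 surrogates and out-of-range code points. */
        if ((len == 2 && cp < 0x80) ||
            (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return 0;
        i += len;
    }
    return 1;
}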

In fact using this assumption we could dispense with the BOM entirely for
UTF-8 and drop case 5 from the list. So my question is, what advantage does
a BOM offer for UTF-8? What other cases can we identify with the
information it provides?

If you were going to jump straight from case 5 to case 7 in the absence of
a BOM, it seems like you might as well give UTF-8 a try, since it and ASCII
are far and away the common case.


I'm sure I've simplified things with this description - have I missed
something crucial? Is the BOM argument about future proofing? Are we
worried about EBCDIC? Is my perspective too anglo-centric?

After 20 years, the choice is between doing the best in an imperfect world,
> or perpetuating the issue and blaming others.
>

By being scalable and general enough to represent all desired characters,
as I see it UTF-8 is not perpetuating any issues but rather offering an out
from historic codepage woes (by adopting it as the go-to interchange
format).

As Peter da Silva said:
> It’s not the UTF-8 storage that’s the mess, it’s the non-UTF-8 storage.

-Rowan

Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Scott Robison-2
On Jun 27, 2017 12:13 AM, "Rowan Worth" <[hidden email]> wrote:

> I'm sure I've simplified things with this description - have I missed
> something crucial? Is the BOM argument about future proofing? Are we
> worried about EBCDIC? Is my perspective too anglo-centric?


The original issue was that two of the largest companies in the world output
the Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of UTF-8
encoded text streams, and it would be friendly for the SQLite3 shell to
skip it or use it for encoding identification in at least some cases.

Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Robert Hairgrove
On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:
> On Jun 27, 2017 12:13 AM, "Rowan Worth" <[hidden email]> wrote:
>
> I'm sure I've simplified things with this description - have I missed
> something crucial? Is the BOM argument about future proofing? Are we
> worried about EBCDIC? Is my perspective too anglo-centric?

Thanks, Scott -- nothing crucial, it is already quite good enough for
99% of use cases.

The Wikipedia page on "Byte Order Marks" appears to be quite
comprehensive and lists about a dozen possible BOM sequences:

https://en.wikipedia.org/wiki/Byte_order_mark

Lacking a BOM, I would certainly try to rule out UTF-8 right away by
searching for invalid UTF-8 byte sequences within a reasonably large
portion of the input (maybe 100-300KB?) before then looking for any
NUL bytes (which, although technically valid UTF-8, rarely appear in text
except as delimiters) or other random control characters.
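
A minimal sketch of that ordering (hypothetical C: looks_like_utf8() is the
validator sketched earlier in the thread, and the 256 KB window is an
arbitrary pick from the 100-300KB range suggested above):

#include <stddef.h>

int looks_like_utf8(const unsigned char *p, size_t n);   /* validator sketched earlier */

typedef enum { GUESS_UTF8, GUESS_WIDE_OR_BINARY, GUESS_8BIT_CODEPAGE } Guess;

Guess guess_encoding(const unsigned char *p, size_t n) {
    size_t window = n < 256 * 1024 ? n : 256 * 1024;   /* bounded sample of the input */
    size_t nuls = 0, ctrls = 0;

    if (looks_like_utf8(p, window))
        return GUESS_UTF8;                   /* also covers plain ASCII */

    for (size_t i = 0; i < window; i++) {
        if (p[i] == 0x00) nuls++;
        else if (p[i] < 0x20 && p[i] != '\t' && p[i] != '\n' && p[i] != '\r') ctrls++;
    }
    if (nuls > 0)
        return GUESS_WIDE_OR_BINARY;         /* NULs hint at UTF-16/32 or non-text data */
    if (ctrls > window / 100)
        return GUESS_WIDE_OR_BINARY;         /* too many stray control characters for text */
    return GUESS_8BIT_CODEPAGE;              /* hand off to codepage heuristics or the user */
}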

As to having the user specify an encoding when dealing with something
which should be text (CSV files, for example), there is always the
possibility that the encoding is different from what the user says, mainly
because they probably clicked on a spreadsheet file with a similar name
instead of the desired text file. If the user specifies an 8-bit encoding
other than Unicode, it gets very difficult to trap wrong input unless you
write routines to search for invalid characters (e.g. distinguishing
between true ISO-8859-x and CP1252).
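
One such routine, sketched in C for the ISO-8859-1 versus CP1252 case:
ISO-8859-1 reserves 0x80-0x9F for rarely used C1 control codes, while CP1252
puts printable characters (euro sign, curly quotes, dashes...) in most of that
range, so bytes there are a hint that the data is really CP1252.

#include <stddef.h>

/* Returns 1 if bytes in 0x80-0x9F suggest CP1252 rather than true ISO-8859-1,
 * 0 if there is no evidence either way, and -1 if a byte falls on one of the
 * five positions CP1252 leaves undefined (probably neither encoding). */
int hints_at_cp1252(const unsigned char *p, size_t n) {
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = p[i];
        if (c >= 0x80 && c <= 0x9F) {
            if (c == 0x81 || c == 0x8D || c == 0x8F || c == 0x90 || c == 0x9D)
                return -1;      /* undefined in CP1252: likely some other encoding */
            hits++;             /* printable in CP1252, control code in ISO-8859-1 */
        }
    }
    return hits > 0;
}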

Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Robert Hairgrove
In reply to this post by Scott Robison-2
On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:
> The original issue was two of the largest companies in the world
> output the
> Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of
> UTF-8
> encoded text streams, and it would be friendly for the SQLite3 shell
> to
> skip it or use it for encoding identification in at least some cases.

I would suggest adding a command-line argument to the shell indicating
whether to ignore a BOM or not, possibly requiring specification of a
certain encoding or list of encodings to consider.

Certainly this should not be a requirement for the library per se, but
a responsibility of the client to provide data in the proper encoding.
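
For reference, the amount of code the original request boils down to is tiny.
A sketch (not the shell's actual implementation) of consuming an optional
leading UTF-8 BOM from an already-opened file:

#include <stdio.h>

/* Skip a UTF-8 BOM if the file starts with one; otherwise leave the read
 * position untouched so no data byte is lost. */
void skip_utf8_bom(FILE *f) {
    unsigned char bom[3];
    long start = ftell(f);
    if (fread(bom, 1, 3, f) == 3 &&
        bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
        return;                        /* BOM consumed; caller reads the real data next */
    fseek(f, start, SEEK_SET);         /* no BOM (or short file): restore the position */
}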

Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Eric Grange
> In case 7 we have little choice but to invoke heuristics or defer to the
> user, yes?

Yes in theory, but "no" in the real world, or rather "not in any way that
matters".

In the real world, text files are heavily skewed towards 8-bit formats,
meaning just three cases dominate the debate:
- ASCII / ANSI
- utf-8 with BOM
- utf-8 without BOM

And further, the overwhelming majority of text content is likely to
involve ASCII at the beginning (from various markup: think html, xml, json,
source code... even csv, because of the explicit separator specification or
the 1st column name).

So while in theory all the scenarios you describe are interesting, in
practice seeing a utf-8 BOM provides an extremely high likelihood that a
file will indeed be utf-8. Not always, but a memory chip could also be hit
by a cosmic ray.

Conversely the absence of a utf-8 BOM means a high probability of
"something undetermined": ANSI or BOMless utf-8, or something more oddball
(in which I lump utf-16, btw)... and the need for heuristics to kick in.

Outside of source code and Linux config files, BOMless utf-8 is certainly
not the most frequent kind of text file; ANSI and various other encodings
dominate, because most non-ASCII text files were (and are) produced under
DOS or Windows, where Notepad and friends use ANSI by default, for
instance.

That may not be a desirable or happy situation, but that is the situation
we have to deal with.

It is also the reason why 20 years later the utf-8 BOM is still in use: it
is explicit and has a practical success rate higher than any of the
heuristics, while collisions of the BOM with the start of actual ANSI (or
other) text are unheard of.
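
As a small illustration of why such collisions are so rare: decoded as CP1252
or ISO-8859-1, the three BOM bytes come out as the characters "ï»¿", a
sequence that essentially never opens a genuine ANSI document. A trivial
snippet to see it:

#include <stdio.h>

int main(void) {
    const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    /* In CP1252/ISO-8859-1 these byte values map to 'ï', '»' and '¿'. */
    fwrite(bom, 1, sizeof bom, stdout);   /* renders as "ï»¿" on a Latin-1 terminal */
    putchar('\n');
    return 0;
}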



Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Robert Hairgrove
On Tue, 2017-06-27 at 12:42 +0200, Eric Grange wrote:
> In the real world, text files are heavily skewed towards 8 bit
> formats,
> meaning just three cases dominate the debate:
> - ASCII / ANSI
> - utf-8 with BOM
> - utf-8 without BOM

ASCII / ANSI is a 7-bit format.

Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Simon Slavin-3
In reply to this post by Rowan Worth-2


On 27 Jun 2017, at 7:12am, Rowan Worth <[hidden email]> wrote:

> In fact using this assumption we could dispense with the BOM entirely for
> UTF-8 and drop case 5 from the list.

If you do that, you will try to process the BOM at the beginning of a UTF-8 stream as if it were ordinary characters.

> So my question is, what advantage does
> a BOM offer for UTF-8? What other cases can we identify with the
> information it provides?

Suppose your software processes only UTF-8 files, but someone feeds it a file which begins with FE FF.  Your software should recognise this and reject the file, telling the user/programmer that it can’t process it because it’s in the wrong encoding.
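
A sketch of that rejection for a UTF-8-only tool (hypothetical code, not from
any particular project):

#include <stddef.h>

/* Returns 0 if the data may be processed as UTF-8, -1 if it starts with a BOM
 * for a different encoding and should be rejected with a clear error message. */
int reject_wrong_unicode(const unsigned char *p, size_t n) {
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) return -1;  /* UTF-32BE */
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) return -1;  /* UTF-32LE */
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return -1;                                  /* UTF-16BE */
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return -1;                                  /* UTF-16LE */
    return 0;   /* no foreign BOM: proceed (and optionally skip a UTF-8 BOM) */
}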

Processing BOMs is part of the work you have to do to make your software Unicode-aware.  Without it, your documentation should state that your software handles the one flavour of Unicode it handles, not Unicode in general.  There’s nothing wrong with this, if it’s all the programmer/user needs, as long as it’s correctly documented.

Simon.

Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Eric Grange
> ASCII / ANSI is a 7-bit format.

ASCII is a 7-bit encoding, but uses 8 bits in just about any implementation
out there. I do not think there is any 7-bit implementation still alive
outside of legacy modes for low-level wire protocols (RS232 etc.). I
personally have never encountered a 7-bit ASCII file (as in bit-packed); I
am curious whether any exist.

ANSI has no precise definition; it's used to lump together all the <= 8-bit
legacy encodings (cf. https://en.wikipedia.org/wiki/ANSI_character_set)


Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Robert Hairgrove
On Tue, 2017-06-27 at 16:38 +0200, Eric Grange wrote:

> >
> > ASCII / ANSI is a 7-bit format.
> ASCII is a 7 bit encoding, but uses 8 bits in just about any
> implementation
> out there. I do not think there is any 7 bit implementation still
> alive
> outside of legacy mode for low-level wire protocols (RS232 etc.). I
> personally have never encountered a 7 bit ASCII file (as in
> bitpacked), I
> am curious if any exists?

If an implementation "uses" 8 bits for ASCII text (as opposed to
hardware storage which is never less than 8 bits for a single C char,
AFAIK), then it is not a valid ASCII implementation, i.e. does not
interpret ASCII according to its definition. The whole point of
specifying a format as 7 bits is that the 8th bit is ignored, or
perhaps used in an implementation-defined manner, regardless of whether
the 8th bit in a char is available or not.

Once an encoding embraces 8 bits, it will be something like CP1252,
ISO-8859-x, KOI8-R, etc. Just not ASCII.



Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Keith Medcalf
 
> If an implementation "uses" 8 bits for ASCII text (as opposed to
> hardware storage which is never less than 8 bits for a single C char,
> AFAIK), then it is not a valid ASCII implementation, i.e. does not
> interpret ASCII according to its definition. The whole point of
> specifying a format as 7 bits is that the 8th bit is ignored, or
> perhaps used in an implementation-defined manner, regardless of whether
> the 8th bit in a char is available or not.

ASCII was designed back in the days of low-reliability serial communications -- you know, back when data was sent using 7 data bits + 1 parity bit + 2 stop bits -- to increase the reliability of the communications.  A "byte" was also 9 bits: 8 bits of data and a parity bit.

Nowadays we use 8 bits for data with no parity, no error correction, and no timing bits.  Cuz when things screw up we want them to REALLY screw up ... and remain undetectable.






Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

John McKown
On Tue, Jun 27, 2017 at 4:02 PM, Keith Medcalf <[hidden email]> wrote:

>
> > If an implementation "uses" 8 bits for ASCII text (as opposed to
> > hardware storage which is never less than 8 bits for a single C char,
> > AFAIK), then it is not a valid ASCII implementation, i.e. does not
> > interpret ASCII according to its definition. The whole point of
> > specifying a format as 7 bits is that the 8th bit is ignored, or
> > perhaps used in an implementation-defined manner, regardless of whether
> > the 8th bit in a char is available or not.
>
> ASCII was designed back in the days of low reliability serial
> communications -- you know, back when data was sent using 7 bit data + 1
> parity bits + 2 stop bits -- to increase the reliability of the
> communications.  A "byte" was also 9 bits.  8 bits of data and a parity bit.
>
> Nowadays we use 8 bits for data with no parity, no error correction, and
> no timing bits.  Cuz when things screw up we want them to REALLY screw up
> ... and remain undetectable.
>

Actually, most _enterprise_ level storage & transmission facilities have
error detection and correction codes which are "transparent" to the
programmer. Almost everybody knows about RAID arrays, which (other than
JBOD) either use "parity" (RAID5 is an example) or are "mirrored" (RAID1).
Most have also heard of ECC RAM. But I'll bet that few have heard
of RAIM memory, which is used on the IBM z series of computers: Redundant
Array of Independent Memory. This is basically "RAID 5" memory. In addition
to the RAID-ness, it still uses ECC as well. Also, unlike with an Intel
machine, if an IBM z suffers a "memory failure", there is usually the
ability for the _hardware_ to recover all the data in the memory module
("block") and transparently copy it to a "phantom" block of memory, which
then takes the place of the block which contains the error. All without
host software intervention.

https://www.ibm.com/developerworks/community/blogs/e0c474f8-3aad-4f01-8bca-f2c12b576ac9/entry/IBM_zEnterprise_redundant_array_of_independent_memory_subsystem


--
Veni, Vidi, VISA: I came, I saw, I did a little shopping.

Maranatha! <><
John McKown

Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Peter da Silva
In reply to this post by Keith Medcalf
On 6/27/17, 4:02 PM, "sqlite-users on behalf of Keith Medcalf" <[hidden email] on behalf of [hidden email]> wrote:
> Nowadays we use 8 bits for data with no parity, no error correction, and no timing bits.  Cuz when things screw up we want them to REALLY screw up ... and remain undetectable.
   
Nowadays we use packet checksums and retransmission of corrupted or missing packets.


Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Rowan Worth-2
In reply to this post by Eric Grange
On 27 June 2017 at 18:42, Eric Grange <[hidden email]> wrote:

> So while in theory all the scenarios you describe are interesting, in
> practice seeing an utf-8 BOM provides an extremely
> high likeliness that a file will indeed be utf-8. Not always, but a memory
> chip could also be hit by a cosmic ray.
>
> Conversely the absence of an utf-8 BOM means a high probability of
> "something undetermined": ANSI or BOMless utf-8,
> or something more oddball (in which I lump utf-16 btw)... and the need for
> heuristics to kick in.
>

I think we are largely in agreement here (esp. wrt utf-16 being an oddball
interchange format).

It doesn't answer my question though, i.e. what advantage the BOM tag
provides compared to assuming utf-8 from the outset. Yes, if you see a utf-8
BOM you have immediate confidence that the data is utf-8 encoded, but what
have you lost if you start with [fake] confidence and treat the data as
utf-8 until proven otherwise?

Either the data is utf-8, or ASCII, or ANSI but with no high-bit characters
and everything works, or you find an invalid byte sequence which gives you
high confidence that this is not actually utf-8 data. Granted it requires
more than three bytes of lookahead, but we're going to be using that data
anyway.

I guess the one clear advantage I see of a utf-8 BOM is that it can
simplify some code, and reduce some duplicate work when interfacing with
APIs which both require a text encoding specified up-front and don't offer
a convenient error path when decoding fails. But adding utf-8 with BOM as
yet another text encoding configuration to the landscape seems like a high
price to pay, and certainly not an overall simplification.

Outside of source code and Linux config files, BOMless utf-8 are certainly
> not the most frequent text files, ANSI and
> other various encodings dominate, because most non-ASCII text files were
> (are) produced under DOS or Windows,
> where notepad and friends use ANSI by default f.i.
>

Notepad barely counts as a text editor (newlines are always two bytes long
yeah? :P), but I take your point that ANSI is common (especially CP1251?).
I've honestly never seen a utf-8 file *with* a BOM though, so perhaps I've
lived a sheltered life.

I'm not sure what you were going for here:

> the overwhelming majority of text content are likely to involve ASCII at the
> beginning (from various markups, think html, xml, json, source code... even
> csv


Since HTML's encoding is generally specified in the HTTP header or in
<meta http-equiv> metadata.
XML's encoding must be specified on the first line (unless the default
utf-8 is used or a BOM is present).
JSON's encoding must be either utf-8, utf-16 or utf-32.
Source code encoding is generally defined by the language in question.

That may not be a desirable or happy situation, but that is the situation
> we have to deal with.
>

True, we're stuck with decisions of the past. I guess (and maybe I've
finally understood your position?) if a BOM had been mandated for _all_ utf-8
data from the outset to clearly distinguish it from pre-existing ANSI
codepages, then I could see its value. Although I remain a little repulsed
by having those three little bytes at the front of all my files to solve
what is predominantly a transport issue ;)

-Rowan

Re: UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Tim Streater-3
On 28 Jun 2017 at 14:20, Rowan Worth <[hidden email]> wrote:

> On 27 June 2017 at 18:42, Eric Grange <[hidden email]> wrote:
>
>> So while in theory all the scenarios you describe are interesting, in
>> practice seeing an utf-8 BOM provides an extremely
>> high likeliness that a file will indeed be utf-8. Not always, but a memory
>> chip could also be hit by a cosmic ray.
>>
>> Conversely the absence of an utf-8 BOM means a high probability of
>> "something undetermined": ANSI or BOMless utf-8,
>> or something more oddball (in which I lump utf-16 btw)... and the need for
>> heuristics to kick in.
>>
>
> I think we are largely in agreement here (esp. wrt utf-16 being an oddball
> interchange format).
>
> It doesn't answer my question though, ie. what advantage the BOM tag
> provides compared to assuming utf-8 from the outset. Yes if you see a utf-8
> BOM you have immediate confidence that the data is utf-8 encoded, but what
> have you lost if you start with [fake] confidence and treat the data as
> utf-8 until proven otherwise?

1) Whether the data contained in a file is to be considered UTF-8 or not is an item of metadata about the file. As such, it has no business being part of the file itself. BOMs should therefore be deprecated.

2) I may receive data as part of an email, with a header such as:

           Content-type: text/plain; charset="utf-8"
           Content-Transfer-Encoding:  base64

then I interpret that to mean that the attendant data, after decoding from base64, is to be expected to be utf-8. The sender, however, could be lying, and this needs to be considered. Just because a header, or file metadata, or indeed a BOM, says some data or other is legal utf-8, this does not mean that it actually is.


--
Cheers  --  Tim

Re: UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Eric Grange
> The sender, however, could be lying, and this needs to be considered

This is an orthogonal problem: if the sender is sending you data that is
not what it should be, then he could just as well be sending you
well-encoded and well-formed but invalid data, or malware, or
confidential/personal data you are not legally allowed to store, or, or,
or... the list never ends.

And generally speaking, if your code tries too hard to find a possible
interpretation for invalid or malformed input, then you are far more likely
to just end up with processed garbage, which will make it even harder to
figure out down the road where the garbage in your database originated from
(incorrect input? bug in the heuristics? etc.)





Re: UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Tim Streater-3
On 29 Jun 2017 at 08:01, Eric Grange <[hidden email]> wrote:

>> The sender, however, could be lying, and this needs to be considered
>
> This is an orthogonal problem: if the sender is sending you data that is
> not what it should be, then he could just as well be sending you
> well-encoded and well-formed but invalid data, or malware, or
> confidential/personal data you are not legally allowed to store, or, or,
> or... the list never ends.
>
> And generally speaking, if your code tries too hard to find a possible
> interpretation for invalid of malformed input, then you are far more likely
> to just end up with processed garbage, which will make it even harder to
> figure out down the road where the garbage in your database originated from
> (incorrect input? bug in the heuristics? etc.)

It will end up in the user's database. No heuristics are involved; I can do no more than believe what the sender tells me. The IDE I am using does at least allow me, in its base64 decode, to request lossy conversion in the case of bad input.

--
Cheers  --  Tim

Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Warren Young
In reply to this post by Keith Medcalf
On Jun 27, 2017, at 3:02 PM, Keith Medcalf <[hidden email]> wrote:
>
>> The whole point of
>> specifying a format as 7 bits is that the 8th bit is ignored, or
>> perhaps used in an implementation-defined manner, regardless of whether
>> the 8th bit in a char is available or not.
>
> ASCII was designed back in the days of low-reliability serial communications -- you know, back when data was sent using 7 data bits + 1 parity bit + 2 stop bits -- to increase the reliability of the communications.  A "byte" was also 9 bits: 8 bits of data and a parity bit.

Before roughly the mid 1970s, the size of a byte was whatever the computer or communications system designer said it was.  Even within a single computer + serial comm system, the definitions could differ.  For this reason, we also have the term “octet,” which unambiguously means an 8-bit unit of data.

The 9-bit byte is largely a DEC-ism, since their pre-PDP-11 machines used a word size that was an integer multiple of 6 or 12.  DEC had 12-bit machines, 18-bit machines, and 36-bit machines.  There was even a plan for a 24-bit design at one point.

A common example would be a Teletype Model 33 ASR hardwired by DEC for transmitting 7-bit ASCII on 8-bit wide paper tapes with mark parity, fed by a 12-bit PDP-8 pulling that text off an RK05 cartridge disk from a file encoded in a 6-bit packed ASCII format.

6-bit packed ASCII schemes were common at the time: to efficiently store plain text in the native 12-, 18-, or 36-bit words, programmers would drop most of the control characters and punctuation, as well as either dropping or shift-encoding lowercase.

That isn’t an innovation from the DEC world, either: Émile Baudot came up with basically the same idea in his eponymous 5-bit telegraph code in 1870.  You could well say that Baudot code uses 5-bit bytes.  (This is also where the data communications unit “baud” comes from.)

The 8-bit byte standard — and its even multiples — is relatively recent in computing history.  You can point to early examples like the 32-bit IBM 360 and later ones like the 16-bit Data General Nova and DEC PDP-11, but I believe it was the flood of 8-bit microcomputers in the mid to late 1970s that finally and firmly associated “byte” with “8 bits”.

> Nowadays we use 8 bits for data with no parity

True parity bits (as opposed to mark/space parity) can only detect a 1-bit error.  We dropped parity checks when the data rates rose and SNR levels fell to the point that single-bit errors were a frequent occurrence, making parity checks practically useless.

> no error correction

The wonder, to my mind, is that it’s still an argument whether to use ECC RAM in any but the lowest-end machines.  You should have the option to put ECC RAM into any machine down to about the $500 level by simply paying a ~25% premium on the option cost for non-ECC RAM, but artificial market segmentation has kept ECC a feature of the server and high-end PC worlds only.

This sort of penny-pinching should have gone out of style in the 1990s, for the same reason Ethernet and USB use smarter error correction than did RS-232.

We should have flowed from parity RAM at the high end to ECC RAM at the high end to ECC everywhere by now.

> and no timing bits.

Timing bits aren’t needed when you have clock recovery hardware, which like ECC, is a superior technology that should be universal once transistors become sufficiently cheap.

Clock recovery becomes necessary once SNR levels get to the point they are now, where separate clock lines don’t really help any more.  You’d have to apply clock recovery type techniques to the clock line if you had it, so you might as well apply it to the data and leave the clock line out.

> Cuz when things screw up we want them to REALLY screw up ... and remain undetectable.

Thus the move toward strongly checksummed filesystems like ZFS, btrfs, HAMMER, APFS, and ReFS.

Like ECC, this is a battle that should be over by now, but we’re going to see HFS+, NTFS, and extfs hang on for a long time yet because $REASONS.

Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Simon Slavin-3
A couple of minor comments.

On 29 Jun 2017, at 5:39pm, Warren Young <[hidden email]> wrote:

> Before roughly the mid 1970s, the size of a byte was whatever the computer or communications system designer said it was.

You mean the size of a word.  The word "byte" means "by eight".  It did not always mean 7 bits of data and one parity bit, but it was always 8 bits in total.

> A common example would be a Teletype Model 33 ASR hardwired by DEC for transmitting 7-bit ASCII on 8-bit wide paper tapes with mark parity

Thank you for mentioning that.  First computer terminal I ever used.  I think I still have some of the paper tape somewhere.

> The 8-bit byte standard — and its even multiples — is relatively recent in computing history.  You can point to early examples like the 32-bit IBM 360 and later ones like the 16-bit Data General Nova and DEC PDP-11, but I believe it was the flood of 8-bit microcomputers in the mid to late 1970s that finally and firmly associated “byte” with “8 bits”.

Again, the word you want is "word".  There were architectures with all sorts of weird word sizes.  "byte" always meant "by eight" and was a synonym for "octet".

As Warren wrote, words did not always encode text as 8 bits per character.  Computers with 16-bit word sizes might encode ASCII as three 5-bit characters plus a parity bit, or use two 16-bit words for five 6-bit characters plus 2 meta-bits.  With each bit of storage costing around 100,000 times what it does now, and taking 10,000 times as long to move across your communications network, there was a wide variety of ingenious ways to save a bit here and a bit there.

Simon.

Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Simon Slavin-3


On 29 Jun 2017, at 6:18pm, Simon Slavin <[hidden email]> wrote:

> Computers with 16-bit word sizes might encode ASCII as three 5-bit characters

Where I wrote "ASCII" I should have written "text".

Simon.

Re: [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

John McKown
In reply to this post by Simon Slavin-3
On Thu, Jun 29, 2017 at 12:18 PM, Simon Slavin <[hidden email]> wrote:

> A couple of minor comments.
>
> On 29 Jun 2017, at 5:39pm, Warren Young <[hidden email]> wrote:
>
> > Before roughly the mid 1970s, the size of a byte was whatever the
> computer or communications system designer said it was.
>
> You mean that size of a word.  The word "byte" means "by eight".  It did
> not always mean 7 bits of data and one parity bit, but it was always 8 bits
> in total.
>
> > A common example would be a Teletype Model 33 ASR hardwired by DEC for
> transmitting 7-bit ASCII on 8-bit wide paper tapes with mark parity
>
> Thank you for mentioning that.  First computer terminal I ever used.  I
> think I still have some of the paper tape somewhere.
>
> > The 8-bit byte standard — and its even multiples — is relatively recent
> in computing history.  You can point to early examples like the 32-bit IBM
> 360 and later ones like the 16-bit Data General Nova and DEC PDP-11, but I
> believe it was the flood of 8-bit microcomputers in the mid to late 1970s
> that finally and firmly associated “byte” with “8 bits”.
>
> Again, the word you want is "word".  There were architectures with all
> sorts of weird word sizes.  "byte" always meant "by eight" and was a
> synonym for "octet".
>
> As Warren wrote, words did not always encode text as 8 bits per
> character.  Computers with 16-bit word sizes might encode ASCII as three
> 5-bit characters plus a parity bit, or use two 16-bit words for five 6-bit
> characters plus 2 meta-bits.  With each bit of storage costing around
> 100,000 times what they do now, and taking 10,000 times the time to move
> across your communications network, there was a wide variety of ingenious
> ways to save a bit here and a bit there.
>
> Simon.
>
>
In today's world, you are completely correct. However, according to
Wikipedia (https://en.wikipedia.org/wiki/Byte_addressing), there was at
least one machine (Honeywell) which had a 36-bit word which was divided
into 9-bit "bytes" (i.e. an address pointed to a 9-bit "byte").


--
Veni, Vidi, VISA: I came, I saw, I did a little shopping.

Maranatha! <><
John McKown