papercut wish list : a PRAGMA encoding='UTF-8-SIG'

papercut wish list : a PRAGMA encoding='UTF-8-SIG'

big stone
Hello,

As a Windows user, I would like sqlite.exe to support the encoding
'UTF-8-SIG' for files.

'UTF-8-SIG' = normal 'UTF-8' file, but starting with a Byte-Order-Mark.

(see http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8)

For Windows users, it would be a truly appreciated improvement.

Indeed :

- if I try to read a utf-8-sig file generated by Excel or another tool, my
header gets corrupted because this feature is missing,

- if I want to avoid this, I have to use utf-16, and my intermediate file
balloons in size over the network.
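
The two complaints above can be reproduced outside sqlite.exe. What follows is only a minimal sketch in Python, not a description of how sqlite.exe imports files: the column names and the sample row are made up, and `first_header` is a hypothetical helper. It decodes the same BOM-prefixed bytes once as plain UTF-8 and once with Python's `utf-8-sig` codec, then compares the size of the UTF-8 and UTF-16 encodings.

```python
import csv
import io

# Bytes as a spreadsheet export might write them: UTF-8 with a leading BOM.
# The column names and values are invented for illustration.
text = "id,name\n1,café\n"
data = text.encode("utf-8-sig")


def first_header(raw: bytes, encoding: str) -> str:
    """Decode raw CSV bytes and return the first column name."""
    reader = csv.reader(io.StringIO(raw.decode(encoding), newline=""))
    return next(reader)[0]


plain = first_header(data, "utf-8")       # BOM survives as U+FEFF in the header
signed = first_header(data, "utf-8-sig")  # the codec drops the leading BOM

print(repr(plain), repr(signed))
# Size comparison: UTF-8 with BOM versus UTF-16 with BOM.
print(len(data), len(text.encode("utf-16")))
```

With the plain `utf-8` codec the first column name carries an invisible U+FEFF prefix, so a lookup by the expected name fails, which is the kind of header damage described above.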

Regards,
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: papercut wish list : a PRAGMA encoding='UTF-8-SIG'

big stone
... unless I'm fooled by the sqlite.exe dos output and it's working already
as I hope.

(Oops)

Re: papercut wish list : a PRAGMA encoding='UTF-8-SIG'

Scott Robison-2
On Wed, Sep 3, 2014 at 2:03 PM, big stone <[hidden email]> wrote:

> ... unless I'm fooled by the sqlite.exe dos output and it's working already
> as I hope.
>
> (Oops)
>

The pragma encoding only controls the storage format of text strings; I
don't believe it has any impact on imported files.
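
A small sketch of that distinction, using Python's standard `sqlite3` module rather than sqlite.exe (the table name `t` and the sample string are made up): the pragma is set on a fresh in-memory database, and the text read back through the API is the same string regardless of the storage encoding.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# PRAGMA encoding only takes effect before the database is created,
# i.e. before the first table is written.
con.execute("PRAGMA encoding = 'UTF-16le'")
con.execute("CREATE TABLE t (s TEXT)")
con.execute("INSERT INTO t VALUES (?)", ("café",))

# Ask SQLite which storage encoding the database ended up with.
encoding = con.execute("PRAGMA encoding").fetchone()[0]

# The value comes back as an ordinary Python str; the storage encoding
# is an internal detail and says nothing about how input files are read.
value = con.execute("SELECT s FROM t").fetchone()[0]

print(encoding, value)
con.close()
```

Nothing in this sketch touches file import, which is consistent with the point that the pragma concerns storage only.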

--
Scott Robison

Re: papercut wish list : a PRAGMA encoding='UTF-8-SIG'

Stephan Beal-3
On Wed, Sep 3, 2014 at 9:56 PM, big stone <[hidden email]> wrote:

> As a Windows user, I would like sqlite.exe to support the encoding
> 'UTF-8-SIG' for files.
>
> 'UTF-8-SIG' = normal 'UTF-8' file, but starting with a Byte-Order-Mark.
>
> (see http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8)
>

If you'll read the second paragraph:

The Unicode Standard permits the BOM in UTF-8
<http://en.wikipedia.org/wiki/UTF-8>,[2]
<http://en.wikipedia.org/wiki/Byte_order_mark#cite_note-2> but does not
require or recommend its use.[3]
<http://en.wikipedia.org/wiki/Byte_order_mark#cite_note-3>

(though that should be "nor" instead of "or")



> - if I try to read a utf-8-sig file generated by Excel or another tool, my
> header gets corrupted because this feature is missing,
>

That's a bug. A BOM is senseless in UTF-8 because UTF-8 has no byte-ordering
issues. Much software chokes on a BOM (e.g. I've seen PHP-based sites go
down because a dev's editor inserted one and it got deployed).

--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf

Re: papercut wish list : a PRAGMA encoding='UTF-8-SIG'

jose isaias cabrera

"Stephan Beal" wrote...

> On Wed, Sep 3, 2014 at 9:56 PM, big stone <[hidden email]> wrote:
>
> issues. Much software chokes on a BOM (e.g. I've seen PHP-based sites go
> down because a dev's editor inserted one and it got deployed).

PHP should handle the encoding whether or not it has the BOM.

josé


Re: papercut wish list : a PRAGMA encoding='UTF-8-SIG'

Stephan Beal-3
On Wed, Sep 3, 2014 at 11:13 PM, jose isaias cabrera <
[hidden email]> wrote:

>
> "Stephan Beal" wrote...
>
>  On Wed, Sep 3, 2014 at 9:56 PM, big stone <[hidden email]> wrote:
>>
>> issues. Much software chokes on a BOM (e.g. I've seen PHP-based sites go
>> down because a dev's editor inserted one and it got deployed).
>>
>
> PHP should handle the encoding whether or not it has the BOM.
>

As "should" Excel!

Unlike Excel, with PHP the fix is easy - remove the BOM, which is simple
once you have a program which lets you know it's there (it's hidden in many
editors).
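
As a rough illustration of "remove the BOM", here is a minimal Python sketch. The helper name `strip_utf8_bom` and the sample bytes are hypothetical; the only fact relied on is that a UTF-8 BOM is the three bytes EF BB BF, which Python exposes as `codecs.BOM_UTF8`.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# A made-up script as an editor might have saved it, BOM included.
sample = codecs.BOM_UTF8 + b"<?php echo 'hi';"
cleaned = strip_utf8_bom(sample)

print(sample[:3].hex(" "))  # the three signature bytes
print(cleaned)
```

The function is idempotent, so running it over files that never had a BOM leaves them unchanged.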

My point is only this: adding a BOM is not a viable solution. It's a
deprecated/discouraged worst practice because so many tools don't deal well
with it.

--
----- stephan beal

Re: papercut wish list : a PRAGMA encoding='UTF-8-SIG'

Scott Robison-2
On Wed, Sep 3, 2014 at 3:20 PM, Stephan Beal <[hidden email]> wrote:

> On Wed, Sep 3, 2014 at 11:13 PM, jose isaias cabrera <
> [hidden email]> wrote:
> > PHP should handle the encoding whether or not it has the BOM.
> >
>
> As "should" Excel!
>
> Unlike Excel, with PHP the fix is easy - remove the BOM, which is simple
> once you have a program which lets you know it's there (it's hidden in many
> editors).
>
> My point is only - adding a BOM is not a viable solution: it's a
> deprecated/discouraged/worst-practice because so many tools don't deal well
> with them.
>

The problem is that the recommendations have varied over time (do use it,
don't use it). Win32 was supporting good old 2-byte-only Unicode back
before there was a UTF-8 (or indeed before there was a UTF-16). The name
"byte order mark" is a misnomer (though mostly accurate). The BOM is really
a "zero width no-break space" (U+FEFF), which is "harmless" at the beginning
of a file and thus useful (if used) as a signature to identify the encoding
of a file when other metadata / encoding information is not available. But
POSIX systems "wimped out" and basically never embraced the Unicode
specification as originally written, writing their own encoding, a File
System Safe Unicode Transformation Format (FSS-UTF), which later became
UTF-8.

Don't get me wrong, I like UTF-8, preferring it to UTF-16. But the
rationale for using ZWNBSP as a signature holds for all UTF encodings
when parties agree to use it as such. When the only Unicode standard was
UCS-2, it was only useful as a signature to determine the byte order of the
originating system. After UTF-8 was introduced, it became useful in its own
right to identify a byte-oriented encoding. Then when UCS-2 was effectively
dropped and UTF-16 replaced it (so that the code point space could be
extended from one 16-bit plane to seventeen 16-bit planes), and UTF-32 was
introduced, it became even more useful.
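
To make the "signature" idea concrete, here is a minimal Python sketch that encodes U+FEFF in each UTF form and sniffs a leading signature from raw bytes. The function name `sniff_signature` and the `SIGNATURES` table are hypothetical; the byte sequences come from the constants in Python's `codecs` module.

```python
import codecs

# Order matters: the UTF-32 LE signature (FF FE 00 00) begins with the
# UTF-16 LE signature (FF FE), so the longer one must be tested first.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_signature(raw: bytes):
    """Return (encoding, signature_length) if raw starts with a BOM, else None."""
    for bom, name in SIGNATURES:
        if raw.startswith(bom):
            return name, len(bom)
    return None


# The same code point, U+FEFF, serialized in each encoding form.
for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(enc, "\ufeff".encode(enc).hex(" "))
```

This is still a guess in one corner case: UTF-16 LE text whose first character after the signature is U+0000 produces the same leading bytes as UTF-32 LE, which is one reason external metadata remains the more reliable option.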

In any case: If we can't get all operating systems to agree what code
sequence marks the end of a line of text (CR only, LF only, CR LF,
something else entirely), I don't expect we'll get agreement on the
"proper" way to construct Unicode centric text files any time soon. Both
approaches have pros and cons, though I would maintain that external
metadata that unambiguously identifies the text encoding is by far the best
option, far preferable to guessing, no matter how high the confidence
factor is that the guess is correct.

--
Scott Robison