SQLite and html character entities

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
19 messages Options
Reply | Threaded
Open this post in threaded view
|

SQLite and html character entities

Puneet Kishor-2
Folks,

I come to ask you a question that may be basic for many of you but is
leaving me completely bewildered. My work environment is a Mac OS X
(Tiger) computer, and I use a Cocoa-based text editor, and am writing
a Perl-based web app. Data are in several different languages,
predominantly English, but with Portuguese, Spanish, and other
languages mixed in... hence, have accent marks (diacritics).

Goal: To reliably and consistently show the retrieved data in a web
page or a web form with the correct diacritics, and when the user
edits and updates that data, reliably and consistently update the
database.

Summary of problem: Data with diacritics show up fine in web forms,
but on updating, they get clobbered with gibberish and subsequently
show up incorrectly.

So, I decided to do a little test. I created a small table, wrote a
script, and inserted a few records from the web. See the output of my
investigation below. I ask you, what is it that I have to do to
achieve my goal above? (output of test follows; I have separated
logical sections with a "-------" line, and my comments start with #)

Lucknow:~/Data/ecoservices punkish$ sqlite3 entities.sqlite
SQLite version 3.3.8
Enter ".help" for instructions
sqlite> .s
CREATE TABLE tbl (a text);
sqlite> select * from tbl;
the first record
é ç ñ î
more from 3rd row
row four
these "volunteered" activities
<á ø ã ü î & others>
-----------------------------
sqlite> .mode csv
sqlite> .output foo.csv
sqlite> select * from tbl;
sqlite> .q
Lucknow:~/Data/ecoservices punkish$ less foo.csv
"the first record"
"\351 \347 \361 \356"
"more from 3rd row"
"row four"
"these \223volunteered\224 activities"
"<\341 \370 \343 \374 \356 & others>"
foo.csv (END)
-----------------------------
sqlite> .mode html
sqlite> .output foo.html
sqlite> select * from tbl;
sqlite> .q
Lucknow:~/Data/ecoservices punkish$ less foo.html
"foo.html" may be a binary file.  See it anyway?
<TR><TD>the first record</TD>
</TR>
<TR><TD><E9> <E7> <F1> <EE></TD>
</TR>
<TR><TD>more from 3rd row</TD>
</TR>
<TR><TD>row four</TD>
</TR>
<TR><TD>these <93>volunteered<94> activities</TD>
</TR>
<TR><TD>&lt;<E1> <F8> <E3> <FC> <EE> &amp; others></TD>
</TR>
foo.html (END)
-----------------------------
# below foo.html in my Cocoa-based text editor
<TR><TD>the first record</TD>
</TR>
<TR><TD>È Á Ò Ó</TD>
</TR>
<TR><TD>more from 3rd row</TD>
</TR>
<TR><TD>row four</TD>
</TR>
<TR><TD>these ìvolunteeredî activities</TD>
</TR>
<TR><TD>&lt;· ¯ „ ¸ Ó &amp; others></TD>
</TR>
-----------------------------
# below foo.html in Safari; I added <TABLE> tags to format correctly
the first record
é ç ñ î
more from 3rd row
row four
these "volunteered" activities
<á ø ã ü î & others>

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: SQLite and html character entities

Nuno Lucas-2
You have to know the encoding of the user input. To do that, all your
html forms _MUST_ have proper <META> tags, and as you will be using
SQLite, the obvious encoding choice will be UTF-8 (because that way
you don't need to do any conversions when feeding/retrieving data
to/from SQLite).

Then there is the problem of non-compliant browsers, but that is
another history...


Best regards,
~Nuno Lucas


On 9/20/07, P Kishor <[hidden email]> wrote:

> Folks,
>
> I come to ask you a question that may be basic for many of you but is
> leaving me completely bewildered. My work environment is a Mac OS X
> (Tiger) computer, and I use a Cocoa-based text editor, and am writing
> a Perl-based web app. Data are in several different languages,
> predominantly English, but with Portuguese, Spanish, and other
> languages mixed in... hence, have accent marks (diacritics).
>
> Goal: To reliably and consistently show the retrieved data in a web
> page or a web form with the correct diacritics, and when the user
> edits and updates that data, reliably and consistently update the
> database.
>
> Summary of problem: Data with diacritics show up fine in web forms,
> but on updating, they get clobbered with gibberish and subsequently
> show up incorrectly.
>
> So, I decided to do a little test. I created a small table, wrote a
> script, and inserted a few records from the web. See the output of my
> investigation below. I ask you, what is it that I have to do to
> achieve my goal above? (output of test follows; I have separated
> logical sections with a "-------" line, and my comments start with #)
>
> Lucknow:~/Data/ecoservices punkish$ sqlite3 entities.sqlite
> SQLite version 3.3.8
> Enter ".help" for instructions
> sqlite> .s
> CREATE TABLE tbl (a text);
> sqlite> select * from tbl;
> the first record
> é ç ñ î
> more from 3rd row
> row four
> these "volunteered" activities
> <á ø ã ü î & others>
> -----------------------------
> sqlite> .mode csv
> sqlite> .output foo.csv
> sqlite> select * from tbl;
> sqlite> .q
> Lucknow:~/Data/ecoservices punkish$ less foo.csv
> "the first record"
> "\351 \347 \361 \356"
> "more from 3rd row"
> "row four"
> "these \223volunteered\224 activities"
> "<\341 \370 \343 \374 \356 & others>"
> foo.csv (END)
> -----------------------------
> sqlite> .mode html
> sqlite> .output foo.html
> sqlite> select * from tbl;
> sqlite> .q
> Lucknow:~/Data/ecoservices punkish$ less foo.html
> "foo.html" may be a binary file.  See it anyway?
> <TR><TD>the first record</TD>
> </TR>
> <TR><TD><E9> <E7> <F1> <EE></TD>
> </TR>
> <TR><TD>more from 3rd row</TD>
> </TR>
> <TR><TD>row four</TD>
> </TR>
> <TR><TD>these <93>volunteered<94> activities</TD>
> </TR>
> <TR><TD>&lt;<E1> <F8> <E3> <FC> <EE> &amp; others></TD>
> </TR>
> foo.html (END)
> -----------------------------
> # below foo.html in my Cocoa-based text editor
> <TR><TD>the first record</TD>
> </TR>
> <TR><TD>È Á Ò Ó</TD>
> </TR>
> <TR><TD>more from 3rd row</TD>
> </TR>
> <TR><TD>row four</TD>
> </TR>
> <TR><TD>these ìvolunteeredî activities</TD>
> </TR>
> <TR><TD>&lt;· ¯ „ ¸ Ó &amp; others></TD>
> </TR>
> -----------------------------
> # below foo.html in Safari; I added <TABLE> tags to format correctly
> the first record
> é ç ñ î
> more from 3rd row
> row four
> these "volunteered" activities
> <á ø ã ü î & others>
>
> -----------------------------------------------------------------------------
> To unsubscribe, send email to [hidden email]
> -----------------------------------------------------------------------------
>
>

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: SQLite and html character entities

Puneet Kishor-2
Thanks Nuno. Since I am raw in this matter, could I ask you for a
little more hand-holding as specified below --

On 9/20/07, Nuno Lucas <[hidden email]> wrote:
> You have to know the encoding of the user input. To do that, all your
> html forms _MUST_ have proper <META> tags, and as you will be using
> SQLite, the obvious encoding choice will be UTF-8 (because that way
> you don't need to do any conversions when feeding/retrieving data
> to/from SQLite).

So, what is the proper meta tag? Is the following sufficient?

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html lang="">
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title></title>
  </head>
  <body>

And, other than the above, I don't have to do anything else? Just a
straight ahead INSERT with bind vars is enough?

Many thanks in advance,

Puneet.


>
> Then there is the problem of non-compliant browsers, but that is
> another history...
>
>
> Best regards,
> ~Nuno Lucas
>
>
> On 9/20/07, P Kishor <[hidden email]> wrote:
> > Folks,
> >
> > I come to ask you a question that may be basic for many of you but is
> > leaving me completely bewildered. My work environment is a Mac OS X
> > (Tiger) computer, and I use a Cocoa-based text editor, and am writing
> > a Perl-based web app. Data are in several different languages,
> > predominantly English, but with Portuguese, Spanish, and other
> > languages mixed in... hence, have accent marks (diacritics).
> >
> > Goal: To reliably and consistently show the retrieved data in a web
> > page or a web form with the correct diacritics, and when the user
> > edits and updates that data, reliably and consistently update the
> > database.
> >
> > Summary of problem: Data with diacritics show up fine in web forms,
> > but on updating, they get clobbered with gibberish and subsequently
> > show up incorrectly.
> >
> > So, I decided to do a little test. I created a small table, wrote a
> > script, and inserted a few records from the web. See the output of my
> > investigation below. I ask you, what is it that I have to do to
> > achieve my goal above? (output of test follows; I have separated
> > logical sections with a "-------" line, and my comments start with #)
> >
> > Lucknow:~/Data/ecoservices punkish$ sqlite3 entities.sqlite
> > SQLite version 3.3.8
> > Enter ".help" for instructions
> > sqlite> .s
> > CREATE TABLE tbl (a text);
> > sqlite> select * from tbl;
> > the first record
> > é ç ñ î
> > more from 3rd row
> > row four
> > these "volunteered" activities
> > <á ø ã ü î & others>
> > -----------------------------
> > sqlite> .mode csv
> > sqlite> .output foo.csv
> > sqlite> select * from tbl;
> > sqlite> .q
> > Lucknow:~/Data/ecoservices punkish$ less foo.csv
> > "the first record"
> > "\351 \347 \361 \356"
> > "more from 3rd row"
> > "row four"
> > "these \223volunteered\224 activities"
> > "<\341 \370 \343 \374 \356 & others>"
> > foo.csv (END)
> > -----------------------------
> > sqlite> .mode html
> > sqlite> .output foo.html
> > sqlite> select * from tbl;
> > sqlite> .q
> > Lucknow:~/Data/ecoservices punkish$ less foo.html
> > "foo.html" may be a binary file.  See it anyway?
> > <TR><TD>the first record</TD>
> > </TR>
> > <TR><TD><E9> <E7> <F1> <EE></TD>
> > </TR>
> > <TR><TD>more from 3rd row</TD>
> > </TR>
> > <TR><TD>row four</TD>
> > </TR>
> > <TR><TD>these <93>volunteered<94> activities</TD>
> > </TR>
> > <TR><TD>&lt;<E1> <F8> <E3> <FC> <EE> &amp; others></TD>
> > </TR>
> > foo.html (END)
> > -----------------------------
> > # below foo.html in my Cocoa-based text editor
> > <TR><TD>the first record</TD>
> > </TR>
> > <TR><TD>È Á Ò Ó</TD>
> > </TR>
> > <TR><TD>more from 3rd row</TD>
> > </TR>
> > <TR><TD>row four</TD>
> > </TR>
> > <TR><TD>these ìvolunteeredî activities</TD>
> > </TR>
> > <TR><TD>&lt;· ¯ „ ¸ Ó &amp; others></TD>
> > </TR>
> > -----------------------------
> > # below foo.html in Safari; I added <TABLE> tags to format correctly
> > the first record
> > é ç ñ î
> > more from 3rd row
> > row four
> > these "volunteered" activities
> > <á ø ã ü î & others>
> >
> > -----------------------------------------------------------------------------
> > To unsubscribe, send email to [hidden email]
> > -----------------------------------------------------------------------------
> >
> >
>


--
Puneet Kishor
http://punkish.eidesis.org/
Nelson Institute for Environmental Studies
http://www.nelson.wisc.edu/
Open Source Geospatial Foundation (OSGeo)
http://www.osgeo.org/
Summer 2007 S&T Policy Fellow, The National Academies
http://www.nas.edu/

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: SQLite and html character entities

Clark Christensen
In reply to this post by Puneet Kishor-2
The META tag you include looks correct to me.

Does perl get the chars right after CGI decodes them?

The browser, ultimately, will escape the accented characters into urlencoded chars based on the utf-8 charset you specify in the HTML META  tag.  Then Perl (via CGI) is going to decode those back into characters, probably using the host's default charset.  Seems like there's a chance for perl to mangle the accented chars during this step.

 -Clark

----- Original Message ----
From: P Kishor <[hidden email]>
To: Nuno Lucas <[hidden email]>
Cc: [hidden email]
Sent: Thursday, September 20, 2007 6:43:37 AM
Subject: Re: [sqlite] SQLite and html character entities

Thanks Nuno. Since I am raw in this matter, could I ask you for a
little more hand-holding as specified below --

On 9/20/07, Nuno Lucas <[hidden email]> wrote:
> You have to know the encoding of the user input. To do that, all your
> html forms _MUST_ have proper <META> tags, and as you will be using
> SQLite, the obvious encoding choice will be UTF-8 (because that way
> you don't need to do any conversions when feeding/retrieving data
> to/from SQLite).

So, what is the proper meta tag? Is the following sufficient?

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html lang="">
  <head>
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
    <title></title>
  </head>
  <body>

And, other than the above, I don't have to do anything else? Just a
straight ahead INSERT with bind vars is enough?

Many thanks in advance,

Puneet.


>
> Then there is the problem of non-compliant browsers, but that is
> another history...
>
>
> Best regards,
> ~Nuno Lucas
>
>
> On 9/20/07, P Kishor <[hidden email]> wrote:
> > Folks,
> >
> > I come to ask you a question that may be basic for many of you but
 is
> > leaving me completely bewildered. My work environment is a Mac OS X
> > (Tiger) computer, and I use a Cocoa-based text editor, and am
 writing

> > a Perl-based web app. Data are in several different languages,
> > predominantly English, but with Portuguese, Spanish, and other
> > languages mixed in... hence, have accent marks (diacritics).
> >
> > Goal: To reliably and consistently show the retrieved data in a web
> > page or a web form with the correct diacritics, and when the user
> > edits and updates that data, reliably and consistently update the
> > database.
> >
> > Summary of problem: Data with diacritics show up fine in web forms,
> > but on updating, they get clobbered with gibberish and subsequently
> > show up incorrectly.
> >
> > So, I decided to do a little test. I created a small table, wrote a
> > script, and inserted a few records from the web. See the output of
 my
> > investigation below. I ask you, what is it that I have to do to
> > achieve my goal above? (output of test follows; I have separated
> > logical sections with a "-------" line, and my comments start with
 #)

> >
> > Lucknow:~/Data/ecoservices punkish$ sqlite3 entities.sqlite
> > SQLite version 3.3.8
> > Enter ".help" for instructions
> > sqlite> .s
> > CREATE TABLE tbl (a text);
> > sqlite> select * from tbl;
> > the first record
> > é ç ñ î
> > more from 3rd row
> > row four
> > these "volunteered" activities
> > <á ø ã ü î & others>
> > -----------------------------
> > sqlite> .mode csv
> > sqlite> .output foo.csv
> > sqlite> select * from tbl;
> > sqlite> .q
> > Lucknow:~/Data/ecoservices punkish$ less foo.csv
> > "the first record"
> > "\351 \347 \361 \356"
> > "more from 3rd row"
> > "row four"
> > "these \223volunteered\224 activities"
> > "<\341 \370 \343 \374 \356 & others>"
> > foo.csv (END)
> > -----------------------------
> > sqlite> .mode html
> > sqlite> .output foo.html
> > sqlite> select * from tbl;
> > sqlite> .q
> > Lucknow:~/Data/ecoservices punkish$ less foo.html
> > "foo.html" may be a binary file.  See it anyway?
> > <TR><TD>the first record</TD>
> > </TR>
> > <TR><TD><E9> <E7> <F1> <EE></TD>
> > </TR>
> > <TR><TD>more from 3rd row</TD>
> > </TR>
> > <TR><TD>row four</TD>
> > </TR>
> > <TR><TD>these <93>volunteered<94> activities</TD>
> > </TR>
> > <TR><TD>&lt;<E1> <F8> <E3> <FC> <EE> &amp; others></TD>
> > </TR>
> > foo.html (END)
> > -----------------------------
> > # below foo.html in my Cocoa-based text editor
> > <TR><TD>the first record</TD>
> > </TR>
> > <TR><TD>È Á Ò Ó</TD>
> > </TR>
> > <TR><TD>more from 3rd row</TD>
> > </TR>
> > <TR><TD>row four</TD>
> > </TR>
> > <TR><TD>these ìvolunteeredî activities</TD>
> > </TR>
> > <TR><TD>&lt;· ¯ „ ¸ Ó &amp; others></TD>
> > </TR>
> > -----------------------------
> > # below foo.html in Safari; I added <TABLE> tags to format
 correctly
> > the first record
> > é ç ñ î
> > more from 3rd row
> > row four
> > these "volunteered" activities
> > <á ø ã ü î & others>
> >
> >
 -----------------------------------------------------------------------------
> > To unsubscribe, send email to [hidden email]
> >
 -----------------------------------------------------------------------------
> >
> >
>


--
Puneet Kishor
http://punkish.eidesis.org/
Nelson Institute for Environmental Studies
http://www.nelson.wisc.edu/
Open Source Geospatial Foundation (OSGeo)
http://www.osgeo.org/
Summer 2007 S&T Policy Fellow, The National Academies
http://www.nas.edu/

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------





-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: SQLite and html character entities

Puneet Kishor-2
On 9/20/07, Clark Christensen <[hidden email]> wrote:
> The META tag you include looks correct to me.
>
> Does perl get the chars right after CGI decodes them?

This is what I am not completely sure of.

>
> The browser, ultimately, will escape the accented characters into urlencoded chars based on the utf-8 charset you specify in the HTML META  tag.  Then Perl (via CGI) is going to decode those back into characters, probably using the host's default charset.  Seems like there's a chance for perl to mangle the accented chars during this step.
>

1. The following is one scenario I have experienced:

(a) SQLite has accented data
(b) I retrieve that data and display it in web form, and it show
accented and correctly
(c) I update some text in that form (not necessarily the accented
text) and update the entire record
(d) viewing the same data shows that it has gotten clobbered.

2. I have also experienced the following scenario:

(a) My Cocoa-based text editor writes scripts in utf8. So, stuff that
shows up at accented in the text editor, shows up funky in vim
(b) the same text shows up fine in the browser both ways, to and from
the browser, without any special encoding or decoding
(c) However, if I stick some accented text in a scalar and then
display it in the browser, it shows up clobbered in the browser even
though it shows up fine in the editor (see (a) and (b) above... fine
in the editor, screwed up in vim).
(d) I know (a) is true because if I add "use utf8" to the script then
all is well.

I don't care so much about the 2nd scenario except that my application
itself is in at least 3 languages (for now). That is, the interface
text, not just the data from SQLite, is also in Spanish and
Portuguese. Well, I have created separated templates and used properly
escaped entities in the interface text (&eacute; for é and so on) so
the interface text is fine and reliable. I just want to make sure that
nothing is clobbered in the data during its roundtrip from and to the
database, and assuming that the users will update the text with
correct accents instead of with escaped entities (that is, the users
will type é instead of &eacute;).

Puneet.


>
> ----- Original Message ----
> From: P Kishor <[hidden email]>
> To: Nuno Lucas <[hidden email]>
> Cc: [hidden email]
> Sent: Thursday, September 20, 2007 6:43:37 AM
> Subject: Re: [sqlite] SQLite and html character entities
>
> Thanks Nuno. Since I am raw in this matter, could I ask you for a
> little more hand-holding as specified below --
>
> On 9/20/07, Nuno Lucas <[hidden email]> wrote:
> > You have to know the encoding of the user input. To do that, all your
> > html forms _MUST_ have proper <META> tags, and as you will be using
> > SQLite, the obvious encoding choice will be UTF-8 (because that way
> > you don't need to do any conversions when feeding/retrieving data
> > to/from SQLite).
>
> So, what is the proper meta tag? Is the following sufficient?
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
>     "http://www.w3.org/TR/html4/loose.dtd">
> <html lang="">
>   <head>
>     <meta http-equiv="content-type" content="text/html; charset=utf-8">
>     <title></title>
>   </head>
>   <body>
>
> And, other than the above, I don't have to do anything else? Just a
> straight ahead INSERT with bind vars is enough?
>
> Many thanks in advance,
>
> Puneet.
>
>
> >
> > Then there is the problem of non-compliant browsers, but that is
> > another history...
> >
> >
> > Best regards,
> > ~Nuno Lucas
> >
> >
> > On 9/20/07, P Kishor <[hidden email]> wrote:
> > > Folks,
> > >
> > > I come to ask you a question that may be basic for many of you but
>  is
> > > leaving me completely bewildered. My work environment is a Mac OS X
> > > (Tiger) computer, and I use a Cocoa-based text editor, and am
>  writing
> > > a Perl-based web app. Data are in several different languages,
> > > predominantly English, but with Portuguese, Spanish, and other
> > > languages mixed in... hence, have accent marks (diacritics).
> > >
> > > Goal: To reliably and consistently show the retrieved data in a web
> > > page or a web form with the correct diacritics, and when the user
> > > edits and updates that data, reliably and consistently update the
> > > database.
> > >
> > > Summary of problem: Data with diacritics show up fine in web forms,
> > > but on updating, they get clobbered with gibberish and subsequently
> > > show up incorrectly.
> > >
> > > So, I decided to do a little test. I created a small table, wrote a
> > > script, and inserted a few records from the web. See the output of
>  my
> > > investigation below. I ask you, what is it that I have to do to
> > > achieve my goal above? (output of test follows; I have separated
> > > logical sections with a "-------" line, and my comments start with
>  #)
> > >
> > > Lucknow:~/Data/ecoservices punkish$ sqlite3 entities.sqlite
> > > SQLite version 3.3.8
> > > Enter ".help" for instructions
> > > sqlite> .s
> > > CREATE TABLE tbl (a text);
> > > sqlite> select * from tbl;
> > > the first record
> > > é ç ñ î
> > > more from 3rd row
> > > row four
> > > these "volunteered" activities
> > > <á ø ã ü î & others>
> > > -----------------------------
> > > sqlite> .mode csv
> > > sqlite> .output foo.csv
> > > sqlite> select * from tbl;
> > > sqlite> .q
> > > Lucknow:~/Data/ecoservices punkish$ less foo.csv
> > > "the first record"
> > > "\351 \347 \361 \356"
> > > "more from 3rd row"
> > > "row four"
> > > "these \223volunteered\224 activities"
> > > "<\341 \370 \343 \374 \356 & others>"
> > > foo.csv (END)
> > > -----------------------------
> > > sqlite> .mode html
> > > sqlite> .output foo.html
> > > sqlite> select * from tbl;
> > > sqlite> .q
> > > Lucknow:~/Data/ecoservices punkish$ less foo.html
> > > "foo.html" may be a binary file.  See it anyway?
> > > <TR><TD>the first record</TD>
> > > </TR>
> > > <TR><TD><E9> <E7> <F1> <EE></TD>
> > > </TR>
> > > <TR><TD>more from 3rd row</TD>
> > > </TR>
> > > <TR><TD>row four</TD>
> > > </TR>
> > > <TR><TD>these <93>volunteered<94> activities</TD>
> > > </TR>
> > > <TR><TD><<E1> <F8> <E3> <FC> <EE> & others></TD>
> > > </TR>
> > > foo.html (END)
> > > -----------------------------
> > > # below foo.html in my Cocoa-based text editor
> > > <TR><TD>the first record</TD>
> > > </TR>
> > > <TR><TD>È Á Ò Ó</TD>
> > > </TR>
> > > <TR><TD>more from 3rd row</TD>
> > > </TR>
> > > <TR><TD>row four</TD>
> > > </TR>
> > > <TR><TD>these ìvolunteeredî activities</TD>
> > > </TR>
> > > <TR><TD><· ¯ „ ¸ Ó & others></TD>
> > > </TR>
> > > -----------------------------
> > > # below foo.html in Safari; I added <TABLE> tags to format
>  correctly
> > > the first record
> > > é ç ñ î
> > > more from 3rd row
> > > row four
> > > these "volunteered" activities
> > > <á ø ã ü î & others>
> > >
> > >
>  -----------------------------------------------------------------------------
> > > To unsubscribe, send email to [hidden email]
> > >
>  -----------------------------------------------------------------------------
> > >
> > >
> >
>
>
> --
> Puneet Kishor
> http://punkish.eidesis.org/
> Nelson Institute for Environmental Studies
> http://www.nelson.wisc.edu/
> Open Source Geospatial Foundation (OSGeo)
> http://www.osgeo.org/
> Summer 2007 S&T Policy Fellow, The National Academies
> http://www.nas.edu/
>
> -----------------------------------------------------------------------------
> To unsubscribe, send email to [hidden email]
> -----------------------------------------------------------------------------
>
>
>
>
>
> -----------------------------------------------------------------------------
> To unsubscribe, send email to [hidden email]
> -----------------------------------------------------------------------------
>
>


--
Puneet Kishor
http://punkish.eidesis.org/
Nelson Institute for Environmental Studies
http://www.nelson.wisc.edu/
Open Source Geospatial Foundation (OSGeo)
http://www.osgeo.org/
Summer 2007 S&T Policy Fellow, The National Academies
http://www.nas.edu/

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: SQLite and html character entities

Clark Christensen
In reply to this post by Puneet Kishor-2
It's a nightmare.  I can think of some ugly client-side scripting things to investigate, but I think you'd be on the way down a bunny hole :-))

If you submit the data, and do nothing but print it back to the browser, does it display correctly?

If you submit, store, retrieve and send it back to the browser, does it display correctly?

 -Clark


----- Original Message ----
From: P Kishor <[hidden email]>
To: [hidden email]
Sent: Thursday, September 20, 2007 9:12:30 AM
Subject: Re: [sqlite] SQLite and html character entities

On 9/20/07, Clark Christensen <[hidden email]> wrote:
> The META tag you include looks correct to me.
>
> Does perl get the chars right after CGI decodes them?

This is what I am not completely sure of.

>
> The browser, ultimately, will escape the accented characters into
 urlencoded chars based on the utf-8 charset you specify in the HTML META
  tag.  Then Perl (via CGI) is going to decode those back into
 characters, probably using the host's default charset.  Seems like there's a
 chance for perl to mangle the accented chars during this step.
>

1. The following is one scenario I have experienced:

(a) SQLite has accented data
(b) I retrieve that data and display it in web form, and it show
accented and correctly
(c) I update some text in that form (not necessarily the accented
text) and update the entire record
(d) viewing the same data shows that it has gotten clobbered.

2. I have also experienced the following scenario:

(a) My Cocoa-based text editor writes scripts in utf8. So, stuff that
shows up at accented in the text editor, shows up funky in vim
(b) the same text shows up fine in the browser both ways, to and from
the browser, without any special encoding or decoding
(c) However, if I stick some accented text in a scalar and then
display it in the browser, it shows up clobbered in the browser even
though it shows up fine in the editor (see (a) and (b) above... fine
in the editor, screwed up in vim).
(d) I know (a) is true because if I add "use utf8" to the script then
all is well.

I don't care so much about the 2nd scenario except that my application
itself is in at least 3 languages (for now). That is, the interface
text, not just the data from SQLite, is also in Spanish and
Portuguese. Well, I have created separated templates and used properly
escaped entities in the interface text (&eacute; for é and so on) so
the interface text is fine and reliable. I just want to make sure that
nothing is clobbered in the data during its roundtrip from and to the
database, and assuming that the users will update the text with
correct accents instead of with escaped entities (that is, the users
will type é instead of &eacute;).

Puneet.


>
> ----- Original Message ----
> From: P Kishor <[hidden email]>
> To: Nuno Lucas <[hidden email]>
> Cc: [hidden email]
> Sent: Thursday, September 20, 2007 6:43:37 AM
> Subject: Re: [sqlite] SQLite and html character entities
>
> Thanks Nuno. Since I am raw in this matter, could I ask you for a
> little more hand-holding as specified below --
>
> On 9/20/07, Nuno Lucas <[hidden email]> wrote:
> > You have to know the encoding of the user input. To do that, all
 your

> > html forms _MUST_ have proper <META> tags, and as you will be using
> > SQLite, the obvious encoding choice will be UTF-8 (because that way
> > you don't need to do any conversions when feeding/retrieving data
> > to/from SQLite).
>
> So, what is the proper meta tag? Is the following sufficient?
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
>     "http://www.w3.org/TR/html4/loose.dtd">
> <html lang="">
>   <head>
>     <meta http-equiv="content-type" content="text/html;
 charset=utf-8">

>     <title></title>
>   </head>
>   <body>
>
> And, other than the above, I don't have to do anything else? Just a
> straight ahead INSERT with bind vars is enough?
>
> Many thanks in advance,
>
> Puneet.
>
>
> >
> > Then there is the problem of non-compliant browsers, but that is
> > another history...
> >
> >
> > Best regards,
> > ~Nuno Lucas
> >
> >
> > On 9/20/07, P Kishor <[hidden email]> wrote:
> > > Folks,
> > >
> > > I come to ask you a question that may be basic for many of you
 but
>  is
> > > leaving me completely bewildered. My work environment is a Mac OS
 X
> > > (Tiger) computer, and I use a Cocoa-based text editor, and am
>  writing
> > > a Perl-based web app. Data are in several different languages,
> > > predominantly English, but with Portuguese, Spanish, and other
> > > languages mixed in... hence, have accent marks (diacritics).
> > >
> > > Goal: To reliably and consistently show the retrieved data in a
 web
> > > page or a web form with the correct diacritics, and when the user
> > > edits and updates that data, reliably and consistently update the
> > > database.
> > >
> > > Summary of problem: Data with diacritics show up fine in web
 forms,
> > > but on updating, they get clobbered with gibberish and
 subsequently
> > > show up incorrectly.
> > >
> > > So, I decided to do a little test. I created a small table, wrote
 a
> > > script, and inserted a few records from the web. See the output
 of
>  my
> > > investigation below. I ask you, what is it that I have to do to
> > > achieve my goal above? (output of test follows; I have separated
> > > logical sections with a "-------" line, and my comments start
 with

>  #)
> > >
> > > Lucknow:~/Data/ecoservices punkish$ sqlite3 entities.sqlite
> > > SQLite version 3.3.8
> > > Enter ".help" for instructions
> > > sqlite> .s
> > > CREATE TABLE tbl (a text);
> > > sqlite> select * from tbl;
> > > the first record
> > > é ç ñ î
> > > more from 3rd row
> > > row four
> > > these "volunteered" activities
> > > <á ø ã ü î & others>
> > > -----------------------------
> > > sqlite> .mode csv
> > > sqlite> .output foo.csv
> > > sqlite> select * from tbl;
> > > sqlite> .q
> > > Lucknow:~/Data/ecoservices punkish$ less foo.csv
> > > "the first record"
> > > "\351 \347 \361 \356"
> > > "more from 3rd row"
> > > "row four"
> > > "these \223volunteered\224 activities"
> > > "<\341 \370 \343 \374 \356 & others>"
> > > foo.csv (END)
> > > -----------------------------
> > > sqlite> .mode html
> > > sqlite> .output foo.html
> > > sqlite> select * from tbl;
> > > sqlite> .q
> > > Lucknow:~/Data/ecoservices punkish$ less foo.html
> > > "foo.html" may be a binary file.  See it anyway?
> > > <TR><TD>the first record</TD>
> > > </TR>
> > > <TR><TD><E9> <E7> <F1> <EE></TD>
> > > </TR>
> > > <TR><TD>more from 3rd row</TD>
> > > </TR>
> > > <TR><TD>row four</TD>
> > > </TR>
> > > <TR><TD>these <93>volunteered<94> activities</TD>
> > > </TR>
> > > <TR><TD><<E1> <F8> <E3> <FC> <EE> & others></TD>
> > > </TR>
> > > foo.html (END)
> > > -----------------------------
> > > # below foo.html in my Cocoa-based text editor
> > > <TR><TD>the first record</TD>
> > > </TR>
> > > <TR><TD>È Á Ò Ó</TD>
> > > </TR>
> > > <TR><TD>more from 3rd row</TD>
> > > </TR>
> > > <TR><TD>row four</TD>
> > > </TR>
> > > <TR><TD>these ìvolunteeredî activities</TD>
> > > </TR>
> > > <TR><TD><· ¯ „ ¸ Ó & others></TD>
> > > </TR>
> > > -----------------------------
> > > # below foo.html in Safari; I added <TABLE> tags to format
>  correctly
> > > the first record
> > > é ç ñ î
> > > more from 3rd row
> > > row four
> > > these "volunteered" activities
> > > <á ø ã ü î & others>
> > >
> > >
>
  -----------------------------------------------------------------------------
> > > To unsubscribe, send email to [hidden email]
> > >
>
  -----------------------------------------------------------------------------

> > >
> > >
> >
>
>
> --
> Puneet Kishor
> http://punkish.eidesis.org/
> Nelson Institute for Environmental Studies
> http://www.nelson.wisc.edu/
> Open Source Geospatial Foundation (OSGeo)
> http://www.osgeo.org/
> Summer 2007 S&T Policy Fellow, The National Academies
> http://www.nas.edu/
>
>
 -----------------------------------------------------------------------------
> To unsubscribe, send email to [hidden email]
>
 -----------------------------------------------------------------------------
>
>
>
>
>
>
 -----------------------------------------------------------------------------
> To unsubscribe, send email to [hidden email]
>
 -----------------------------------------------------------------------------
>
>


--
Puneet Kishor
http://punkish.eidesis.org/
Nelson Institute for Environmental Studies
http://www.nelson.wisc.edu/
Open Source Geospatial Foundation (OSGeo)
http://www.osgeo.org/
Summer 2007 S&T Policy Fellow, The National Academies
http://www.nas.edu/

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------





-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: SQLite and html character entities

Trevor Talbot-2
In reply to this post by Puneet Kishor-2
On 9/20/07, P Kishor <[hidden email]> wrote:

> sqlite> select * from tbl;
> the first record
> é ç ñ î
> more from 3rd row
> row four
> these "volunteered" activities
> <á ø ã ü î & others>
> -----------------------------
> sqlite> .mode csv
> sqlite> .output foo.csv
> sqlite> select * from tbl;
> sqlite> .q
> Lucknow:~/Data/ecoservices punkish$ less foo.csv
> "the first record"
> "\351 \347 \361 \356"
> "more from 3rd row"
> "row four"
> "these \223volunteered\224 activities"
> "<\341 \370 \343 \374 \356 & others>"
> foo.csv (END)
> -----------------------------

Note that this is *not* UTF-8.  If you're still using this as test
data, you need to get rid of it and use UTF-8 encoded data instead.

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: SQLite and html character entities

Puneet Kishor-2
On 9/20/07, Trevor Talbot <[hidden email]> wrote:

> On 9/20/07, P Kishor <[hidden email]> wrote:
>
> > sqlite> select * from tbl;
> > the first record
> > é ç ñ î
> > more from 3rd row
> > row four
> > these "volunteered" activities
> > <á ø ã ü î & others>
> > -----------------------------
> > sqlite> .mode csv
> > sqlite> .output foo.csv
> > sqlite> select * from tbl;
> > sqlite> .q
> > Lucknow:~/Data/ecoservices punkish$ less foo.csv
> > "the first record"
> > "\351 \347 \361 \356"
> > "more from 3rd row"
> > "row four"
> > "these \223volunteered\224 activities"
> > "<\341 \370 \343 \374 \356 & others>"
> > foo.csv (END)
> > -----------------------------
>
> Note that this is *not* UTF-8.  If you're still using this as test
> data, you need to get rid of it and use UTF-8 encoded data instead.
>

this is where I lost you... when you say "this" is not UTF8, what is
"this"? All I want is that I want (1) the user to be able to type ç in
the web form, and (2) I want to be able to save ç in the db. (3) Then
when I look at that data, either on the command line, but definitely
back on the web, I want it to appear as ç. (4) If I export it, I
should still be able to see it as ç and not something else.

Seems like I was able to do 1, 2, and 3 with my test case, but not 4
(I got \347 instead ç).

Also, in my production case, 1,2, and 3 are not very reliable. Are you
saying my data above are not UTF8? If so, I would like to know how you
can tell that, so I can recognize it in the future myself. Also, I
would like to know how I can do what you are suggesting I should do,
that is, how can I ensure that I "use UTF8 encoded data"?

Many thanks,


--
Puneet Kishor
http://punkish.eidesis.org/
Nelson Institute for Environmental Studies
http://www.nelson.wisc.edu/
Open Source Geospatial Foundation (OSGeo)
http://www.osgeo.org/
Summer 2007 S&T Policy Fellow, The National Academies
http://www.nas.edu/

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: SQLite and html character entities

Trevor Talbot-2
On 9/20/07, P Kishor <[hidden email]> wrote:
> On 9/20/07, Trevor Talbot <[hidden email]> wrote:
> > On 9/20/07, P Kishor <[hidden email]> wrote:

> > > Lucknow:~/Data/ecoservices punkish$ less foo.csv
> > > "the first record"
> > > "\351 \347 \361 \356"
> > > "more from 3rd row"
> > > "row four"
> > > "these \223volunteered\224 activities"
> > > "<\341 \370 \343 \374 \356 & others>"
> > > foo.csv (END)
> > > -----------------------------
> >
> > Note that this is *not* UTF-8.  If you're still using this as test
> > data, you need to get rid of it and use UTF-8 encoded data instead.

> this is where I lost you... when you say "this" is not UTF8, what is
> "this"?

The data in the file shown by less, and since sqlite3 exported that
data exactly as it was stored, the data in the db as well.

> All I want is that I want (1) the user to be able to type ç in
> the web form, and (2) I want to be able to save ç in the db. (3) Then
> when I look at that data, either on the command line, but definitely
> back on the web, I want it to appear as ç. (4) If I export it, I
> should still be able to see it as ç and not something else.
>
> Seems like I was able to do 1, 2, and 3 with my test case, but not 4
> (I got \347 instead ç).
>
> Also, in my production case, 1,2, and 3 are not very reliable. Are you
> saying my data above are not UTF8? If so, I would like to know how you
> can tell that, so I can recognize it in the future myself. Also, I
> would like to know how I can do what you are suggesting I should do,
> that is, how can I ensure that I "use UTF8 encoded data"?

Okay, first a quick primer on character sets and encodings.  A byte
can hold one of 256 different values (0-255), and most processing
tends to happen on bytes, so it makes sense that individual characters
should be stored as individual bytes.

First we have US ASCII, the character encoding standard that defines
128 characters, including the basic english alphabet, numbers, and
some punctuation (www.asciitable.com).  However, this obviously
doesn't cover all the symbols in common use, or characters from other
languages, so more definitions are needed.  Given that a byte supports
twice as many values (ASCII takes up only half), that leaves 128
values for other purposes.  Many other character sets keep the bottom
half as ASCII, and assign different characters to the top 128 values.
The ISO-8859 family of standards works this way.

ISO-8859-1 is also known as Latin-1, and is most common for languages
that use characters similar to English, Spanish, etc.  It adds a few
more symbols (copyright, paragraph, etc) and some common characters
with diacritical marks (like é ç ñ î).  The data you posted above was
entered into your database using this encoding (or Windows-1252, which
is identical except for adding some characters in places 8859-1 does
not use).

ISO-8859-2 is also known as Latin-2, and covers another set of
European languages (such as Romanian).  It contains a different set of
symbols and characters with diacritical marks needed for these
languages, characters that don't fit in 8859-1.

It keeps going, of course (Wikipedia has info:
http://en.wikipedia.org/wiki/Category:ISO_8859).  There are many other
encodings that work this way, and collectively they're known as
single-byte encodings: they all represent a character as a single
byte, but the actual meaning of that byte depends in the character set
in use.

This situation is ripe for confusion, since interpreting a sequence of
bytes as being in a different encoding than it was stored in will lead
to strange results.  This is exactly what you saw in your Cocoa
editor, since it defaulted to using the classic MacRoman encoding,
which uses those same byte values to store uppercase characters
instead.

It gets worse: there are multi-byte encodings too.  You typically see
these in the East Asian languages, since they don't use the same
alphabetic writing system, and instead have thousands of characters to
encode.  A byte only supports a mere 256 values, so more than one byte
is needed to represent a single character.

By now you can see how this can spiral into an unmaintainable mess:
you have to worry about this encoding and that encoding and you can
store the encoding with the text but what do you do if someone
requests data in another encoding and what if they are using a
specific encoding but that text only contains ASCII characters and
therefore everyone should see it anyway and how do you tell the
difference and *brain asplode*

Enter Unicode, which has the goal of putting all the world's commonly
used language characters and symbols into one single character set.
By using Unicode, you don't have to worry about which character set
your data is in, and you can move on to other more interesting issues.
 Of course, it's a very large character set, supporting just over 1
million characters.  Obviously these don't all fit in one byte, so
there are also several standard encodings.  UTF-8 is one of those, and
the most common one for Internet use.

UTF-8 was designed to be ASCII-friendly: the first 128 byte values
(0-127) are identical to ASCII.  All Unicode characters beyond those
first 128 are represented using 2 or more bytes in a row, with each
byte having a value in the range 128-255.

The tool you used to look at the CSV file you wrote out, less, is
designed for ASCII use.  When it encounters a byte it can't print as
an ASCII character, it displays the value using an octal escape
sequence instead (\351 etc).  (Many other unixy tools accept such
escape sequences as input, so even if you can't print or type such
characters you can still work with them using your keyboard.)

With the above two items in mind, that's how I knew it wasn't UTF-8
right away: you showed 4 characters that were not in the ASCII range.
In UTF-8, they would require at least 2 bytes each for a total of 8+
bytes, but the output from less showed only 4 bytes.  Therefore, it's
some single-byte encoding, and not UTF-8.  (Familiarity with the
issues and some staring at code charts on Wikipedia let me reach the
conclusions about ISO-8859-1/Win-1252 and MacRoman, but you don't need
to go that far just to check for UTF-8.)

Okay, so now you want to make sure you always use UTF-8.  I'll defer
to others for the web situation, as I simply don't have enough
experience there.  They're probably right that at this point you need
to look at the Perl CGI side of things.

As for working with the database directly, probably the easiest thing
you can do is change the encoding Terminal.app uses.  I'm on Panther
and don't have a Tiger machine to test with, but as far as I'm aware
it works the same way.  Under the Terminal -> Window Settings menu,
Display section, you can set the encoding to UTF-8.  This will cause
it to translate byte sequences on both input and display.

With that change made, try starting sqlite3 and inserting those
characters above directly on the command line.  Do the CSV output
again, then you can repeat the viewing with less and your Cocoa editor
to see if you get different results.  There's another easy check you
can do though, with the file utility:

    file foo.csv

It should say something like "foo.csv: UTF-8 Unicode text" if those
characters were output as UTF-8, and "foo.csv: ISO-8859 text" for the
file you originally made.

Once you have some sample data stored in sqlite that way, you'll be
able to test the storage and display parts of your web app separately,
which should help narrow down problems you encounter there.

There are also some third-party GUI database management tools; perhaps
someone here can recommend a good one for OS X that supports UTF-8?
I've yet to try any myself.

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: SQLite and html character entities

Trevor Talbot-2
I should also mention, a text editor I like to use is SubEthaEdit (I
have the old 2.2 version), and it supports switching encodings via the
Format menu.  If you're switching to find the encoding of an existing
file, just choose Reinterpret when it asks.

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: SQLite and html character entities

Clark Christensen
In reply to this post by Puneet Kishor-2
Wow!  Excellent summary, Trevor.

----- Original Message ----
From: Trevor Talbot <[hidden email]>
To: [hidden email]
Sent: Thursday, September 20, 2007 11:35:42 PM
Subject: Re: [sqlite] SQLite and html character entities

On 9/20/07, P Kishor <[hidden email]> wrote:
> On 9/20/07, Trevor Talbot <[hidden email]> wrote:
> > On 9/20/07, P Kishor <[hidden email]> wrote:

> > > Lucknow:~/Data/ecoservices punkish$ less foo.csv
> > > "the first record"
> > > "\351 \347 \361 \356"
> > > "more from 3rd row"
> > > "row four"
> > > "these \223volunteered\224 activities"
> > > "<\341 \370 \343 \374 \356 & others>"
> > > foo.csv (END)
> > > -----------------------------
> >
> > Note that this is *not* UTF-8.  If you're still using this as test
> > data, you need to get rid of it and use UTF-8 encoded data instead.

> this is where I lost you... when you say "this" is not UTF8, what is
> "this"?

The data in the file shown by less, and since sqlite3 exported that
data exactly as it was stored, the data in the db as well.

> All I want is that I want (1) the user to be able to type ç in
> the web form, and (2) I want to be able to save ç in the db. (3)
 Then
> when I look at that data, either on the command line, but definitely
> back on the web, I want it to appear as ç. (4) If I export it, I
> should still be able to see it as ç and not something else.
>
> Seems like I was able to do 1, 2, and 3 with my test case, but not 4
> (I got \347 instead ç).
>
> Also, in my production case, 1,2, and 3 are not very reliable. Are
 you
> saying my data above are not UTF8? If so, I would like to know how
 you
> can tell that, so I can recognize it in the future myself. Also, I
> would like to know how I can do what you are suggesting I should do,
> that is, how can I ensure that I "use UTF8 encoded data"?

Okay, first a quick primer on character sets and encodings.  A byte
can hold one of 256 different values (0-255), and most processing
tends to happen on bytes, so it makes sense that individual characters
should be stored as individual bytes.

First we have US ASCII, the character encoding standard that defines
128 characters, including the basic english alphabet, numbers, and
some punctuation (www.asciitable.com).  However, this obviously
doesn't cover all the symbols in common use, or characters from other
languages, so more definitions are needed.  Given that a byte supports
twice as many values (ASCII takes up only half), that leaves 128
values for other purposes.  Many other character sets keep the bottom
half as ASCII, and assign different characters to the top 128 values.
The ISO-8859 family of standards works this way.

ISO-8859-1 is also known as Latin-1, and is most common for languages
that use characters similar to English, Spanish, etc.  It adds a few
more symbols (copyright, paragraph, etc) and some common characters
with diacritical marks (like é ç ñ î).  The data you posted above
 was
entered into your database using this encoding (or Windows-1252, which
is identical except for adding some characters in places 8859-1 does
not use).

ISO-8859-2 is also known as Latin-2, and covers another set of
European languages (such as Romanian).  It contains a different set of
symbols and characters with diacritical marks needed for these
languages, characters that don't fit in 8859-1.

It keeps going, of course (Wikipedia has info:
http://en.wikipedia.org/wiki/Category:ISO_8859).  There are many other
encodings that work this way, and collectively they're known as
single-byte encodings: they all represent a character as a single
byte, but the actual meaning of that byte depends in the character set
in use.

This situation is ripe for confusion, since interpreting a sequence of
bytes as being in a different encoding than it was stored in will lead
to strange results.  This is exactly what you saw in your Cocoa
editor, since it defaulted to using the classic MacRoman encoding,
which uses those same byte values to store uppercase characters
instead.

It gets worse: there are multi-byte encodings too.  You typically see
these in the East Asian languages, since they don't use the same
alphabetic writing system, and instead have thousands of characters to
encode.  A byte only supports a mere 256 values, so more than one byte
is needed to represent a single character.

By now you can see how this can spiral into an unmaintainable mess:
you have to worry about this encoding and that encoding and you can
store the encoding with the text but what do you do if someone
requests data in another encoding and what if they are using a
specific encoding but that text only contains ASCII characters and
therefore everyone should see it anyway and how do you tell the
difference and *brain asplode*

Enter Unicode, which has the goal of putting all the world's commonly
used language characters and symbols into one single character set.
By using Unicode, you don't have to worry about which character set
your data is in, and you can move on to other more interesting issues.
 Of course, it's a very large character set, supporting just over 1
million characters.  Obviously these don't all fit in one byte, so
there are also several standard encodings.  UTF-8 is one of those, and
the most common one for Internet use.

UTF-8 was designed to be ASCII-friendly: the first 128 byte values
(0-127) are identical to ASCII.  All Unicode characters beyond those
first 128 are represented using 2 or more bytes in a row, with each
byte having a value in the range 128-255.

The tool you used to look at the CSV file you wrote out, less, is
designed for ASCII use.  When it encounters a byte it can't print as
an ASCII character, it displays the value using an octal escape
sequence instead (\351 etc).  (Many other unixy tools accept such
escape sequences as input, so even if you can't print or type such
characters you can still work with them using your keyboard.)

With the above two items in mind, that's how I knew it wasn't UTF-8
right away: you showed 4 characters that were not in the ASCII range.
In UTF-8, they would require at least 2 bytes each for a total of 8+
bytes, but the output from less showed only 4 bytes.  Therefore, it's
some single-byte encoding, and not UTF-8.  (Familiarity with the
issues and some staring at code charts on Wikipedia let me reach the
conclusions about ISO-8859-1/Win-1252 and MacRoman, but you don't need
to go that far just to check for UTF-8.)

Okay, so now you want to make sure you always use UTF-8.  I'll defer
to others for the web situation, as I simply don't have enough
experience there.  They're probably right that at this point you need
to look at the Perl CGI side of things.

As for working with the database directly, probably the easiest thing
you can do is change the encoding Terminal.app uses.  I'm on Panther
and don't have a Tiger machine to test with, but as far as I'm aware
it works the same way.  Under the Terminal -> Window Settings menu,
Display section, you can set the encoding to UTF-8.  This will cause
it to translate byte sequences on both input and display.

With that change made, try starting sqlite3 and inserting those
characters above directly on the command line.  Do the CSV output
again, then you can repeat the viewing with less and your Cocoa editor
to see if you get different results.  There's another easy check you
can do though, with the file utility:

    file foo.csv

It should say something like "foo.csv: UTF-8 Unicode text" if those
characters were output as UTF-8, and "foo.csv: ISO-8859 text" for the
file you originally made.

Once you have some sample data stored in sqlite that way, you'll be
able to test the storage and display parts of your web app separately,
which should help narrow down problems you encounter there.

There are also some third-party GUI database management tools; perhaps
someone here can recommend a good one for OS X that supports UTF-8?
I've yet to try any myself.

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------





-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

test failures on cygwin

Evans, Mark (Tandem)
Hi all,

I have been lurking on the message board and am in awe of the collective
wisdom.

I'm just getting my feet wet learning the internals of SQLite, drinking
from the proverbial firehose.  I am using cygwin 1.90 as my learning
platform and I have built SQLite 3.5.0.   I ran 'make test' (quick.test
suite) and it reports some test failures as follows:

41 errors out of 28604 tests
Failures on these tests: autoinc-6.1 bind-3.1 cast-3.1 cast-3.2 cast-3.4
cast-3.5 cast-3.6 cast-3.8 cast-3.11 cast-3.12 cast-3.14 cast-3.15
cast-3.16 cast-3.18 cast-3.21 cast-3.22 cast-3.24 expr-1.102 expr-1.106
func-18.14 func-18.16 func-18.17 func-18.31 lastinsert-8.1
lastinsert-9.1 misc1-9.1 misc2-4.1 misc2-4.2 misc2-4.4 misc2-4.5
misc2-4.6 misc3-3.6 misc3-3.7 misc3-3.8 misc3-3.9 misc3-3.10 misc3-3.11
misc5-2.2 shared-1.11.9 shared-2.11.9 types-2.1.8

Should I be surprised by these failures?  

Taking the first one that failed, autoinc-6.1, it reported:

autoinc-6.1...
Expected: [9223372036854775807]
     Got: [-1]

The corresonding test code is:

ifcapable {!rowid32} {
  do_test autoinc-6.1 {
    execsql {
      CREATE TABLE t6(v INTEGER PRIMARY KEY AUTOINCREMENT, w);
      INSERT INTO t6 VALUES(9223372036854775807,1);
      SELECT seq FROM main.sqlite_sequence WHERE name='t6';
    }
  } 9223372036854775807
}

So the SELECT clause returns -1 instead of the expected
0x7FFFFFFFFFFFFFFF (MAXLONGLONG), or in other words MAXLONGLONG + 1.
Should the value returned by SELECT seq ... be the the key value (rowid)
of the last insert, or, is seq supposed to represent the next rowid
(autoincrement of last key)?

Regards,
Mark


-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

RE: test failures on cygwin

Evans, Mark (Tandem)
When I  run just autoinc.test alone,  I get the following summary
output:

54 errors out of 66 tests
Failures on these tests: autoinc-6.1 autoinc-1.1 autoinc-1.2 autoinc-1.3
autoinc-1.4 autoinc-1.6 autoinc-2.1 autoinc-2.2 autoinc-2.3 autoinc-2.4
autoinc-2.5 autoinc-2.6 autoinc-2.7 autoinc-2.8 autoinc-2.9 autoinc-2.10
autoinc-2.11 autoinc-2.12 autoinc-2.13 autoinc-2.21 autoinc-2.23
autoinc-2.25 autoinc-2.27 autoinc-2.29 autoinc-2.51 autoinc-2.52
autoinc-2.53 autoinc-2.54 autoinc-2.55 autoinc-2.70 autoinc-2.71
autoinc-2.72 autoinc-2.73 autoinc-2.74 autoinc-3.1 autoinc-3.2
autoinc-3.3 autoinc-3.4 autoinc-4.1 autoinc-4.2 autoinc-4.3 autoinc-4.4
autoinc-4.4.1 autoinc-4.5 autoinc-4.6 autoinc-4.7 autoinc-4.8
autoinc-4.9 autoinc-4.10 autoinc-5.1 autoinc-5.2 autoinc-5.3 autoinc-5.4
autoinc-7.1
******************************************************************
N.B.:  The version of TCL that you used to build this test harness
is defective in that it does not support 64-bit integers.  Some or
all of the test failures above might be a result from this defect
in your TCL build.
******************************************************************
All memory allocations freed - no leaks
Maximum memory usage: 76060 bytes


So it looks like I have a cygwin TCL issue.  Is this fixable?
 
Mark

> -----Original Message-----
> From: Evans, Mark (Tandem)
> Sent: Friday, September 21, 2007 5:47 PM
> To: [hidden email]
> Subject: [sqlite] test failures on cygwin
>
> Hi all,
>
> I have been lurking on the message board and am in awe of the
> collective wisdom.
>
> I'm just getting my feet wet learning the internals of
> SQLite, drinking from the proverbial firehose.  I am using
> cygwin 1.90 as my learning
> platform and I have built SQLite 3.5.0.   I ran 'make test'
> (quick.test
> suite) and it reports some test failures as follows:
>
> 41 errors out of 28604 tests
> Failures on these tests: autoinc-6.1 bind-3.1 cast-3.1
> cast-3.2 cast-3.4
> cast-3.5 cast-3.6 cast-3.8 cast-3.11 cast-3.12 cast-3.14 cast-3.15
> cast-3.16 cast-3.18 cast-3.21 cast-3.22 cast-3.24 expr-1.102
> expr-1.106
> func-18.14 func-18.16 func-18.17 func-18.31 lastinsert-8.1
> lastinsert-9.1 misc1-9.1 misc2-4.1 misc2-4.2 misc2-4.4 misc2-4.5
> misc2-4.6 misc3-3.6 misc3-3.7 misc3-3.8 misc3-3.9 misc3-3.10
> misc3-3.11
> misc5-2.2 shared-1.11.9 shared-2.11.9 types-2.1.8
>
> Should I be surprised by these failures?  
>
> Taking the first one that failed, autoinc-6.1, it reported:
>
> autoinc-6.1...
> Expected: [9223372036854775807]
>      Got: [-1]
>
> The corresonding test code is:
>
> ifcapable {!rowid32} {
>   do_test autoinc-6.1 {
>     execsql {
>       CREATE TABLE t6(v INTEGER PRIMARY KEY AUTOINCREMENT, w);
>       INSERT INTO t6 VALUES(9223372036854775807,1);
>       SELECT seq FROM main.sqlite_sequence WHERE name='t6';
>     }
>   } 9223372036854775807
> }
>
> So the SELECT clause returns -1 instead of the expected
> 0x7FFFFFFFFFFFFFFF (MAXLONGLONG), or in other words MAXLONGLONG + 1.
> Should the value returned by SELECT seq ... be the the key
> value (rowid) of the last insert, or, is seq supposed to
> represent the next rowid (autoincrement of last key)?
>
> Regards,
> Mark
>
>
> --------------------------------------------------------------
> ---------------
> To unsubscribe, send email to [hidden email]
> --------------------------------------------------------------
> ---------------
>
>

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: test failures on cygwin

D. Richard Hipp
In reply to this post by Evans, Mark (Tandem)
"Evans, Mark (Tandem)" <[hidden email]> wrote:

> Hi all,
>
> I have been lurking on the message board and am in awe of the collective
> wisdom.
>
> I'm just getting my feet wet learning the internals of SQLite, drinking
> from the proverbial firehose.  I am using cygwin 1.90 as my learning
> platform and I have built SQLite 3.5.0.   I ran 'make test' (quick.test
> suite) and it reports some test failures as follows:
>
> 41 errors out of 28604 tests
> Failures on these tests: autoinc-6.1 bind-3.1 cast-3.1 cast-3.2 cast-3.4
> cast-3.5 cast-3.6 cast-3.8 cast-3.11 cast-3.12 cast-3.14 cast-3.15
> cast-3.16 cast-3.18 cast-3.21 cast-3.22 cast-3.24 expr-1.102 expr-1.106
> func-18.14 func-18.16 func-18.17 func-18.31 lastinsert-8.1
> lastinsert-9.1 misc1-9.1 misc2-4.1 misc2-4.2 misc2-4.4 misc2-4.5
> misc2-4.6 misc3-3.6 misc3-3.7 misc3-3.8 misc3-3.9 misc3-3.10 misc3-3.11
> misc5-2.2 shared-1.11.9 shared-2.11.9 types-2.1.8
>
> Should I be surprised by these failures?  
>
> Taking the first one that failed, autoinc-6.1, it reported:
>
> autoinc-6.1...
> Expected: [9223372036854775807]
>      Got: [-1]
>
> The corresonding test code is:
>
> ifcapable {!rowid32} {
>   do_test autoinc-6.1 {
>     execsql {
>       CREATE TABLE t6(v INTEGER PRIMARY KEY AUTOINCREMENT, w);
>       INSERT INTO t6 VALUES(9223372036854775807,1);
>       SELECT seq FROM main.sqlite_sequence WHERE name='t6';
>     }
>   } 9223372036854775807
> }
>
> So the SELECT clause returns -1 instead of the expected
> 0x7FFFFFFFFFFFFFFF (MAXLONGLONG), or in other words MAXLONGLONG + 1.
> Should the value returned by SELECT seq ... be the the key value (rowid)
> of the last insert, or, is seq supposed to represent the next rowid
> (autoincrement of last key)?
>

41 failures out of 28604 isn't bad.  That's a 0.1% failure rate.  If
these are the only problems you are having, your build is probably OK.

If you want to go for 100% pass, the first place I would look is what
version of TCL you are linking against to build the test harness.
You want a later version of 8.4 or any version of 8.5.  I'll bet you
have 8.3, which won't work here.

The other thing you might try to do is *not* use cygwin but instead
use mingw.

Of course, the easiest option by far is to use a Linux box or a Mac. :-)

--
D. Richard Hipp <[hidden email]>



-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: test failures on cygwin

D. Richard Hipp
In reply to this post by Evans, Mark (Tandem)
"Evans, Mark (Tandem)" <[hidden email]> wrote:
> ******************************************************************
> N.B.:  The version of TCL that you used to build this test harness
> is defective in that it does not support 64-bit integers.  Some or
> all of the test failures above might be a result from this defect
> in your TCL build.
> ******************************************************************

Yeah.  That's what I figured.  You need to fix you TCL build.

It's *still* not to late to switch to Linux or Mac  :-)

--
D. Richard Hipp <[hidden email]>


-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: test failures on cygwin

Gerry Snyder-4
In reply to this post by Evans, Mark (Tandem)
Evans, Mark (Tandem) wrote:
> ....
>
>
> So it looks like I have a cygwin TCL issue.  Is this fixable?
You can load Active State Tcl and use that instead of the version that
comes with Cygwin.

You get lots of extra goodies, in addition to an up-to-date Tcl core.

HTH,

Gerry

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: test failures on cygwin

jreidthompson
In reply to this post by D. Richard Hipp

> If you want to go for 100% pass, the first place I would look is what
> version of TCL you are linking against to build the test harness.
> You want a later version of 8.4 or any version of 8.5.  I'll bet you
> have 8.3, which won't work here.
>
> The other thing you might try to do is *not* use cygwin but instead
> use mingw.
>
> Of course, the easiest option by far is to use a Linux box or a Mac. :-)
>
> --
> D. Richard Hipp <[hidden email]>
>
>
>
> -----------------------------------------------------------------------------
> To unsubscribe, send email to [hidden email]
> -----------------------------------------------------------------------------

Looks like the current version available via cygwin setup is 8.4

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

RE: test failures on cygwin

Evans, Mark (Tandem)
In reply to this post by D. Richard Hipp
It's hard to drag my Linux server to Starbucks.  :-)  

Next Windows laptop, though, will definitely have to have a
Linux/Windows dual personality.  I'll have to wait until the next
mega-merger for a Mac.

Mark

> -----Original Message-----
> From: [hidden email] [mailto:[hidden email]]
> Sent: Friday, September 21, 2007 6:05 PM
> To: [hidden email]
> Subject: Re: [sqlite] test failures on cygwin
>
<snip>
>
> Of course, the easiest option by far is to use a Linux box or
> a Mac. :-)
>
> --
> D. Richard Hipp <[hidden email]>
>

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------

Reply | Threaded
Open this post in threaded view
|

Re: test failures on cygwin

inq1ltd
On Saturday 22 September 2007 14:20, Evans, Mark (Tandem) wrote:
> It's hard to drag my Linux server to Starbucks.  :-)
>
> Next Windows laptop, though, will definitely have to have a
> Linux/Windows dual personality.  I'll have to wait until the next
> mega-merger for a Mac.
>
> Mark
>

If your  HD has the space load Suse Linux from Novel and run Windos or
Linux.  You can down load for free Novel's latest version.

jim-on-linux
http://www.inqvista.com





> > -----Original Message-----
> > From: [hidden email] [mailto:[hidden email]]
> > Sent: Friday, September 21, 2007 6:05 PM
> > To: [hidden email]
> > Subject: Re: [sqlite] test failures on cygwin
>
> <snip>
>
> > Of course, the easiest option by far is to use a Linux box or
> > a Mac. :-)
> >
> > --
> > D. Richard Hipp <[hidden email]>
>
> -------------------------------------------------------------------
>---------- To unsubscribe, send email to
> [hidden email]
> -------------------------------------------------------------------
>----------

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------