Encoding question

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Encoding question

phaworth
I'm new to the unicode world so this question may not make sense.

The "PRAGMA encoding" statement tells me the encoding of a database. Can I
rely on all data in the database having that encoding? For example, if the
encoding is UTF8 and a row is inserted containing UTF16 encoded data, will
it still end up as UTF8 data in the database?

Thanks,
Pete
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Encoding question

R Smith

On 2015/02/02 01:12, Peter Haworth wrote:
> I'm new to the unicode world so this question may not make sense.
>
> The "PRAGMA encoding" statement tells me the encoding of a database. Can I
> rely on all data in the database having that encoding? For example, if the
> encoding is UTF8 and a row is inserted containing UTF16 encoded data, will
> it still end up as UTF8 data in the database?

Not exactly sure if I understand what you are asking, so this answer may make even less sense!

If you are asking whether or not you can rely on the data which you do not control, inside a DB set to UTF-8, to always be UTF-8,
then the simple answer is: Obviously not, it holds whatever the user (or program using the DB) stores in there however he/she/it
stores it. What the encoding means simply is that the database engine (SQLite in this case) will treat the data in the database as
if it is UTF-8 data. It will store, compare, collate, order, extract and do all the other things with the expectation that the data
will conform to the UTF-8 standard and as such return to you (the user) valid UTF-8 based answers.

None of this prevents you from sticking a BLOB or some UTF-16 or indeed any other Unicode text in there but you might find the
storing and returning of the values and query answers are not exactly what you expected with the format being somewhat off.

In short, the UTF-8 Pragma settings /allows/ your data to be interpreted as such. It doesn't /force/ it, nor magically /converts/
the data into UTF-8, and it most certainly does not under any circumstances *guarantee* the UTF-8-ness of data. (Though it does
guarantee that /IF/ you put valid UTF-8 data in there, it will be handled and returned correctly).


_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Encoding question

Stephan Beal-3
On Mon, Feb 2, 2015 at 12:49 AM, RSmith <[hidden email]> wrote:

> guarantee that /IF/ you put valid UTF-8 data in there, it will be handled
> and returned correctly).
>

For a given definition of "correct." A relatively common topic on this list
is the handling of locale-specific collations (a topic i'm not qualified to
comment on), with "ICU extension" being a common part of any answers.

Google says:

http://www.sqlite.org/src/info/d9fbbad0c2f647c3fdf715fc9fd64af53aedfc43

--
----- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Encoding question

Richard Hipp-3
In reply to this post by phaworth
On 2/1/15, Peter Haworth <[hidden email]> wrote:
> I'm new to the unicode world so this question may not make sense.
>
> The "PRAGMA encoding" statement tells me the encoding of a database. Can I
> rely on all data in the database having that encoding? For example, if the
> encoding is UTF8 and a row is inserted containing UTF16 encoded data, will
> it still end up as UTF8 data in the database?
>

If the database encoding is UTF8 and you insert data using
sqlite3_bind_text16() or a similar function, then SQLite automatically
converts UTF16 to UTF8.  Likewise, if you extract data using
sqlite3_column_text16() or a similar routine, then the UTF8 to UTF16
conversion is automatic.

If you run an INSERT statement that has the data as part of the INSERT
statement itself (not normally recommended due to the risk of SQL
injection if you fail to properly escape the data) then any necessary
conversions are also automatic.

--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Encoding question

phaworth
In reply to this post by phaworth
On Mon, Feb 2, 2015 at 9:00 AM, <[hidden email]> wrote:

> From: RSmith <[hidden email]>
> To: [hidden email]
> Subject: Re: [sqlite] Encoding question
> Message-ID: <[hidden email]>
> Content-Type: text/plain; charset=windows-1252; format=flowed
>
> In short, the UTF-8 Pragma settings /allows/ your data to be interpreted
> as such. It doesn't /force/ it, nor magically /converts/
> the data into UTF-8, and it most certainly does not under any
> circumstances *guarantee* the UTF-8-ness of data. (Though it does
> guarantee that /IF/ you put valid UTF-8 data in there, it will be handled
> and returned correctly).
>

Thanks for this and the other responses, makes sense.  I suppose it's
similar to putting non-integer data into an INTEGER column.

This is in the context of an SQLite utility I sell which I'm trying to make
unicode compatible so I have no control over the data in the database, just
have to interpret it the best I can.  I've seen that there are algorithms
out there that will detect different encodings but it seems that the
algorithms are not 100% reliable.

I should also have mentioned that the question also included table names,
column names, constraint names, etc, but I'll assume the same applies to
them as for the data.

Pete
lcSQL Software <http://www.lcsql.com>
Home of lcStackBrowser <http://www.lcsql.com/lcstackbrowser.html> and
SQLiteAdmin <http://www.lcsql.com/sqliteadmin.html>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Encoding question

Richard Hipp-3
On 2/2/15, Peter Haworth <[hidden email]> wrote:

> I should also have mentioned that the question also included table names,
> column names, constraint names, etc, but I'll assume the same applies to
> them as for the data.
>

The schema is *always* in the database encoding.  The SQLite parser
wouldn't be able to understand it otherwise, and would have rejected
it, so it never would have made it into the sqlite_master table.

--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Encoding question

R Smith
In reply to this post by phaworth

On 2015/02/02 19:37, Peter Haworth wrote:

> On Mon, Feb 2, 2015 at 9:00 AM, <[hidden email]> wrote:
>
>> From: RSmith <[hidden email]>
>> To: [hidden email]
>> Subject: Re: [sqlite] Encoding question
>> Message-ID: <[hidden email]>
>> Content-Type: text/plain; charset=windows-1252; format=flowed
>>
>> In short, the UTF-8 Pragma settings /allows/ your data to be interpreted
>> as such. It doesn't /force/ it, nor magically /converts/
>> the data into UTF-8, and it most certainly does not under any
>> circumstances *guarantee* the UTF-8-ness of data. (Though it does
>> guarantee that /IF/ you put valid UTF-8 data in there, it will be handled
>> and returned correctly).
>>
> Thanks for this and the other responses, makes sense.  I suppose it's
> similar to putting non-integer data into an INTEGER column.
>
> This is in the context of an SQLite utility I sell which I'm trying to make
> unicode compatible so I have no control over the data in the database, just
> have to interpret it the best I can.  I've seen that there are algorithms
> out there that will detect different encodings but it seems that the
> algorithms are not 100% reliable.
>
> I should also have mentioned that the question also included table names,
> column names, constraint names, etc, but I'll assume the same applies to
> them as for the data.

Good news here is that if you do set the DB to be UTF-8 compatible (with the discussed Pragma) all your table names, column names
and database objects in general are very much UTF-8 enabled. You can name a table in Hebrew or Chinese without any issues and fill
them with UTF8 data (as long as your program takes care of adding the data in correct UTF8, it willl get it back in correct UTF8).

Proof of concept - here is a script I made quickly adding some poetry from different nations to a table called Des Garçons (Boys)
with two columns having names in Braille and Russian with an Index in Chinese, and querying it using other UTF-8 SQL statements. You
should be able to copy-paste this and run it through another SQLite engine on any UTF8 enabled DB:

(I hope the mail forum reproduces this right)


   -- Processing Script for File: E:\Documents\SQLiteScripts\UTF8_Test.sql
   -- Script Items: 7          Parameter Count: 0
   -- 2015-02-02 20:48:22.804  |  [Success]    Script Started...
   -- ================================================================================================

DROP TABLE IF EXISTS "Garçons";
CREATE TABLE "Garçons" (
   "ID" INTEGER PRIMARY KEY,
   "⠝⠙⠊⠞" TEXT,
   "пустынных" TEXT
);

CREATE INDEX "我能吞下" ON "Garçons" ("ID","⠝⠙⠊⠞");

INSERT INTO "Garçons" VALUES
  (1,'From the Anglo-Saxon Rune Poem (Rune version):',
'ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ
ᛋᚳᛖᚪᛚ᛫ᚦᛖᚪᚻ᛫ᛗᚪᚾᚾᚪ᛫ᚷᛖᚻᚹᛦᛚᚳ᛫ᛗᛁᚳᛚᚢᚾ᛫ᚻᛦᛏ᛫ᛞᚫᛚᚪᚾ
ᚷᛁᚠ᛫ᚻᛖ᛫ᚹᛁᛚᛖ᛫ᚠᚩᚱ᛫ᛞᚱᛁᚻᛏᚾᛖ᛫ᛞᚩᛗᛖᛋ᛫ᚻᛚᛇᛏᚪᚾ᛬')

,(2,'From Laȝamon''s Brut (The Chronicles of England, Middle English, West Midlands): ',
'An preost wes on leoden, Laȝamon was ihoten
He wes Leovenaðes sone -- liðe him be Drihten.
He wonede at Ernleȝe at æðelen are chirechen,
Uppen Sevarne staþe, sel þar him þuhte,
Onfest Radestone, þer he bock radde.')

,(3,'From the Tagelied of Wolfram von Eschenbach (Middle High German): ',
'Sîne klâwen durh die wolken sint geslagen,
er stîget ûf mit grôzer kraft,
ich sih in grâwen tägelîch als er wil tagen,
den tac, der im geselleschaft
erwenden wil, dem werden man,
den ich mit sorgen în verliez.
ich bringe in hinnen, ob ich kan.
sîn vil manegiu tugent michz leisten hiez.');


SELECT ID,"⠝⠙⠊⠞",Char(13)||"пустынных"
FROM "Garçons" WHERE "⠝⠙⠊⠞"<>'';

   -- ID    ⠝⠙⠊⠞    Char(13)||"пустынных"
   -- --    ----    ---------------------
   -- 1    From the Anglo-Saxon Rune Poem (Rune version):
ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ
ᛋᚳᛖᚪᛚ᛫ᚦᛖᚪᚻ᛫ᛗᚪᚾᚾᚪ᛫ᚷᛖᚻᚹᛦᛚᚳ᛫ᛗᛁᚳᛚᚢᚾ᛫ᚻᛦᛏ᛫ᛞᚫᛚᚪᚾ
ᚷᛁᚠ᛫ᚻᛖ᛫ᚹᛁᛚᛖ᛫ᚠᚩᚱ᛫ᛞᚱᛁᚻᛏᚾᛖ᛫ᛞᚩᛗᛖᛋ᛫ᚻᛚᛇᛏᚪᚾ᛬

   -- 2    From Laȝamon's Brut (The Chronicles of England, Middle English, West Midlands):
An preost wes on leoden, Laȝamon was ihoten
He wes Leovenaðes sone -- liðe him be Drihten.
He wonede at Ernleȝe at æðelen are chirechen,
Uppen Sevarne staþe, sel þar him þuhte,
Onfest Radestone, þer he bock radde.

   -- 3    From the Tagelied of Wolfram von Eschenbach (Middle High German):
Sîne klâwen durh die wolken sint geslagen,
er stîget ûf mit grôzer kraft,
ich sih in grâwen tägelîch als er wil tagen,
den tac, der im geselleschaft
erwenden wil, dem werden man,
den ich mit sorgen în verliez.
ich bringe in hinnen, ob ich kan.
sîn vil manegiu tugent michz leisten hiez.

   --    Item Stats:  Item No:           5             Query Size (Chars):  74
   --                 Result Columns:    3             Result Rows:         3
   --                 VM Work Steps:     35            Rows Modified:       0
   --                 Full Query Time:   0d 00h 00m and 00.001s
   --                 Query Result:      Success.
   -- ------------------------------------------------------------------------------------------------


SELECT "пустынных"
FROM "Garçons" WHERE "пустынных" LIKE '%ᚠᚱᚩᚠᚢᚱ%';

   -- пустынных
   -- ---------
   -- ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ
ᛋᚳᛖᚪᛚ᛫ᚦᛖᚪᚻ᛫ᛗᚪᚾᚾᚪ᛫ᚷᛖᚻᚹᛦᛚᚳ᛫ᛗᛁᚳᛚᚢᚾ᛫ᚻᛦᛏ᛫ᛞᚫᛚᚪᚾ
ᚷᛁᚠ᛫ᚻᛖ᛫ᚹᛁᛚᛖ᛫ᚠᚩᚱ᛫ᛞᚱᛁᚻᛏᚾᛖ᛫ᛞᚩᛗᛖᛋ᛫ᚻᛚᛇᛏᚪᚾ᛬

   --    Item Stats:  Item No:           6             Query Size (Chars):  71
   --                 Result Columns:    1             Result Rows:         1
   --                 VM Work Steps:     23            Rows Modified:       0
   --                 Full Query Time:   -- --- --- --- --.----
   --                 Query Result:      Success.
   -- ------------------------------------------------------------------------------------------------


SELECT "пустынных"
FROM "Garçons" WHERE "пустынных" LIKE '%tägelîch%';

   -- пустынных
   -- ---------
   -- Sîne klâwen durh die wolken sint geslagen,
er stîget ûf mit grôzer kraft,
ich sih in grâwen tägelîch als er wil tagen,
den tac, der im geselleschaft
erwenden wil, dem werden man,
den ich mit sorgen în verliez.
ich bringe in hinnen, ob ich kan.
sîn vil manegiu tugent michz leisten hiez.

   --    Item Stats:  Item No:           7             Query Size (Chars):  73
   --                 Result Columns:    1             Result Rows:         1
   --                 VM Work Steps:     23            Rows Modified:       0
   --                 Full Query Time:   -- --- --- --- --.----
   --                 Query Result:      Success.
   -- ------------------------------------------------------------------------------------------------

   --   Script Stats: Total Script Execution Time:     0d 00h 00m and 00.044s
   --                 Total Script Query Time:         0d 00h 00m and 00.002s
   --                 Total Database Rows Changed:     3
   --                 Total Virtual-Machine Steps:     261
   --                 Last executed Item Index:        7
   --                 Last Script Error:
   -- ------------------------------------------------------------------------------------------------

   -- 2015-02-02 20:48:22.824  |  [Success]    Script Success.
   -- 2015-02-02 20:48:22.825  |  [Success]    Transaction Rolled back.
   -- -------  DB-Engine Logs (Contains logged information from all DB connections during run)  ------
   -- [2015-02-02 20:48:22.759] APPLICATION : Script E:\Documents\SQLiteScripts\UTF8_Test.sql started with Initialization at
20:48:22.759 on 02 February.
   -- ================================================================================================

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users