printf() problem padding multi-byte UTF-8 code points

Ralf Junker
Example SQL:

select
   length(printf ('%4s', 'abc')),
   length(printf ('%4s', 'äöü')),
   length(printf ('%-4s', 'abc')),
   length(printf ('%-4s', 'äöü'))

Output is 4, 3, 4, 3. Padding seems to take into account UTF-8 bytes
instead of UTF-8 code points.

Should padding not work on code points and output 4 in all cases as
requested?
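
To make the byte versus code point distinction concrete, here is a minimal C sketch. It is not SQLite's code: `utf8_codepoints`, `pad_by_bytes` and `pad_by_codepoints` are hypothetical helpers, and the input is assumed to be well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated string,
 * assuming well-formed UTF-8. Continuation bytes look like 10xxxxxx,
 * so every other byte starts a new code point. */
size_t utf8_codepoints(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) n++;
    }
    return n;
}

/* Spaces that a byte-counting "%<width>s" adds: the width is compared
 * with strlen(s), i.e. with the number of bytes. */
size_t pad_by_bytes(const char *s, size_t width) {
    size_t bytes = strlen(s);
    return width > bytes ? width - bytes : 0;
}

/* Spaces that a code-point-counting pad would add instead. */
size_t pad_by_codepoints(const char *s, size_t width) {
    size_t points = utf8_codepoints(s);
    return width > points ? width - points : 0;
}
```

For 'äöü' (6 bytes, 3 code points) and a width of 4, the byte-based rule adds no space at all, while the code-point rule would add one.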

Ralf
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: printf() problem padding multi-byte UTF-8 code points

Richard Hipp-3
On 2/17/18, Ralf Junker <[hidden email]> wrote:

> Example SQL:
>
> select
>    length(printf ('%4s', 'abc')),
>    length(printf ('%4s', 'äöü')),
>    length(printf ('%-4s', 'abc')),
>    length(printf ('%-4s', 'äöü'))
>
> Output is 4, 3, 4, 3. Padding seems to take into account UTF-8 bytes
> instead of UTF-8 code points.
>
> Should padding not work on code points and output 4 in all cases as
> requested?

The current behavior of the printf() function in SQLite, goofy though
it may be, exactly mirrors the behavior of the printf() C function in
the standard library in this regard.

So I'm not sure whether or not this is something that ought to be "fixed".
--
D. Richard Hipp
[hidden email]

Re: printf() problem padding multi-byte UTF-8 code points

J Decker
On Sat, Feb 17, 2018 at 3:36 PM, Richard Hipp <[hidden email]> wrote:

> On 2/17/18, Ralf Junker <[hidden email]> wrote:
> > Example SQL:
> >
> > select
> >    length(printf ('%4s', 'abc')),
> >    length(printf ('%4s', 'äöü')),
> >    length(printf ('%-4s', 'abc')),
> >    length(printf ('%-4s', 'äöü'))
> >
> > Output is 4, 3, 4, 3. Padding seems to take into account UTF-8 bytes
> > instead of UTF-8 code points.
> >
> > Should padding not work on code points and output 4 in all cases as
> > requested?
>
> The current behavior of the printf() function in SQLite, goofy though
> it may be, exactly mirrors the behavior of the printf() C function in
> the standard library in this regard.
>
> So I'm not sure whether or not this is something that ought to be "fixed".
>
The length() SQL function and the other character functions (rtrim/ltrim)
attempt to deal with code points, not bytes.

Maybe add a function, something like `u8length( string, count )`, which
returns the number of bytes for count characters in a string. That could be
passed to printf:

printf( "%-*s",  u8length( 'äöü' , 4 ),  'äöü' )
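
One possible reading of this idea, sketched in C under stated assumptions: `utf8_field_bytes` is a hypothetical stand-in for `u8length`, taken to mean "the byte width that makes a byte-counting `%*s` span `count` code points". The input is assumed to be well-formed UTF-8.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Count code points in well-formed UTF-8 (skip 10xxxxxx bytes). */
size_t utf8_codepoints(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) n++;
    }
    return n;
}

/* Hypothetical helper: widen the requested field by the number of
 * "extra" bytes, so that a byte-based width yields `count` code points. */
int utf8_field_bytes(const char *s, int count) {
    size_t bytes = strlen(s);
    size_t points = utf8_codepoints(s);
    return count + (int)(bytes - points);
}

/* Left-justify s in a field of `count` code points using plain snprintf.
 * Returns snprintf's result, i.e. the number of bytes formatted. */
int pad_left_justified(char *out, size_t outsz, const char *s, int count) {
    return snprintf(out, outsz, "%-*s", utf8_field_bytes(s, count), s);
}
```

With 'äöü' and a count of 4, the byte width passed to snprintf is 7, and the result is 'äöü' followed by one space.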




Re: printf() problem padding multi-byte UTF-8 code points

Peter da Silva
In reply to this post by Richard Hipp-3
On 2018-02-17, at 17:36, Richard Hipp <[hidden email]> wrote:
> The current behavior of the printf() function in SQLite, goofy though
> it may be, exactly mirrors the behavior of the printf() C function in
> the standard library in this regard.
>
> So I'm not sure whether or not this is something that ought to be "fixed".

Printf's handling of Unicode is inconsistent in other ways, too. I suspect that there's still undefined behavior floating around in there. Even wprintf isn't entirely unsurprising:

% env
...
LANG=en_US.UTF-8
...
% cat localized.c
#include <stdio.h>
#include <wchar.h>

int main() {
wprintf (L"'%4ls'\n", L"äöü");
}
% cc localized.c
% ./a.out
' ???'
% cat delocalized.c
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
setlocale(LC_ALL, "");
wprintf (L"'%4ls'\n", L"äöü");
}
% cc delocalized.c
% ./a.out
' äöü'
% uname -a
Darwin Stonehenge.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jan 11 22:59:40 PST 2018; root:xnu-3789.73.8~1/RELEASE_X86_64 x86_64


Re: printf() problem padding multi-byte UTF-8 code points

Cezary H. Noweta
Hello,

On 2018-02-18 01:46, Peter Da Silva wrote:
> Printf's handling of unicode is inconsistent in other ways, too. I suspect that there's still undefined behavior floating around in there too. Even wprintf isn't entirely unsurprising:

The two examples you supplied have their names swapped, and both
confirm ``unsurprisingness'':

> LANG=en_US.UTF-8

Ok - so your native environment locale is ``UTF-8''.

> % cat localized.c

Why is that program named ``localized'' if...

> [...]
> int main() {
> wprintf (L"'%4ls'\n", L"äöü");

... you are using the "C" locale for LC_CTYPE? The behavior is entirely
unsurprising: there is no conversion from L"äöü" using the "C" LC_CTYPE.

> [...]
> % cat delocalized.c

Why is that program named ``delocalized'' if...

> [...]
> setlocale(LC_ALL, "");

... you are using the native environment locale (``UTF-8'') for LC_CTYPE?
The behavior is entirely unsurprising: there is a conversion from L"äöü"
using the "UTF-8" LC_CTYPE.

-- best regards

Cezary H. Noweta

Re: printf() problem padding multi-byte UTF-8 code points

Dominique Pellé
In reply to this post by Richard Hipp-3
Richard Hipp <[hidden email]> wrote:

> On 2/17/18, Ralf Junker <[hidden email]> wrote:
>> Example SQL:
>>
>> select
>>    length(printf ('%4s', 'abc')),
>>    length(printf ('%4s', 'äöü')),
>>    length(printf ('%-4s', 'abc')),
>>    length(printf ('%-4s', 'äöü'))
>>
>> Output is 4, 3, 4, 3. Padding seems to take into account UTF-8 bytes
>> instead of UTF-8 code points.
>>
>> Should padding not work on code points and output 4 in all cases as
>> requested?
>
> The current behavior of the printf() function in SQLite, goofy though
> it may be, exactly mirrors the behavior of the printf() C function in
> the standard library in this regard.
>
> So I'm not sure whether or not this is something that ought to be "fixed".


For what it's worth, this is what bash does, which looks
consistent with SQLite:

$ printf '[%4s]\n' 'abc'
[ abc]
$ printf '[%4s]\n' 'äöü'
[äöü]
$ printf '[%-4s]\n' 'abc'
[abc ]
$ printf '[%-4s]\n' 'äöü'
[äöü]

Perl does the same:

$ perl -e 'printf("[%4s]\n", "äöü")'
[äöü]

Vim printf() function does the same, but vim also
has a more convenient %S not present in the C printf(),
see :help printf()

          %s    string
          %6S    string right-aligned in 6 display cells
          %6s    string right-aligned in 6 bytes
          %.9s    string truncated to 9 bytes

:echo printf('[%4s]', 'äöü')
[äöü]
:echo printf('[%4S]', 'äöü')
[ äöü]

Perhaps SQLite could add %S along those lines.
After all, SQLite already added "%q", "%Q", "%w"
and "%z" which are not present in the C printf().

Vim uses the number of display cells (not number of
code points). East Asian characters generally take
twice the size of Latin characters on screen, and
such characters take 2 cells on screen. Vim also
provides functions to find string length in bytes strlen(),
in display cells strwidth() and number of characters
strchars():

:echo strlen('äöü')
6
:echo strwidth('äöü')
3
:echo strchars('äöü')
3

With a more interesting string containing
East Asian characters:

:echo strlen('äöü中文')
12
:echo strwidth('äöü中文')
7
:echo strchars('äöü中文')
5
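
The three measures can be sketched in C as follows. This is only an illustration under loud assumptions: `rough_cells` is a hypothetical, very coarse stand-in for a real width table (it is not Vim's implementation and not `wcwidth()`), and the decoder assumes well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one code point from UTF-8 and advance *p past it. */
uint32_t utf8_next(const char **p) {
    const unsigned char *s = (const unsigned char *)*p;
    uint32_t cp;
    int extra, i;
    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xC0) { cp = 0xFFFD;      extra = 0; } /* stray byte */
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }
    for (i = 1; i <= extra && (s[i] & 0xC0) == 0x80; i++)
        cp = (cp << 6) | (s[i] & 0x3F);
    *p += i;
    return cp;
}

/* Assumption: a few East Asian blocks take 2 cells, everything else 1. */
int rough_cells(uint32_t cp) {
    if ((cp >= 0x1100 && cp <= 0x115F) ||  /* Hangul Jamo leads */
        (cp >= 0x2E80 && cp <= 0xA4CF) ||  /* CJK radicals .. Yi */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||  /* Hangul syllables */
        (cp >= 0xF900 && cp <= 0xFAFF) ||  /* CJK compatibility */
        (cp >= 0xFF00 && cp <= 0xFF60))    /* fullwidth forms */
        return 2;
    return 1;
}

/* Number of code points (comparable to strchars() above). */
size_t utf8_chars(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    while (*s != '\0') { utf8_next(&s); n++; }
    return n;
}

/* Number of display cells under the rough table (cf. strwidth()). */
size_t utf8_cells(const char *s) {
    size_t cells = 0;
    assert(s != NULL);
    while (*s != '\0') cells += (size_t)rough_cells(utf8_next(&s));
    return cells;
}
```

For 'äöü中文' this gives 12 bytes via strlen(), 5 code points and 7 cells, matching the three Vim results quoted above.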

Regards
Dominique

Re: printf() problem padding multi-byte UTF-8 code points

Ralf Junker
In reply to this post by Richard Hipp-3
On 18.02.2018 00:36, Richard Hipp wrote:

> The current behavior of the printf() function in SQLite, goofy though
> it may be, exactly mirrors the behavior of the printf() C function in
> the standard library in this regard.

SQLite3 is not C. SQLite3 text storage is always Unicode. Thus SQL text
processing functions should work on Unicode. The current implementation
of the SQLite3 SQL printf() cannot reliably be used for string padding.
And there is no simple alternative, AFAICS.

PostgreSQL returns 4 in all cases:

select
    length(format ('%4s', 'abc')),
    length(format ('%4s', 'äöü')),
    length(format ('%-4s', 'abc')),
    length(format ('%-4s', 'äöü'))

MySQL has lpad() and rpad() to achieve the same and also returns 4 in
all cases (measured with char_length(), since MySQL's length() counts
bytes):

select
    char_length(lpad ('abc', 4, ' ')),
    char_length(lpad ('äöü', 4, ' ')),
    char_length(rpad ('abc', 4, ' ')),
    char_length(rpad ('äöü', 4, ' '))

I strongly believe that SQLite3 should follow suit.
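
For illustration, an lpad() that counts code points and accepts an arbitrary fill string could look roughly like the C sketch below. `utf8_lpad` is a hypothetical helper, not PostgreSQL's, MySQL's or SQLite's implementation. It assumes well-formed UTF-8 and, unlike the SQL functions, never truncates a string that is already long enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in well-formed UTF-8 (skip 10xxxxxx bytes). */
size_t utf8_codepoints(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) n++;
    }
    return n;
}

/* Byte length of the code point that starts at s (s must not be ""). */
size_t utf8_cp_len(const char *s) {
    size_t n = 1;
    while (((unsigned char)s[n] & 0xC0) == 0x80) n++;
    return n;
}

/* Left-pad s to `width` code points, cycling through `fill` one code
 * point at a time. Returns 0 on success, -1 if the result does not fit
 * into out[outsz] or if padding is needed but fill is empty. */
int utf8_lpad(char *out, size_t outsz, const char *s, size_t width,
              const char *fill) {
    size_t have, used = 0;
    const char *f = fill;
    assert(out != NULL && s != NULL && fill != NULL);
    have = utf8_codepoints(s);
    if (have < width && *fill == '\0') return -1;
    for (; have < width; have++) {
        size_t n;
        if (*f == '\0') f = fill;          /* wrap around the fill string */
        n = utf8_cp_len(f);
        if (used + n >= outsz) return -1;  /* keep room for the NUL */
        memcpy(out + used, f, n);
        used += n;
        f += n;
    }
    if (used + strlen(s) >= outsz) return -1;
    strcpy(out + used, s);
    return 0;
}
```

Padding 'äöü' to 4 with ' ' yields ' äöü' (4 code points, 7 bytes), and padding 'abc' to 6 with 'xy' yields 'xyxabc'.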

Ralf

Re: printf() problem padding multi-byte UTF-8 code points

Rowan Worth-2
In reply to this post by Ralf Junker
What is your expected answer for:

select length(printf ('%4s', 'です'))

-Rowan


Re: printf() problem padding multi-byte UTF-8 code points

Ralf Junker
On 19.02.2018 09:50, Rowan Worth wrote:

> What is your expected answer for:
>
> select length(printf ('%4s', 'です'))

'です' are 2 codepoints according to

   http://www.fontspace.com/unicode/analyzer/?q=%E3%81%A7%E3%81%99

The requested overall width is 4, so I would expect two added
spaces and a total length of 4.

Ralf

PS: SQLite3 returns 2, which is less than the requested width.

Re: printf() problem padding multi-byte UTF-8 code points

Cezary H. Noweta
In reply to this post by Richard Hipp-3
Hello,

On 2018-02-18 00:36, Richard Hipp wrote:
> The current behavior of the printf() function in SQLite, goofy though
> it may be, exactly mirrors the behavior of the printf() C function in
> the standard library in this regard.

> So I'm not sure whether or not this is something that ought to be "fixed".

For the sake of sanity, such an exception could be considered, i.e. the
``length'' specification could mean a number of ``multibyte characters''
-- not ``characters''. A C programmer has a chance to output his whole
buffer, especially since there are no special provisions for multibyte
characters in the buffer (i.e. it need not begin or end in the initial
shift state): for ( i = 0; len > i; i += 5 ) printf("%-5.5s", &s[i]); --
a bit nonsensical, but it illustrates the problem.
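
The problem with such a loop is that a byte-based precision can cut a multibyte character in half. A minimal C sketch of one way around that, assuming well-formed UTF-8 (the helper name `utf8_safe_prefix` is hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Largest n <= max_bytes such that s[0..n) ends on a code point
 * boundary. A cut is unsafe when the byte right after it is a
 * continuation byte (10xxxxxx), so back up until it is not. */
size_t utf8_safe_prefix(const char *s, size_t max_bytes) {
    size_t len, n;
    assert(s != NULL);
    len = strlen(s);
    n = len < max_bytes ? len : max_bytes;
    while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80) n--;
    return n;
}
```

The result can then be used as a precision, e.g. `printf("%.*s", (int)utf8_safe_prefix(s, 5), s)`. For 'äöü' and a limit of 5 bytes it returns 4, so only 'äö' is written rather than 'äö' plus half of 'ü'.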

On the other hand, SQLite's SQL has no access to memory buffers. For such
a case, the C standard handles the situation (look at the end of the
description of the ``s'' conversion specifier used together with the ``l''
length modifier): ``In no case is a partial multibyte character written.''

Is there anybody who thinks about the byte content of buffers when
writing software at the SQL level?

-- best regards

Cezary H. Noweta

Re: printf() problem padding multi-byte UTF-8 code points

J Decker
On Mon, Feb 19, 2018 at 3:21 AM, Cezary H. Noweta <[hidden email]>
wrote:

> Is there somebody who thinks about the byte content of buffers when he is
> writing software at the SQL level?

Everyone dealing with padding/precision using printf()?



Re: printf() problem padding multi-byte UTF-8 code points

J Decker
In reply to this post by Ralf Junker
On Mon, Feb 19, 2018 at 2:54 AM, Ralf Junker <[hidden email]> wrote:

> On 19.02.2018 09:50, Rowan Worth wrote:
>
>> What is your expected answer for:
>>
>> select length(printf ('%4s', 'です'))
>
> 'です' are 2 codepoints according to
>
>   http://www.fontspace.com/unicode/analyzer/?q=%E3%81%A7%E3%81%99
>
> The requested overall width is 4, so I would expect two added
> spaces and a total length of 4.
>
> Ralf
>
> PS: SQLite3 returns 2, which is less than the requested width.


Okay, but the functions in the other databases aren't printf. Because
SQLite's printf() mimics the C function of the same name, I would expect
the count to be in bytes.
Unfortunately, (v)(s)(n)printf and sscanf have no notion of a rune the way
Go does. I might expect fprintf to understand the locale and UTF-8 or other
wide encodings when writing to a fopen( ..., 't' ) type file, but probably
not even then, since I think fprintf is a vsnprintf into a buffer which is
then passed to fwrite or fputs, so it's probably bytes there too.

Changing the function is bound to break things, and it wouldn't be a small
task to reimplement a C library as UTF-8.

The SQL functions (that are not C emulations) do work in code points and not
bytes (for the most part; they break unnecessarily on NUL characters, which
is not SQL compliant).

One could make a function to do the same job, but correctly; even so, you'd
have to find a UTF-8 printf. This is not a lot of help, but maybe worth
mentioning:
https://stackoverflow.com/questions/9325487/looking-for-utf8-aware-formatting-functions-like-printf-etc

" Just a warning, counting "characters" in Unicode data is quite a
complicated business. Besides the fact that each code point in UTF-8 is
composed of several bytes, each glyph (or "grapheme") can be composed of
several code points, and for that reason fwprintf is inadequate for
truncating Unicode data anyway -- for example you could cut off an accent
without cutting off the character it applies to. So whatever you end up
using, make sure that the meaning of the length you specify is clear to you.
 – Steve Jessop <https://stackoverflow.com/users/13005/steve-jessop> Feb 17
'12 at 9:20
<https://stackoverflow.com/questions/9325487/looking-for-utf8-aware-formatting-functions-like-printf-etc#comment11766824_9325487>
 "

I'm not finding anything; everyone recommends doing it a different way
(use a Unicode library, which doesn't have a printf) or doing it in another
language (use a String type or something).

Re: printf() problem padding multi-byte UTF-8 code points

petern
In reply to this post by Ralf Junker
As d3ck0r suggested, adding a byte_length() function would enable padding
with spaces [but not general padding with arbitrary characters as lpad() and
rpad() afford].

WITH points(p) AS (VALUES ('abc'), ('äöü'), ('です'))
,format(f) AS (VALUES ('%*s'), ('%-*s'))
,pad AS (SELECT p, f, printf(f, byte_length(p)+(4-length(p)), p) pad
         FROM points CROSS JOIN format)
SELECT p, f, pad, length(pad) len FROM pad;

'p','f','pad','len'
'abc','%*s',' abc',4
'abc','%-*s','abc ',4
'äöü','%*s',' äöü',4
'äöü','%-*s','äöü ',4
'です','%*s','  です',4
'です','%-*s','です  ',4

A new byte_length() function is a great idea, except for the difficulty of
getting action on publishing it along with the requisite help page entry.
I recently asked for 1 protective source line to be added to the eval()
function against a segmentation fault, but got neither action nor reply.
Experience suggests you will have to add the 3 source lines to your local
copy of SQLite if you must pad strings containing high code points:

/* SQL function byte_length(X): size of X in bytes, via sqlite3_value_bytes(). */
static void byte_length(sqlite3_context *context, int argc, sqlite3_value **argv) {
  sqlite3_result_int(context, sqlite3_value_bytes(argv[0]));
}

Peter






Re: printf() problem padding multi-byte UTF-8 code points

Jens Alfke-2
In reply to this post by Ralf Junker


> On Feb 19, 2018, at 2:54 AM, Ralf Junker <[hidden email]> wrote:
>
> 'です' are 2 codepoints according to
>
>  http://www.fontspace.com/unicode/analyzer/?q=%E3%81%A7%E3%81%99
>
> The requested overall width is 4, so I would expect two added spaces and a total length of 4.

If this is being done for the purpose of visual alignment in a monospaced font, it's not going to work. Both of those hiragana characters are displayed as double-width (in macOS's Terminal at least), so their visual width is 4 spaces, meaning there should be zero spaces of padding.

You really _cannot_ equate Unicode code-points with visual width of displayed text, even in a monospaced layout. Not only do terminals render some characters as double-width, but there are all kinds of other exceptions like zero-width joiners, diacritical marks, ligatures, and joined forms. As a very common example of the latter, many emojis — e.g. all the faces with multiple skin tones — are actually composed of multiple (up to five or six) Unicode code-points.

TL;DR: If you use character (code-point) counts to visually lay out text, you're likely to get bad results with anything other than plain ASCII, so it's only marginally better than just counting bytes.
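
A small C sketch of why code point counts and perceived characters diverge. It is deliberately incomplete: `is_zero_width` is a hypothetical helper that knows only three ranges (combining diacritical marks, the zero-width joiner and variation selectors), whereas real grapheme segmentation follows the Unicode text segmentation rules. Well-formed UTF-8 is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from UTF-8 and advance *p past it. */
uint32_t utf8_next(const char **p) {
    const unsigned char *s = (const unsigned char *)*p;
    uint32_t cp;
    int extra, i;
    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }
    else if (s[0] < 0xC0) { cp = 0xFFFD;      extra = 0; } /* stray byte */
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }
    else                  { cp = s[0] & 0x07; extra = 3; }
    for (i = 1; i <= extra && (s[i] & 0xC0) == 0x80; i++)
        cp = (cp << 6) | (s[i] & 0x3F);
    *p += i;
    return cp;
}

/* Assumption: only these ranges are treated as occupying no column. */
int is_zero_width(uint32_t cp) {
    return (cp >= 0x0300 && cp <= 0x036F) ||  /* combining diacritics */
           cp == 0x200D ||                    /* zero-width joiner */
           (cp >= 0xFE00 && cp <= 0xFE0F);    /* variation selectors */
}

/* Plain code point count. */
size_t utf8_chars(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    while (*s != '\0') { utf8_next(&s); n++; }
    return n;
}

/* Code points that occupy a column under the rough rule above. */
size_t rough_columns(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    while (*s != '\0') {
        if (!is_zero_width(utf8_next(&s))) n++;
    }
    return n;
}
```

A decomposed 'é' (the letter 'e' followed by U+0301) is 2 code points but 1 column, while the precomposed 'é' (U+00E9) is 1 code point and 1 column, so padding by code points treats two visually identical strings differently.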

—Jens

Re: printf() problem padding multi-byte UTF-8 code points

Keith Medcalf
In reply to this post by Ralf Junker

Should not your application just retrieve the UTF-8 text and format it for display to the user?  User <-> Software formatting (and input/output diddling of any type) should only be done ONCE (on INPUT from the user or on OUTPUT to the user) as close to the User as possible and should *NEVER EVER* be done as an intermediate step that is used for any other purpose (than originating from or terminating with a user).

It is neither possible nor expected for a "Data Storage System" to know about the local foibles of the user -- that is an application programming (UI) issue.  "Data Storage Systems" store data in a user-foible-free state.  Date/Time are UTC ISO8601, text (encoded) is just a bunch of character units stored side-by-each, blobs are just a sequence of bytes stored side-by-each, integers are, well, integers in binary format; and, floating point is stored in floating point format (as an approximation to a value).

Placement of commas/decimal points, display precision of floating point numbers, formatting of dates and "bag-o-bytes" (encoded text or blobs) are UI issues and not properly a part of the "Data Storage System".

That SQLite contains a "printf" function is quaint, but it is merely quaint and should not be expected to provide the same capabilities as a "proper" UI.

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.



Re: printf() problem padding multi-byte UTF-8 code points

Cezary H. Noweta
In reply to this post by Ralf Junker
Hello,

On 2018-02-17 18:39, Ralf Junker wrote:

> Example SQL:
>
> select
>    length(printf ('%4s', 'abc')),
>    length(printf ('%4s', 'äöü')),
>    length(printf ('%-4s', 'abc')),
>    length(printf ('%-4s', 'äöü'))
>
> Output is 4, 3, 4, 3. Padding seems to take into account UTF-8 bytes
> instead of UTF-8 code points.
>
> Should padding not work on code points and output 4 in all cases as
> requested?

If you are interested in a patch extending the functionality of
``printf()'', see http://sqlite.chncc.eu/utf8printf/. Adding the ``l''
length modifier makes width/precision specifications be treated as
numbers of UTF-8 chars -- not bytes. ``SELECT length(printf ('%4ls',
'äöü'));'' will give 4.

-- best regards

Cezary H. Noweta
Re: printf() problem padding multi-byte UTF-8 code points

petern
FYI.  See http://www.sqlite.org/src/timeline for the equivalent DRH
check-ins:  http://www.sqlite.org/src/info/c883c4d33f4cd722
Hopefully that branch will make a forthcoming trunk merge.   [Printing an
explicit NUL terminator via formatting is an interesting twist.]

Yet even so, as Ralf pointed out, the PostgreSQL lpad() and rpad()
functionality of filling with an arbitrary string would still be missing,
even though the checked-in printf() is more directly equivalent to the
PostgreSQL format() function.  First things first, I suppose...

PostgreSQL lpad() and rpad() documentation is here:
https://www.postgresql.org/docs/9.5/static/functions-string.html

Peter


Re: printf() problem padding multi-byte UTF-8 code points

J Decker
On Mon, Feb 19, 2018 at 5:38 PM, petern <[hidden email]> wrote:

> FYI.  See http://www.sqlite.org/src/timeline for the equivalent DRH
> checkins:  http://www.sqlite.org/src/info/c883c4d33f4cd722
> Hopefully that branch will make a forthcoming trunk merge.   [Printing
> explicit nul terminator by formatting an interesting twist.]
>
@DRH
printf( "whatever%ctest", 0 ); should result in a string that contains that
NUL character:
int length = snprintf( buf, 256, "whatever%ctest", 0 );

length == 13, while applying strlen to the same buffer will yield only 8
as the length.
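
The difference between the two lengths can be checked with a few lines of standard C. The function name `embedded_nul_lengths` is made up for this sketch, which only wraps the snprintf() and strlen() calls mentioned above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format "whatever%ctest" with a NUL character as the %c argument and
 * report both the formatted length and what strlen() sees. */
void embedded_nul_lengths(int *formatted, size_t *visible) {
    char buf[256];
    assert(formatted != NULL && visible != NULL);
    /* snprintf counts every byte it formats, including the embedded NUL:
     * 8 ("whatever") + 1 (NUL) + 4 ("test"). */
    *formatted = snprintf(buf, sizeof buf, "whatever%ctest", 0);
    /* strlen stops at the first NUL, i.e. right after "whatever". */
    *visible = strlen(buf);
}
```

The formatted length comes out as 13 and the strlen() result as 8, which is why a length-carrying API can preserve the embedded NUL while a C-string API cannot.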



Re: printf() problem padding multi-byte UTF-8 code points

Simon Slavin-3
In reply to this post by petern
On 20 Feb 2018, at 1:38am, petern <[hidden email]> wrote:

> Yet even so, as Ralf pointed out, the PostgreSQL lpad() and rpad() fill
> with arbitrary string functionality would still be missing despite the
> checked in printf() being more directly equivalent to the PostgreSQL
> format() function.  First things first I suppose...
>
> PostgreSQL lpad() and rpad() documentation is here:
> https://www.postgresql.org/docs/9.5/static/functions-string.html

The problem with string length and padding was pointed out upthread.  Padding strings to a length was useful in the days of fixed-width fonts.  We don't do that much these days.  And even if you could equip SQLite with functions which did those things, to do it properly you'd need routines which understood Unicode characters, combinations, accents and the sort of diacritics used for Hebrew and Arabic vowels.  Without that, your fancy new feature is just going to trigger hundreds of bug reports.

String width functions nowadays take two parameters, the string (in some flavour of Unicode) and a font descriptor (font, size, emphasis and other options) and return the width of the string in points, taking into account not only Unicode features but font features like kern hinting and ligatures.  And you will find these features in your operating system.

So please, folks, don't try to do this in a purposely tiny DBMS.  Do it using OS calls, as the people who designed your OS intended.

Simon.

Re: printf() problem padding multi-byte UTF-8 code points

petern
There are other uses for padding strings besides user reports.  Consider
scalar representations of computations for example. Also:

1. There was no mention of user display formatting in Ralf's original
report.  It was a bug report about missing inverse functionality for
padding/trimming strings.
2. The proposed functions fully exist in the PostgreSQL archetype.  Is
PostgreSQL wrong?
3. Why can't SQLite have the expected common static SQL functions for
getting rapid development done without external tools?
Is the goal to reduce SQL portability and increase development effort just
to see some representative output results?

I don't think anybody is trying to create production grade displays within
SQL but being able to produce representative output and having the expected
nucleus of built-in SQL functions (including canonical inverses) is still a
sensible goal.
