FTS3 tokenize unicode61 does not remove diacritics correctly?

Artur Król-2
Hi all,

I have an issue with FTS3 (http://www.sqlite.org/fts3.html).

I am creating a virtual table with fts3tokenize to inspect the tokens:
CREATE VIRTUAL TABLE tok1 USING fts3tokenize(unicode61);
The documentation says:
„By default, "unicode61" also removes all diacritics from Latin script characters.”

When I run this query to select the tokens:
SELECT token FROM tok1 WHERE input='ęóąłżźćńĘÓĄŁŻŹĆŃlŁ*';

The result is:
eoałzzcneoałzzcnlł

It seems the diacritics on the letters „ł” and „Ł” were not removed. Is this an SQLite bug?

Regards,
Artur Król


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: FTS3 tokenize unicode61 does not remove diacritics correctly?

Cezary H. Noweta
Hello,

On 2017-02-16 10:53, [hidden email] wrote:
> [...]
> The result is:
> eoałzzcneoałzzcnlł

> It seems diacritics from letter „ł” and „Ł” was not removed. Is it a sqlite bug?

In general, overlays (slashes, crossbars, etc.) are considered
diacritics; however, Unicode does not provide a decomposition mapping
for ``ł'' or ``Ł''. Even if this is a bug, it concerns the Unicode
standard rather than SQLite FTS3 itself, since the latter uses the
character database provided by the Unicode standard.
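
To illustrate the point about decomposition mappings, here is a minimal
Python sketch that uses the standard unicodedata module. It is not how
SQLite implements unicode61; the helper name strip_decomposable_marks is
made up for this example and only approximates the idea of "fold case,
decompose, drop the combining marks". Under that assumption it gives the
same token that was reported above.

```python
import unicodedata


def strip_decomposable_marks(text):
    """Lowercase, decompose to NFD, then drop combining marks.

    Hypothetical helper for illustration only; it is not SQLite's code.
    """
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# U+0119 has a canonical decomposition: base letter e plus a combining ogonek.
print(repr(unicodedata.decomposition("ę")))  # '0065 0328'

# U+0142 has no decomposition mapping, so there is no mark to strip.
print(repr(unicodedata.decomposition("ł")))  # ''

folded = strip_decomposable_marks("ęóąłżźćńĘÓĄŁŻŹĆŃlŁ")
print(folded)  # eoałzzcneoałzzcnlł
```

The stroke in ``ł'' survives because the Unicode character database
encodes it as an atomic letter, not as ``l'' plus a combining mark.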

-- best regards

Cezary H. Noweta