FTS tokenize=unicode61: "full" or "simple" case folding?

Tomash Brechko
Hello,

https://www.sqlite.org/fts3.html#tokenizer page says that unicode61
tokenizer implements _full_ case folding (it doesn't emphasize the word,
but it's there).  ftp://unicode.org/Public/6.1.0/ucd/CaseFolding.txt has
the following rules:

-- cut --
...
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
...
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
...
-- cut --

I.e. in _full_ case folding both "ẞ" (U+1E9E) and "ß" (U+00DF) are mapped
to "ss", whereas in _simple_ case folding the first one is mapped to the
second.  SQLite 3.11.0 works according to the simple rules:

-- cut --
CREATE VIRTUAL TABLE t USING fts3tokenize(unicode61);
SELECT token FROM t WHERE input = 'ẞ ß';
-- cut --
gives
-- cut --
ß
ß
-- cut --
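
To make the difference concrete, here is a minimal Python sketch that applies
only the three CaseFolding.txt rules quoted above. The helper names
(load_rules, fold) are made up for illustration and have nothing to do with
SQLite's implementation; the sketch assumes the usual reading of the file,
where simple folding uses the C and S rules and full folding uses the C and
F rules.

```python
# The three CaseFolding.txt rules quoted above (Unicode 6.1.0 excerpt).
RULES = """\
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def load_rules(text):
    """Parse rule lines into {status: {source_char: folded_string}}."""
    tables = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        code, status, mapping = [field.strip() for field in data.split(";")[:3]]
        folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
        tables.setdefault(status, {})[chr(int(code, 16))] = folded
    return tables


def fold(text, tables, mode):
    """Fold text using C+S rules for "simple" or C+F rules for "full".

    Characters without a rule in this tiny excerpt pass through unchanged.
    """
    statuses = ("C", "S") if mode == "simple" else ("C", "F")
    out = []
    for ch in text:
        for status in statuses:
            if ch in tables.get(status, {}):
                out.append(tables[status][ch])
                break
        else:
            out.append(ch)
    return "".join(out)


TABLES = load_rules(RULES)
SAMPLE = "\u1e9e \u00df"  # the two sharp-s characters used in the query above

SIMPLE = fold(SAMPLE, TABLES, "simple")  # U+00DF, space, U+00DF
FULL = fold(SAMPLE, TABLES, "full")      # "ss ss"
```

With these rules the simple result matches the two tokens returned by the
query above, while the full result would be "ss" for both.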

So which one is correct, the documentation or the implementation?  I also
wonder what a native German speaker would expect from a full-text search.
(Google gives different result counts for "Schloß" and "Schloss", which
actually surprises me a bit.)

--
  Tomash Brechko
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: FTS tokenize=unicode61: "full" or "simple" case folding?

Richard Hipp-3
On 3/21/16, Tomash Brechko <[hidden email]> wrote:
> Hello,
>
> https://www.sqlite.org/fts3.html#tokenizer page says that unicode61
> tokenizer implements _full_ case folding (it doesn't emphasize the word,
> but it's there).

That is a documentation error.  It has now been fixed.  Thanks.

Probably the error originates with our thinking of "full" in the sense
of "more than just ASCII", rather than the official Unicode definition
of "full", which is "it can possibly change the number of code points".
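
As a rough illustration of that definition (not of SQLite's code), Python's
built-in str.casefold() performs full case folding, so it can be used to see
which characters change their code point count. The helper name
changes_length is made up for this sketch.

```python
def changes_length(ch):
    """True if full case folding maps ch to a different number of code points."""
    return len(ch.casefold()) != len(ch)


# "A", small sharp s (U+00DF) and capital sharp s (U+1E9E).
for ch in ("A", "\u00df", "\u1e9e"):
    folded = [f"U+{ord(c):04X}" for c in ch.casefold()]
    print(f"U+{ord(ch):04X}", folded, changes_length(ch))
```

Both sharp-s characters fold to two code points (U+0073 U+0073), which is
exactly the kind of mapping a simple folding cannot express.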

--
D. Richard Hipp
[hidden email]

Re: FTS tokenize=unicode61: "full" or "simple" case folding?

Matthias-Christian Ott
In reply to this post by Tomash Brechko
On 2016-03-21 20:43, Tomash Brechko wrote:

> Hello,
>
> https://www.sqlite.org/fts3.html#tokenizer page says that unicode61
> tokenizer implements _full_ case folding (it doesn't emphasize the word,
> but it's there).  ftp://unicode.org/Public/6.1.0/ucd/CaseFolding.txt has
> the following rules:
>
> -- cut --
> ...
> 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
> ...
> 1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
> 1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
> ...
> -- cut --
>
> I.e. in _full_ case folding both "ẞ" (U+1E9E) and "ß" (U+00DF) are mapped
> to "ss", whereas in _simple_ case folding first one is mapped to the
> second.  SQLite 3.11.0 works according to simple rules:
>
> -- cut --
> CREATE VIRTUAL TABLE t USING fts3tokenize(unicode61);
> SELECT token FROM t WHERE input = "ẞ ß";
> -- cut --
> gives
> -- cut--
> ß
> ß
> -- cut--
>
> So which one is correct, documentation or implementation?  I also wonder
> what a native German speaker would expect in full-text search case?
> (Google gives different result counts for "Schloß" and "Schloss", which
> actually surprises me a bit).

The capital letter "ẞ" (U+1E9E) was often not present in fonts, is not
included in ISO/IEC 8859-1:1998, and has not historically been in common use
in German (the German Wikipedia and the articles' references can explain this
better than I can). It was added to Unicode only "recently", in version 5.1
in 2008. In all-caps titles it is common to capitalize ß as either SS or SZ
(the latter to avoid ambiguities). I think it's uncertain whether ẞ will be
widely used.

If I understand Unicode case folding correctly, it exists so that Unicode
strings can be compared case-insensitively by converting them into a
canonical form. By that measure simple case folding seems correct, as ẞ
would be folded to ß. However, if you keep the old orthography (before 1996)
in mind and ask what makes sense for a search engine, full case folding makes
more sense. As you noted, "Schloß" and "Schloss" should return the same
results for non-verbatim searches, as such a distinction would seemingly
only be relevant to linguists or historians, not to everyday use and
business information systems.
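
A minimal sketch of that idea in Python, assuming a toy whitespace tokenizer
and made-up sample documents: both the indexed words and the query term are
normalized with str.casefold(), which performs full case folding. The names
build_index and search are hypothetical and unrelated to SQLite's FTS.

```python
from collections import defaultdict


def build_index(docs):
    """Map each full-case-folded word to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():  # toy tokenizer: split on whitespace only
            index[word.casefold()].add(doc_id)
    return index


def search(index, term):
    """Look up a single term after applying the same folding as the index."""
    return index.get(term.casefold(), set())


# Hypothetical sample documents mixing old and new spellings.
DOCS = {
    1: "Das Schlo\u00df am See",      # old orthography, with sharp s
    2: "Ein Schloss in Bayern",       # post-1996 orthography
    3: "SCHLOSS NEUSCHWANSTEIN",      # all-caps title
}
INDEX = build_index(DOCS)

print(sorted(search(INDEX, "Schlo\u00df")))  # [1, 2, 3]
print(sorted(search(INDEX, "Schloss")))      # [1, 2, 3]
```

With simple case folding the first document would be indexed under a
different key than the other two, so the two spellings would return
different results.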

The Unicode standard is unfortunately vague about what it wants to
achieve by case folding and what thoughts went into the case folding
table. Perhaps you should ask on the Unicode mailing list.

You didn't describe your use case, but I would also generally advise
using a phonetic algorithm for German to canonicalize words for
non-verbatim searches instead of case folding. It gives better results,
and most German speakers I know appreciate the phonetic corrections of
popular Internet search engines for non-verbatim searches.

I hope this helps. Maybe it also helps to consult a linguist for
building a non-simplistic search engine for German. For example, you
have to perform compound splitting, stemming and some form of
grammatical analysis at some point.

- Matthias-Christian