FTS5 minimum number of characters to index ?


FTS5 minimum number of characters to index ?

Domingo Alvarez Duarte
Hello!

I'm looking at the documentation and it doesn't seem to mention any
option to specify a minimum number of characters to index. Judging from
some fts5 tables, an option to require words to be at least 2 or 3
characters long would work well as a stop-word filter. Another
interesting option would be a regex-like black/white list of character
sequences to be indexed.

Something like:

create virtual table if not exists pdfs_fts using fts5(
    pdf_name UNINDEXED,
    data,
    tokenize = 'unicode61 remove_diacritics 1 min_word_size 3
                word_black_list [\d\.\d\d\w \a\d\d\d]
                word_white_list [\(\d+\) \d\d\.\d\d\d\.\d\d\a]'
);

The idea is to allow specific domain sequences to be included in, or
excluded from, indexing.

Any ideas on how to obtain this?

Cheers !

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: FTS5 minimum number of characters to index ?

Jens Alfke-2


> On Sep 21, 2018, at 3:26 AM, Domingo Alvarez Duarte <[hidden email]> wrote:
>
> Judging from some fts5 tables, an option to require words to be at least 2 or 3 characters long would work well as a stop-word filter.

A real stop-word list is valuable, but I don’t think a simple minimum-length rule would be as useful. Maybe in a few contexts, but not in general. (It’s not useful even for English text; for example, I’m very glad that Google indexes the word “C” so I can look up questions about C programming!)

> Another interesting option would be a regex-like black/white list of character sequences to be indexed.

You can do all this and more with a custom tokenizer :)

(Most real-world uses of FTS for natural language text will end up needing a custom tokenizer anyway, because IIRC the default tokenizer is very stupid and only breaks at whitespace. At a minimum you need one that can ignore inter-word punctuation like periods and commas, and recognize some non-ASCII characters like curly quotes and en-dashes.)
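A full custom tokenizer has to be written in C against the fts5 API, but the same effect can be approximated at the application layer by filtering the text before it is inserted into the FTS table. Here is a minimal Python sketch; the 3-character minimum and the purely-numeric blacklist are assumptions standing in for the `min_word_size` / `word_black_list` options proposed above:

```python
import re

MIN_WORD_SIZE = 3                 # assumed minimum, as in the proposal
BLACKLIST = re.compile(r"^\d+$")  # hypothetical rule: drop purely numeric words

def filter_for_indexing(text):
    """Return only the words worth indexing; store this in the fts5 column."""
    words = re.findall(r"\w+", text)
    kept = [w for w in words
            if len(w) >= MIN_WORD_SIZE and not BLACKLIST.match(w)]
    return " ".join(kept)

print(filter_for_indexing("Chapter 12 covers the ABC protocol in C"))
# Chapter covers the ABC protocol
```

The original document text would go in an UNINDEXED column (or an external content table) so queries can still return it verbatim, while the filtered copy feeds the indexed column. Note that this sketch also illustrates the cost of a blanket length rule: it drops "C" entirely, which is exactly the kind of term you may want to keep.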

—Jens