How can custom tokenizer tell it's parsing a search string?

How can custom tokenizer tell it's parsing a search string?

Jens Alfke-2
Is there any way for a custom FTS4 tokenizer to know when it’s tokenizing a search string (the argument of a MATCH expression), as opposed to text to be indexed?

Here’s my problem: I’ve implemented a custom tokenizer that skips “stop words” (noise words, like “the” and “a” in English). It works well. But I’ve just gotten a bug report that some search strings with wildcards don’t work. For example, “mo* AND the*” would be expected to match text containing the words “Moog” and “theremin”, but instead the query fails with the SQLite error “malformed MATCH expression: [mo* AND the*]”.

The reason for the error is that when the query runs, FTS4 uses my tokenizer to break the search string into words. My tokenizer skips “the” because it’s a stop word, so the sequence of tokens FTS4 gets is “mo”, “*”, “AND”, “*” … which is invalid since there’s no prefix before the second “*”.

I can fix this by preserving stop-words when the tokenizer is being used to scan the search string. But I can’t find any way for the tokenizer to tell the difference! It’s the same tokenizer instance used for indexing, and the SQLite function getNextToken opens it in the normal way and calls its xNext function.

The best workaround I can think of is to make the tokenizer preserve a stop-word when it’s followed by a “*” … but there are contexts where this can happen in regular text being indexed, when the “*” is a footnote marker or the end of a Markdown emphasis sequence.
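For illustration, the workaround described above might look roughly like the following C sketch. It is not the actual tokenizer and is not wired into the FTS4 tokenizer interface (sqlite3_tokenizer_module); the stop-word list, the function names next_token and join_tokens, and the keep_before_star parameter are all hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stop-word list; a real tokenizer would use a fuller one. */
static const char *const kStopWords[] = { "a", "an", "of", "the" };

static int is_stop_word(const char *word) {
    for (size_t i = 0; i < sizeof kStopWords / sizeof kStopWords[0]; i++) {
        if (strcmp(word, kStopWords[i]) == 0) return 1;
    }
    return 0;
}

/*
 * Copy the next token of `text`, starting at *pos, into `out` (lowercased,
 * truncated to fit). Returns 1 if a token was produced, 0 at end of input.
 * A stop word is dropped unless keep_before_star is nonzero and the word is
 * immediately followed by '*'.
 */
static int next_token(const char *text, size_t *pos, int keep_before_star,
                      char *out, size_t out_cap) {
    assert(text != NULL && pos != NULL && out != NULL && out_cap > 0);
    size_t i = *pos;
    while (text[i] != '\0') {
        /* Skip separators: anything that is not a letter or digit. */
        while (text[i] != '\0' && !isalnum((unsigned char)text[i])) i++;
        if (text[i] == '\0') break;

        /* Collect one word. */
        size_t n = 0;
        while (isalnum((unsigned char)text[i])) {
            if (n + 1 < out_cap) out[n++] = (char)tolower((unsigned char)text[i]);
            i++;
        }
        out[n] = '\0';

        int followed_by_star = (text[i] == '*');
        if (!is_stop_word(out) || (keep_before_star && followed_by_star)) {
            *pos = i;
            return 1;
        }
        /* Otherwise it was a stop word: drop it and keep scanning. */
    }
    *pos = i;
    return 0;
}

/* Join all tokens of `text` with single spaces into `buf`, for inspection. */
static void join_tokens(const char *text, int keep_before_star,
                        char *buf, size_t cap) {
    char tok[64];
    size_t pos = 0, used = 0;
    if (cap == 0) return;
    buf[0] = '\0';
    while (next_token(text, &pos, keep_before_star, tok, sizeof tok)) {
        size_t len = strlen(tok);
        if (used + len + 2 > cap) break;   /* stop rather than overflow */
        if (used > 0) buf[used++] = ' ';
        memcpy(buf + used, tok, len + 1);
        used += len;
    }
}
```

With keep_before_star set, the text “mo* the*” yields the tokens “mo” and “the”, so the prefix term survives. The same setting also keeps “the” in ordinary text such as “see the* footnote”, which is the false positive mentioned above.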

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: How can custom tokenizer tell it's parsing a search string?

Dan Kennedy-4
On 12/12/2018 03:37 AM, Jens Alfke wrote:

> Is there any way for a custom FTS4 tokenizer to know when it’s
> tokenizing a search string (the argument of a MATCH expression), as
> opposed to text to be indexed?
>
> Here’s my problem: I’ve implemented a custom tokenizer that skips
> “stop words” (noise words, like “the” and “a” in English). It works
> well. But I’ve just gotten a bug report that some search strings with
> wildcards don’t work. For example, “mo* AND the*” would be expected
> to match text containing the words “Moog” and “theremin”, but instead
> the query fails with the SQLite error “malformed MATCH expression:
> [mo* AND the*]”.
>
> The reason for the error is that when the query runs, FTS4 uses my
> tokenizer to break the search string into words. My tokenizer skips
> “the” because it’s a stop word, so the sequence of tokens FTS4 gets
> is “mo”, “*”, “AND”, “*” … which is invalid since there’s no prefix
> before the second “*”.
>
> I can fix this by preserving stop-words when the tokenizer is being
> used to scan the search string. But I can’t find any way for the
> tokenizer to tell the difference! It’s the same tokenizer instance
> used for indexing, and the SQLite function getNextToken opens it in
> the normal way and calls its xNext function.


I don't think there is any way to tell with FTS3/4. FTS5 passes a
parameter to the tokenizer to indicate this (the mask of
FTS5_TOKENIZE_* flags), but FTS3/4 does not. You wouldn't have this
problem with FTS5 anyhow, because it handles the AND or "*" syntax
before passing whatever is left to the tokenizer.

   https://sqlite.org/fts5.html#custom_tokenizers
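
As a rough sketch of how a tokenizer could use that flags parameter, the C code below keeps stop words only when FTS5 reports that the text is a query prefix (the “the” of “the*”). It is a standalone illustration, not a registered FTS5 tokenizer: stopword_tokenize copies the shape of xTokenize without the Fts5Tokenizer* argument, the stop-word list and the log_token helper are hypothetical, and the flag values are repeated from fts5.h only so that the sketch compiles without the SQLite headers.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Flag values as defined in fts5.h; real code should include sqlite3.h. */
#ifndef FTS5_TOKENIZE_QUERY
#define FTS5_TOKENIZE_QUERY    0x0001
#define FTS5_TOKENIZE_PREFIX   0x0002
#define FTS5_TOKENIZE_DOCUMENT 0x0004
#define FTS5_TOKENIZE_AUX      0x0008
#endif

/* Same shape as the xToken callback that FTS5 hands to xTokenize. */
typedef int (*token_cb)(void *ctx, int tflags, const char *tok, int ntok,
                        int start, int end);

/* Hypothetical stop-word list. */
static const char *const kStopWords[] = { "a", "an", "of", "the" };

static int is_stop_word(const char *w, int n) {
    for (size_t i = 0; i < sizeof kStopWords / sizeof kStopWords[0]; i++) {
        if ((int)strlen(kStopWords[i]) == n && memcmp(w, kStopWords[i], (size_t)n) == 0)
            return 1;
    }
    return 0;
}

/* Hypothetical tokenizer body. Returns 0 (SQLITE_OK) on success, or the
 * first nonzero value returned by the callback. */
static int stopword_tokenize(void *ctx, int flags, const char *text, int ntext,
                             token_cb emit) {
    /* Keep stop words only for a query term that is followed by '*'. */
    int keep_stop = (flags & FTS5_TOKENIZE_PREFIX) != 0;
    int i = 0;
    while (i < ntext) {
        /* Skip separators: anything that is not a letter or digit. */
        while (i < ntext && !isalnum((unsigned char)text[i])) i++;
        if (i >= ntext) break;

        /* Collect one lowercased word, remembering its byte offsets. */
        char buf[64];
        int n = 0, start = i;
        while (i < ntext && isalnum((unsigned char)text[i])) {
            if (n < (int)sizeof buf) buf[n++] = (char)tolower((unsigned char)text[i]);
            i++;
        }
        if (keep_stop || !is_stop_word(buf, n)) {
            int rc = emit(ctx, 0, buf, n, start, i);
            if (rc != 0) return rc;
        }
    }
    return 0;
}

/* Demo helper: records emitted tokens as a space-separated string. */
struct token_log { char text[128]; int used; };

static int log_token(void *ctx, int tflags, const char *tok, int ntok,
                     int start, int end) {
    struct token_log *log = ctx;
    (void)tflags; (void)start; (void)end;
    if (log->used + ntok + 2 > (int)sizeof log->text) return 1; /* abort */
    if (log->used > 0) log->text[log->used++] = ' ';
    memcpy(log->text + log->used, tok, (size_t)ntok);
    log->used += ntok;
    log->text[log->used] = '\0';
    return 0;
}
```

Called with FTS5_TOKENIZE_DOCUMENT on “the Moog theremin”, this emits “moog” and “theremin”. Called with FTS5_TOKENIZE_QUERY | FTS5_TOKENIZE_PREFIX on “the”, it emits “the”, which as a prefix can still match “theremin” in the index even though the word “the” itself was never indexed.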

Leaving stop words in while parsing queries won't quite work anyway. If
your tokenizer returns "the" when parsing a query, FTS3/4 will search
for "the" in the index. And it won't be there if the tokenizer used for
parsing documents stripped it out.

I think your best options might be to switch to FTS5 or to write a
tokenizer smart enough to remove the AND or other syntax tokens when
required.

Dan.




> The best workaround I can think of is to make the tokenizer preserve
> a stop-word when it’s followed by a “*” … but there are contexts
> where this can happen in regular text being indexed, when the “*” is
> a footnote marker or the end of a Markdown emphasis sequence.
>

Re: How can custom tokenizer tell it's parsing a search string?

Jens Alfke-2
Thanks for the reply, Dan!

> On Dec 12, 2018, at 7:08 AM, Dan Kennedy <[hidden email]> wrote:
>
> Leaving stop words in while parsing queries won't quite work anyway. If your tokenizer returns "the" when parsing a query, FTS3/4 will search for "the" in the index. And it won't be there if the tokenizer used for parsing documents stripped it out.

I was only talking about leaving them in when followed immediately by a “*” — so it would preserve “the*” but not “the”. Then FTS4 will interpret “the*” as a prefix match, not the word “the”.

> I think your best options might be to switch to FTS5

I haven’t looked into how hard it would be to switch to FTS5. I recall that when I started writing this code a few years ago, FTS5 had some issues or limitations that led me to use FTS4 instead.

Also, there are by now many databases out in the field that have FTS4 tables/indexes in them. If I switch to FTS5 will those be upgraded, or do I need to do so manually?

>  or to write a tokenizer smart enough to remove the AND or other syntax tokens when required.

Not sure what you mean by this — the “when required” part is the sticking point, which is the reason I posted.

—Jens