Some FTS5 guidance

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Some FTS5 guidance

Mario M. Westphal-2
Hello,

 

I recently looked into FTS 5.

The documentation is clear and I was able to get it running with a small
test database quickly. And the response times are awesome :-)

 

My question:

 

At least as I understand it at this point, FTS can only do prefix queries.

 

If my database contains the words

 

moon

moonlight

moonshine

shine

sunshine

 

A FTS query like "moon*" will find all three terms starting with "moon" -
very fast.

 

But there is no way to find "moonshine" or "sunshine" by running a query for
"shine" or "shine*" ?

 

Currently I search using LIKE and there such 'contains' queries are easy. My
users of course don't understand all this and want to find all words
containing shine, wherever the term appears in the word.

 

The only idea I had so far was to write my own tokenizer and to store each
word with every possible 'sub-word':

 

When "moonshine" is added to FTS, it is split into multiple words:



moonshine
oonshine
onshine
nshine
shine
.



(maybe I limit this to a minimum of 2 or 3 characters).

 

This of course produces a log of extra entries in FTS and may impact
performance and database size.

I hence wonder if this problem has been tackled already and if there is a
"standard" solution.

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Some FTS5 guidance

Graham Holden
I've never used FTS, just throwing an off-the-wall idea out: instead of tokenising partial words, could you tokenise/store the reverse of each word (possibly in a separate place if that can be done):

enihsnoom
enihs
enihsnus

Then search for "enihs" as well as "shine". If you can't separate the forward and reversed versions, you'd have to filter-out when "dog" matches "god".

Graham

Sent from Samsung Mobile

-------- Original message --------
From: "Mario M. Westphal" <[hidden email]>
Date: 07/01/2016  18:31  (GMT+00:00)
To: [hidden email]
Subject: [sqlite] Some FTS5 guidance
 
Hello,



I recently looked into FTS 5.

The documentation is clear and I was able to get it running with a small
test database quickly. And the response times are awesome :-)



My question:



At least as I understand it at this point, FTS can only do prefix queries.



If my database contains the words



moon

moonlight

moonshine

shine

sunshine



A FTS query like "moon*" will find all three terms starting with "moon" -
very fast.



But there is no way to find "moonshine" or "sunshine" by running a query for
"shine" or "shine*" ?



Currently I search using LIKE and there such 'contains' queries are easy. My
users of course don't understand all this and want to find all words
containing shine, wherever the term appears in the word.



The only idea I had so far was to write my own tokenizer and to store each
word with every possible 'sub-word':



When "moonshine" is added to FTS, it is split into multiple words:



moonshine
oonshine
onshine
nshine
shine
.



(maybe I limit this to a minimum of 2 or 3 characters).



This of course produces a log of extra entries in FTS and may impact
performance and database size.

I hence wonder if this problem has been tackled already and if there is a
"standard" solution.

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Some FTS5 guidance

Matthias-Christian Ott
In reply to this post by Mario M. Westphal-2
On 2016-01-07 19:31, Mario M. Westphal wrote:
> I hence wonder if this problem has been tackled already and if there is a
> "standard" solution.

If I understand you correctly, it seems that you are looking for a
compound splitting or decompounding algorithm. Unfortunately there is
not a "standard solution" for this. There are many languages in the
world and for some usable compound splitting algorithms exist. There are
also attempts to create statistical universal algorithms.

As you said, for English a simple sub-string search might suffice but
for other languages it more complex. I assume that you speak German. If
you have a document that contains the term "Verkehrsleitsystem" and your
search query is "Verkehr leiten", it's reasonable to assume that the
document is relevant to the search query. Unfortunately a sub-string
search could not find the document. Other languages are even more
difficult (a textbook on linguistics will explain this better than I can).

Even if you have such algorithm, it's not trivial to score the results
and there are more aspects to consider to create a simple search
algorithm. For example, in English you will also have to do some
analysis of the phrase structure to identify open compounds.

Perhaps it helps to mention the languages you are interested in and the
application you have in mind to evaluate whether the SQLite FTS5 could
meet your requirements.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Some FTS5 guidance

Stadin, Benjamin
One such algorithm would be a (generalized) Ukkonnen suffix tree (https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm).
It allows you to search efficiently for substrings.
It would be possible to do some match weigthing based on match distance within words. But a general solution for a database is probably not trivial to implement.

Ben

Von meinem iPad gesendet

> Am 07.01.2016 um 21:46 schrieb Matthias-Christian Ott <[hidden email]>:
>
>> On 2016-01-07 19:31, Mario M. Westphal wrote:
>> I hence wonder if this problem has been tackled already and if there is a
>> "standard" solution.
>
> If I understand you correctly, it seems that you are looking for a
> compound splitting or decompounding algorithm. Unfortunately there is
> not a "standard solution" for this. There are many languages in the
> world and for some usable compound splitting algorithms exist. There are
> also attempts to create statistical universal algorithms.
>
> As you said, for English a simple sub-string search might suffice but
> for other languages it more complex. I assume that you speak German. If
> you have a document that contains the term "Verkehrsleitsystem" and your
> search query is "Verkehr leiten", it's reasonable to assume that the
> document is relevant to the search query. Unfortunately a sub-string
> search could not find the document. Other languages are even more
> difficult (a textbook on linguistics will explain this better than I can).
>
> Even if you have such algorithm, it's not trivial to score the results
> and there are more aspects to consider to create a simple search
> algorithm. For example, in English you will also have to do some
> analysis of the phrase structure to identify open compounds.
>
> Perhaps it helps to mention the languages you are interested in and the
> application you have in mind to evaluate whether the SQLite FTS5 could
> meet your requirements.
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Some FTS5 guidance

Charles Leifer
You can create a custom tokenizer as well then use the standard search
APIs. I imagine that functionality would work well in this case:
https://sqlite.org/fts5.html#section_7

On Thu, Jan 7, 2016 at 3:59 PM, Stadin, Benjamin <
[hidden email]> wrote:

> One such algorithm would be a (generalized) Ukkonnen suffix tree (
> https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm).
> It allows you to search efficiently for substrings.
> It would be possible to do some match weigthing based on match distance
> within words. But a general solution for a database is probably not trivial
> to implement.
>
> Ben
>
> Von meinem iPad gesendet
>
> > Am 07.01.2016 um 21:46 schrieb Matthias-Christian Ott <[hidden email]>:
> >
> >> On 2016-01-07 19:31, Mario M. Westphal wrote:
> >> I hence wonder if this problem has been tackled already and if there is
> a
> >> "standard" solution.
> >
> > If I understand you correctly, it seems that you are looking for a
> > compound splitting or decompounding algorithm. Unfortunately there is
> > not a "standard solution" for this. There are many languages in the
> > world and for some usable compound splitting algorithms exist. There are
> > also attempts to create statistical universal algorithms.
> >
> > As you said, for English a simple sub-string search might suffice but
> > for other languages it more complex. I assume that you speak German. If
> > you have a document that contains the term "Verkehrsleitsystem" and your
> > search query is "Verkehr leiten", it's reasonable to assume that the
> > document is relevant to the search query. Unfortunately a sub-string
> > search could not find the document. Other languages are even more
> > difficult (a textbook on linguistics will explain this better than I
> can).
> >
> > Even if you have such algorithm, it's not trivial to score the results
> > and there are more aspects to consider to create a simple search
> > algorithm. For example, in English you will also have to do some
> > analysis of the phrase structure to identify open compounds.
> >
> > Perhaps it helps to mention the languages you are interested in and the
> > application you have in mind to evaluate whether the SQLite FTS5 could
> > meet your requirements.
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Some FTS5 guidance

Scott Hess
With fts4 you could search for matching terms in an fts4aux table, then use
those to construct a query against the original table.  You'd have a full
scan of the fts index, but you'd not have to do a full table scan of the
primary data.  Unfortunately if there were a large number of hits in the
index scan, then it would be cheaper to just do the full table scan and
skip the index scan.  I don't know if there's a similar thing for fts5 at
this time.

This wouldn't be as efficient as something more suited to substring matches
(an N-gram index, maybe?), but I haven't heard anyone talking about writing
a virtual table to do that.

-scott


On Fri, Jan 8, 2016 at 11:54 AM, Charles Leifer <[hidden email]> wrote:

> You can create a custom tokenizer as well then use the standard search
> APIs. I imagine that functionality would work well in this case:
> https://sqlite.org/fts5.html#section_7
>
> On Thu, Jan 7, 2016 at 3:59 PM, Stadin, Benjamin <
> [hidden email]> wrote:
>
> > One such algorithm would be a (generalized) Ukkonnen suffix tree (
> > https://en.m.wikipedia.org/wiki/Ukkonen%27s_algorithm).
> > It allows you to search efficiently for substrings.
> > It would be possible to do some match weigthing based on match distance
> > within words. But a general solution for a database is probably not
> trivial
> > to implement.
> >
> > Ben
> >
> > Von meinem iPad gesendet
> >
> > > Am 07.01.2016 um 21:46 schrieb Matthias-Christian Ott <[hidden email]>:
> > >
> > >> On 2016-01-07 19:31, Mario M. Westphal wrote:
> > >> I hence wonder if this problem has been tackled already and if there
> is
> > a
> > >> "standard" solution.
> > >
> > > If I understand you correctly, it seems that you are looking for a
> > > compound splitting or decompounding algorithm. Unfortunately there is
> > > not a "standard solution" for this. There are many languages in the
> > > world and for some usable compound splitting algorithms exist. There
> are
> > > also attempts to create statistical universal algorithms.
> > >
> > > As you said, for English a simple sub-string search might suffice but
> > > for other languages it more complex. I assume that you speak German. If
> > > you have a document that contains the term "Verkehrsleitsystem" and
> your
> > > search query is "Verkehr leiten", it's reasonable to assume that the
> > > document is relevant to the search query. Unfortunately a sub-string
> > > search could not find the document. Other languages are even more
> > > difficult (a textbook on linguistics will explain this better than I
> > can).
> > >
> > > Even if you have such algorithm, it's not trivial to score the results
> > > and there are more aspects to consider to create a simple search
> > > algorithm. For example, in English you will also have to do some
> > > analysis of the phrase structure to identify open compounds.
> > >
> > > Perhaps it helps to mention the languages you are interested in and the
> > > application you have in mind to evaluate whether the SQLite FTS5 could
> > > meet your requirements.
> > > _______________________________________________
> > > sqlite-users mailing list
> > > [hidden email]
> > > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> > _______________________________________________
> > sqlite-users mailing list
> > [hidden email]
> > http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> >
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users