Constraining FTS5 results based on offsets()

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Constraining FTS5 results based on offsets()

miroslav.marangozov@outlook.com
Hi,
For my use case I need to index some texts, and I also need be able to assign attributes to sub-sequences/spans of tokens in the texts. I want to be able to search only for keywords/phrases that occur in spans with a certain attribute. As an example, imagine we have a set of rich text documents and we want to find the locations of the "SQLite rocks" phrase in that set, but not just any instance of it- only those that have an attribute "bold".

The general idea, I'm considering at the moment is as follows:
1) create a table 'spans' with columns (doc_id, attrib_id, start_tok, end_tok), where doc_id is equal to the corresponding rowid in the FTS table, and start_tok and end_tok are the 0-based offsets that delimit a span
2) issue a match (sub?)query against the FTS table and obtain a list of (rowid, list-of-offsets in that doc)
3) somehow convert the above results into (rowid, start_offset, end_offset) for each entry of the list-of-offsets
4) join the results from 3) with the "spans" table on "rowid=spans.doc_id" where "spans.attrib_id=X and start_offset >= spans.start_pos and end_offset <= spans.end_pos"

Do you think this general approach makes sense, and how would you approach the problem if not? I only have a cursory knowledge of both SQL and SQLite at this point, so it's quite possible I'm missing something obvious.

On the implementation side:
- as far as I understand FTS5 has some clear advantages over FTS3/4. Apparently there isn't readily available offsets() function in FTS5 yet, but the API should make writing one rather straightforward

- step 3) is what I wonder about the most... What would be a good way to convert a (doc_id, <list-of-offsets>) row into (doc_id, start_offset, end_offset) tuples for every entry in the list? I'd guess I will have to implement some sort of virtual table(a.k.a. "table-valued function")? Perhaps a table-valued function that takes a FTS query as its parameter that it then uses to obtain (doc_id, <list-of-offsets>), and expose each hit as (doc_id, start_offset, end_offset) rows in the virtual table?

Any thoughts and ideas would be greatly appreciated.

Thank you!
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Constraining FTS5 results based on offsets()

Dan Kennedy-4
On 04/08/2018 02:00 PM, [hidden email] wrote:

> Hi,
> For my use case I need to index some texts, and I also need be able to assign attributes to sub-sequences/spans of tokens in the texts. I want to be able to search only for keywords/phrases that occur in spans with a certain attribute. As an example, imagine we have a set of rich text documents and we want to find the locations of the "SQLite rocks" phrase in that set, but not just any instance of it- only those that have an attribute "bold".
>
> The general idea, I'm considering at the moment is as follows:
> 1) create a table 'spans' with columns (doc_id, attrib_id, start_tok, end_tok), where doc_id is equal to the corresponding rowid in the FTS table, and start_tok and end_tok are the 0-based offsets that delimit a span
> 2) issue a match (sub?)query against the FTS table and obtain a list of (rowid, list-of-offsets in that doc)
> 3) somehow convert the above results into (rowid, start_offset, end_offset) for each entry of the list-of-offsets
> 4) join the results from 3) with the "spans" table on "rowid=spans.doc_id" where "spans.attrib_id=X and start_offset >= spans.start_pos and end_offset <= spans.end_pos"
>
> Do you think this general approach makes sense, and how would you approach the problem if not? I only have a cursory knowledge of both SQL and SQLite at this point, so it's quite possible I'm missing something obvious.
>
> On the implementation side:
> - as far as I understand FTS5 has some clear advantages over FTS3/4. Apparently there isn't readily available offsets() function in FTS5 yet, but the API should make writing one rather straightforward
>
> - step 3) is what I wonder about the most... What would be a good way to convert a (doc_id, <list-of-offsets>) row into (doc_id, start_offset, end_offset) tuples for every entry in the list? I'd guess I will have to implement some sort of virtual table(a.k.a. "table-valued function")? Perhaps a table-valued function that takes a FTS query as its parameter that it then uses to obtain (doc_id, <list-of-offsets>), and expose each hit as (doc_id, start_offset, end_offset) rows in the virtual table?
>
> Any thoughts and ideas would be greatly appreciated.

Maybe just implement an FTS5 function to do the filtering in step 4
above. So that your query looks like:

   SELECT * FROM fts5tbl('SQLite+rocks'), spans
   WHERE ft.rowid=spans.doc_id
     AND spans.attrib_id = 'bold'
     AND custom_fts5_function(fts5tbl, spans.start_tok, spans.end_tok);

where custom_fts5_function() returns true if there are any phrase hits
with token offsets between its two trailing arguments. Or whatever it is
you require.

   https://sqlite.org/fts5.html#custom_auxiliary_functions

Dan.





>
> Thank you!
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Constraining FTS5 results based on offsets()

miroslav.marangozov@outlook.com
Thank you Dan!
This should be much easier to implement than a full blown virtual table, but as far as I can tell(as I've mentioned, I'm not an expert in this) it doesn't do exactly what I want. I need all the locations of the search phrases, subject to the attribute constraint, in a document, but this query will only return the spans that contain one or more instance of the phrase. Or maybe I'm missing something?

Provided my understanding is correct, what do you think would be most the pragmatic way to change the query, so that it returns the locations of the phrases? What I imagine could work, even though it's admittedly ugly, is to write another UDF, let's call it "hacky_offsets()", that has some sort of "private" communication channel with custom_fts5_function(). So the query will be something like:

   SELECT fts5tbl.rowid, hacky_offsets()
   FROM fts5tbl('SQLite+rocks'), spans
   WHERE ft5tbl.rowid=spans.doc_id
    AND spans.attrib_id = 'bold'
    AND custom_fts5_function(fts5tbl, spans.start_tok, spans.end_tok);

When custom_fts5_function() finds instances of the phrases that fall between start_tok/end_tok it records these instances into a location hacky_offsets() knows about, so it can read them from there and return them to the application.
Would this even work and can you think of a more idiomatic/elegant implementation?

Thanks!


________________________________
From: sqlite-users <[hidden email]> on behalf of Dan Kennedy <[hidden email]>
Sent: Tuesday, April 10, 2018 11:49 AM
To: [hidden email]
Subject: Re: [sqlite] Constraining FTS5 results based on offsets()

On 04/08/2018 02:00 PM, [hidden email] wrote:

> Hi,
> For my use case I need to index some texts, and I also need be able to assign attributes to sub-sequences/spans of tokens in the texts. I want to be able to search only for keywords/phrases that occur in spans with a certain attribute. As an example, imagine we have a set of rich text documents and we want to find the locations of the "SQLite rocks" phrase in that set, but not just any instance of it- only those that have an attribute "bold".
>
> The general idea, I'm considering at the moment is as follows:
> 1) create a table 'spans' with columns (doc_id, attrib_id, start_tok, end_tok), where doc_id is equal to the corresponding rowid in the FTS table, and start_tok and end_tok are the 0-based offsets that delimit a span
> 2) issue a match (sub?)query against the FTS table and obtain a list of (rowid, list-of-offsets in that doc)
> 3) somehow convert the above results into (rowid, start_offset, end_offset) for each entry of the list-of-offsets
> 4) join the results from 3) with the "spans" table on "rowid=spans.doc_id" where "spans.attrib_id=X and start_offset >= spans.start_pos and end_offset <= spans.end_pos"
>
> Do you think this general approach makes sense, and how would you approach the problem if not? I only have a cursory knowledge of both SQL and SQLite at this point, so it's quite possible I'm missing something obvious.
>
> On the implementation side:
> - as far as I understand FTS5 has some clear advantages over FTS3/4. Apparently there isn't readily available offsets() function in FTS5 yet, but the API should make writing one rather straightforward
>
> - step 3) is what I wonder about the most... What would be a good way to convert a (doc_id, <list-of-offsets>) row into (doc_id, start_offset, end_offset) tuples for every entry in the list? I'd guess I will have to implement some sort of virtual table(a.k.a. "table-valued function")? Perhaps a table-valued function that takes a FTS query as its parameter that it then uses to obtain (doc_id, <list-of-offsets>), and expose each hit as (doc_id, start_offset, end_offset) rows in the virtual table?
>
> Any thoughts and ideas would be greatly appreciated.

Maybe just implement an FTS5 function to do the filtering in step 4
above. So that your query looks like:

   SELECT * FROM fts5tbl('SQLite+rocks'), spans
   WHERE ft.rowid=spans.doc_id
     AND spans.attrib_id = 'bold'
     AND custom_fts5_function(fts5tbl, spans.start_tok, spans.end_tok);

where custom_fts5_function() returns true if there are any phrase hits
with token offsets between its two trailing arguments. Or whatever it is
you require.

   https://sqlite.org/fts5.html#custom_auxiliary_functions

Dan.





>
> Thank you!
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users