Full text search without full phrase matches

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Full text search without full phrase matches

pcarb
I had to implement something like this for comparing passages from statutes (see the Introduction in Douglas Hay and Paul Craven, *Masters, Servants and Magistrates in Britain and the Empire, 1562-1955* [UNCP Press, 2004] for an illustration).

You need to isolate the keywords, in whatever order, count them, and measure the distances (number of words) between them.  SqLite is great for managing the tables of keywords, the lists of texts that contain them, and tables of distances.  But it is not the optimal tool for breaking down the texts and extracting the keywords and distances.  I used Perl for this job, and found that I could easily adapt recipes from the Perl Cookbook and similar repositories to build my routines.  I wrote the disaggregated lists of keywords, distances and texts as sql tables and analysed them in SqLite.

Paul Craven
York University

----------------------------------

Date: Wed, 13 Jun 2012 23:09:35 +0200
From: Philip Bennefall <[hidden email]>
To: <[hidden email]>
Subject: [sqlite] Full text search without full phrase matches
Message-ID: <A12309DB130E42BBA0590D664F66922A@chicken>
Content-Type: text/plain; charset="iso-8859-1"

Hi all,

I am new to this maling list and to SqLite, so I wanted to start by thanking all of those who make this project a reality. It is a great tool.

Now, to my question. I am trying to use the full text search feature to find rough matches for a chat robot. Basically I want to match as many keywords as possible, but not necessarily all of them. The results should be sorted based on how many keywords were found in the phrase and how closely ordered they are to the query. In other words the ordering doesn't have to be exact, but the closer it is, the higher the result should rank. Similarly, even if only one or two words in the phrase are found it should match, but rank higher the more of the words that are present. I have read the reference and I see the NEAR statement and the matchinfo function, as well as the example of how to use it, but I cannot figure out how to apply this knowledge to my specific problem. Does anyone have any suggestions?

Thanks in advance for your help.

Kind regards,

Philip Bennefall
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Full text search without full phrase matches

Philip Bennefall
Hi Paul, and thank you for your reply.

The trouble I have is that in my query, all the keywords don't necessarily
have to be present in order for a successful match to be made. SqLite's fts
only seems to match if all the keywords are present, which I don't require.

I am not familiar with Perl, but am working exclusively in C++.

The input I am processing is arbitrary, and so is the data that I am
searching through in the index. The incoming data is user messages, and the
index contains old messages that the robot has given to users (stemmed and
stripped in various ways to make matches more probable), and then there's
another column which contains an appropriate answer if that query is
matched. I want it to match as many keywords as possible but not necessarily
all, and order by:
1. How many keywords were matched, with some minimum threshold below which
no match is made.
2. How well the ordering matched.

Do you have any tips?

Kind regards,

Philip Bennefall
----- Original Message -----
From: <[hidden email]>
To: <[hidden email]>
Sent: Thursday, June 14, 2012 7:01 PM
Subject: [sqlite] Full text search without full phrase matches


I had to implement something like this for comparing passages from statutes
(see the Introduction in Douglas Hay and Paul Craven, *Masters, Servants and
Magistrates in Britain and the Empire, 1562-1955* [UNCP Press, 2004] for an
illustration).

You need to isolate the keywords, in whatever order, count them, and measure
the distances (number of words) between them.  SqLite is great for managing
the tables of keywords, the lists of texts that contain them, and tables of
distances.  But it is not the optimal tool for breaking down the texts and
extracting the keywords and distances.  I used Perl for this job, and found
that I could easily adapt recipes from the Perl Cookbook and similar
repositories to build my routines.  I wrote the disaggregated lists of
keywords, distances and texts as sql tables and analysed them in SqLite.

Paul Craven
York University

----------------------------------

Date: Wed, 13 Jun 2012 23:09:35 +0200
From: Philip Bennefall <[hidden email]>
To: <[hidden email]>
Subject: [sqlite] Full text search without full phrase matches
Message-ID: <A12309DB130E42BBA0590D664F66922A@chicken>
Content-Type: text/plain; charset="iso-8859-1"

Hi all,

I am new to this maling list and to SqLite, so I wanted to start by thanking
all of those who make this project a reality. It is a great tool.

Now, to my question. I am trying to use the full text search feature to find
rough matches for a chat robot. Basically I want to match as many keywords
as possible, but not necessarily all of them. The results should be sorted
based on how many keywords were found in the phrase and how closely ordered
they are to the query. In other words the ordering doesn't have to be exact,
but the closer it is, the higher the result should rank. Similarly, even if
only one or two words in the phrase are found it should match, but rank
higher the more of the words that are present. I have read the reference and
I see the NEAR statement and the matchinfo function, as well as the example
of how to use it, but I cannot figure out how to apply this knowledge to my
specific problem. Does anyone have any suggestions?

Thanks in advance for your help.

Kind regards,

Philip Bennefall
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users 

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Full text search without full phrase matches

Simon Slavin-3

On 14 Jun 2012, at 6:12pm, Philip Bennefall <[hidden email]> wrote:

> The trouble I have is that in my query, all the keywords don't necessarily have to be present in order for a successful match to be made. SqLite's fts only seems to match if all the keywords are present, which I don't require.

You will have to do some of this in your own programming.  In fact you may end up doing all of it in your own programming and using the FTS feature only to match single words and single word-fragments.

The usual way to do this is to define a 'distance' metric for comparing two strings and finding how far apart they are.  A score of 0 means they match exactly.  Numbers bigger than a certain amount don't matter: anything bigger than, say, 100 means they're not at all alike.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Full text search without full phrase matches

Philip Bennefall
Hi Simon,

That is unfortunate, if it is true that there's no way to accomplish this
with SqLite. To do just plain matching I can use an unordered hash map, so I
wouldn't need a database for that. The trouble with a string distance
function is that I can't really process the entire dataset with it. SqLite
technically has all the features I'm after, I just don't want it to
necessarily match all the words in a query. If I can get it to match all as
well as some, that would be enough. I could then do distancing on a
considerably smaller dataset which would be the result of the broader SqLite
search.

So I guess my main question is, is there absolutely no way to match a subset
of the words in a query?

Kind regards,

Philip Bennefall
----- Original Message -----
From: "Simon Slavin" <[hidden email]>
To: "General Discussion of SQLite Database" <[hidden email]>
Sent: Thursday, June 14, 2012 7:24 PM
Subject: Re: [sqlite] Full text search without full phrase matches



On 14 Jun 2012, at 6:12pm, Philip Bennefall <[hidden email]> wrote:

> The trouble I have is that in my query, all the keywords don't necessarily
> have to be present in order for a successful match to be made. SqLite's
> fts only seems to match if all the keywords are present, which I don't
> require.

You will have to do some of this in your own programming.  In fact you may
end up doing all of it in your own programming and using the FTS feature
only to match single words and single word-fragments.

The usual way to do this is to define a 'distance' metric for comparing two
strings and finding how far apart they are.  A score of 0 means they match
exactly.  Numbers bigger than a certain amount don't matter: anything bigger
than, say, 100 means they're not at all alike.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Full text search without full phrase matches

Simon Slavin-3

On 14 Jun 2012, at 7:13pm, Philip Bennefall <[hidden email]> wrote:

> That is unfortunate, if it is true that there's no way to accomplish this with SqLite. To do just plain matching I can use an unordered hash map, so I wouldn't need a database for that. The trouble with a string distance function is that I can't really process the entire dataset with it. SqLite technically has all the features I'm after, I just don't want it to necessarily match all the words in a query. If I can get it to match all as well as some, that would be enough. I could then do distancing on a considerably smaller dataset which would be the result of the broader SqLite search.
>
> So I guess my main question is, is there absolutely no way to match a subset of the words in a query?

Well, you could write that string distance function and add it to your copy of SQLite as an external function.  Then you could do things like

SELECT string_distance(theText, 'this new piece of text'), theText FROM oldChats WHERE string_distance(theText, 'this new piece of text') < 10 ORDER BY string_distance(theText, 'this new piece of text')

(I don't know whether SQLite will optimise that to avoid executing the same function many times, or whether you can name a column and use that name to do the same thing yourself.)

Here's the documentation for external functions:

<http://www.sqlite.org/c3ref/create_function.html>

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Full text search without full phrase matches

Philip Bennefall
Hi Simon,

The ordering is not really the issue I am having. That, I can do if I just
get a result back that doesn't necessarily match all the keywords. In the
query you showed as an example, all the keywords would still have to match
in order for a row to be returned. The sorting is a separate problem that is
not really that difficult once I get a smaller dataset. Then I can order it
manually. The problem is that it only returns a match if every single word
is present. I would like it to return matches if, say, mor than 2 or 3 of
the specified keywords are found.

Kind regards,

Philip Bennefall
----- Original Message -----
From: "Simon Slavin" <[hidden email]>
To: "General Discussion of SQLite Database" <[hidden email]>
Sent: Thursday, June 14, 2012 8:24 PM
Subject: Re: [sqlite] Full text search without full phrase matches



On 14 Jun 2012, at 7:13pm, Philip Bennefall <[hidden email]> wrote:

> That is unfortunate, if it is true that there's no way to accomplish this
> with SqLite. To do just plain matching I can use an unordered hash map, so
> I wouldn't need a database for that. The trouble with a string distance
> function is that I can't really process the entire dataset with it. SqLite
> technically has all the features I'm after, I just don't want it to
> necessarily match all the words in a query. If I can get it to match all
> as well as some, that would be enough. I could then do distancing on a
> considerably smaller dataset which would be the result of the broader
> SqLite search.
>
> So I guess my main question is, is there absolutely no way to match a
> subset of the words in a query?

Well, you could write that string distance function and add it to your copy
of SQLite as an external function.  Then you could do things like

SELECT string_distance(theText, 'this new piece of text'), theText FROM
oldChats WHERE string_distance(theText, 'this new piece of text') < 10 ORDER
BY string_distance(theText, 'this new piece of text')

(I don't know whether SQLite will optimise that to avoid executing the same
function many times, or whether you can name a column and use that name to
do the same thing yourself.)

Here's the documentation for external functions:

<http://www.sqlite.org/c3ref/create_function.html>

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users 

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Full text search without full phrase matches

Simon Slavin-3

On 14 Jun 2012, at 7:32pm, Philip Bennefall <[hidden email]> wrote:

> The problem is that it only returns a match if every single word is present. I would like it to return matches if, say, mor than 2 or 3 of the specified keywords are found.

As far as I can figure, you need to write that function yourself.  Either as a SQLite function to build into SQLite, or in whatever programming language you want to use.  Sorry.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Full text search without full phrase matches

Black, Michael (IS)
In reply to this post by Philip Bennefall
Sounds to me like you want Lucene instead of SQLite

http://lucene.apache.org/core/



Michael D. Black

Senior Scientist

Advanced Analytics Directorate

Advanced GEOINT Solutions Operating Unit

Northrop Grumman Information Systems

________________________________
From: [hidden email] [[hidden email]] on behalf of Philip Bennefall [[hidden email]]
Sent: Thursday, June 14, 2012 1:32 PM
To: General Discussion of SQLite Database
Subject: EXT :Re: [sqlite] Full text search without full phrase matches

Hi Simon,

The ordering is not really the issue I am having. That, I can do if I just
get a result back that doesn't necessarily match all the keywords. In the
query you showed as an example, all the keywords would still have to match
in order for a row to be returned. The sorting is a separate problem that is
not really that difficult once I get a smaller dataset. Then I can order it
manually. The problem is that it only returns a match if every single word
is present. I would like it to return matches if, say, mor than 2 or 3 of
the specified keywords are found.

Kind regards,

Philip Bennefall
----- Original Message -----
From: "Simon Slavin" <[hidden email]>
To: "General Discussion of SQLite Database" <[hidden email]>
Sent: Thursday, June 14, 2012 8:24 PM
Subject: Re: [sqlite] Full text search without full phrase matches



On 14 Jun 2012, at 7:13pm, Philip Bennefall <[hidden email]> wrote:

> That is unfortunate, if it is true that there's no way to accomplish this
> with SqLite. To do just plain matching I can use an unordered hash map, so
> I wouldn't need a database for that. The trouble with a string distance
> function is that I can't really process the entire dataset with it. SqLite
> technically has all the features I'm after, I just don't want it to
> necessarily match all the words in a query. If I can get it to match all
> as well as some, that would be enough. I could then do distancing on a
> considerably smaller dataset which would be the result of the broader
> SqLite search.
>
> So I guess my main question is, is there absolutely no way to match a
> subset of the words in a query?

Well, you could write that string distance function and add it to your copy
of SQLite as an external function.  Then you could do things like

SELECT string_distance(theText, 'this new piece of text'), theText FROM
oldChats WHERE string_distance(theText, 'this new piece of text') < 10 ORDER
BY string_distance(theText, 'this new piece of text')

(I don't know whether SQLite will optimise that to avoid executing the same
function many times, or whether you can name a column and use that name to
do the same thing yourself.)

Here's the documentation for external functions:

<http://www.sqlite.org/c3ref/create_function.html>

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Full text search without full phrase matches

Philip Bennefall
Hi Michael,

That seems to be under either gpl/lgpl/apache licenses, which I cannot use in my project for various reasons. The reason I am so interested in SqLite is because it's public domain. I appreciate the tip though.

Kind regards,

Philip Bennefall
  ----- Original Message -----
  From: Black, Michael (IS)
  To: [hidden email] ; General Discussion of SQLite Database
  Sent: Thursday, June 14, 2012 9:03 PM
  Subject: Re: [sqlite] Full text search without full phrase matches


  Sounds to me like you want Lucene instead of SQLite

  http://lucene.apache.org/core/



  Michael D. Black

  Senior Scientist

  Advanced Analytics Directorate

  Advanced GEOINT Solutions Operating Unit

  Northrop Grumman Information Systems


------------------------------------------------------------------------------

  From: [hidden email] [[hidden email]] on behalf of Philip Bennefall [[hidden email]]
  Sent: Thursday, June 14, 2012 1:32 PM
  To: General Discussion of SQLite Database
  Subject: EXT :Re: [sqlite] Full text search without full phrase matches


  Hi Simon,

  The ordering is not really the issue I am having. That, I can do if I just
  get a result back that doesn't necessarily match all the keywords. In the
  query you showed as an example, all the keywords would still have to match
  in order for a row to be returned. The sorting is a separate problem that is
  not really that difficult once I get a smaller dataset. Then I can order it
  manually. The problem is that it only returns a match if every single word
  is present. I would like it to return matches if, say, mor than 2 or 3 of
  the specified keywords are found.

  Kind regards,

  Philip Bennefall
  ----- Original Message -----
  From: "Simon Slavin" <[hidden email]>
  To: "General Discussion of SQLite Database" <[hidden email]>
  Sent: Thursday, June 14, 2012 8:24 PM
  Subject: Re: [sqlite] Full text search without full phrase matches



  On 14 Jun 2012, at 7:13pm, Philip Bennefall <[hidden email]> wrote:

  > That is unfortunate, if it is true that there's no way to accomplish this
  > with SqLite. To do just plain matching I can use an unordered hash map, so
  > I wouldn't need a database for that. The trouble with a string distance
  > function is that I can't really process the entire dataset with it. SqLite
  > technically has all the features I'm after, I just don't want it to
  > necessarily match all the words in a query. If I can get it to match all
  > as well as some, that would be enough. I could then do distancing on a
  > considerably smaller dataset which would be the result of the broader
  > SqLite search.
  >
  > So I guess my main question is, is there absolutely no way to match a
  > subset of the words in a query?

  Well, you could write that string distance function and add it to your copy
  of SQLite as an external function.  Then you could do things like

  SELECT string_distance(theText, 'this new piece of text'), theText FROM
  oldChats WHERE string_distance(theText, 'this new piece of text') < 10 ORDER
  BY string_distance(theText, 'this new piece of text')

  (I don't know whether SQLite will optimise that to avoid executing the same
  function many times, or whether you can name a column and use that name to
  do the same thing yourself.)

  Here's the documentation for external functions:

  <http://www.sqlite.org/c3ref/create_function.html>

  Simon.
  _______________________________________________
  sqlite-users mailing list
  [hidden email]
  http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users 

  _______________________________________________
  sqlite-users mailing list
  [hidden email]
  http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Full text search without full phrase matches

Philip Bennefall
The main issue I have is that the Apache license, as well as the gpl/lgpl licenses, force you to include the entire license with any redistribution of the software (even if it is compiled as part of a derivative work) if my understanding is correct. I am not too keen on doing that. I certainly don't mind giving credit where credit is due, but I don't want to include this huge blob of legal text. This is why I like public domain software so much, as well as software distributed under licenses such as the Boost Software license, the Zlib/Libpng license and one or two others that are similar.

Kind regards,

Philip Bennefall
  ----- Original Message -----
  From: Black, Michael (IS)
  To: [hidden email]
  Sent: Thursday, June 14, 2012 9:34 PM
  Subject: Re: [sqlite] Full text search without full phrase matches


  Apache license is about as liberal as you can get.  Damned near public domain (may as well be for all intents and purposes).



  Muy I ask what the problem is?  Perhaps something I should aware of?





  Michael D. Black

  Senior Scientist

  Advanced Analytics Directorate

  Advanced GEOINT Solutions Operating Unit

  Northrop Grumman Information Systems


------------------------------------------------------------------------------

  From: [hidden email] [[hidden email]] on behalf of Philip Bennefall [[hidden email]]
  Sent: Thursday, June 14, 2012 2:09 PM
  To: General Discussion of SQLite Database
  Subject: EXT :Re: [sqlite] Full text search without full phrase matches


  Hi Michael,

  That seems to be under either gpl/lgpl/apache licenses, which I cannot use in my project for various reasons. The reason I am so interested in SqLite is because it's public domain. I appreciate the tip though.

  Kind regards,

  Philip Bennefall
    ----- Original Message -----
    From: Black, Michael (IS)
    To: [hidden email] ; General Discussion of SQLite Database
    Sent: Thursday, June 14, 2012 9:03 PM
    Subject: Re: [sqlite] Full text search without full phrase matches


    Sounds to me like you want Lucene instead of SQLite

    http://lucene.apache.org/core/



    Michael D. Black

    Senior Scientist

    Advanced Analytics Directorate

    Advanced GEOINT Solutions Operating Unit

    Northrop Grumman Information Systems


  ------------------------------------------------------------------------------

    From: [hidden email] [[hidden email]] on behalf of Philip Bennefall [[hidden email]]
    Sent: Thursday, June 14, 2012 1:32 PM
    To: General Discussion of SQLite Database
    Subject: EXT :Re: [sqlite] Full text search without full phrase matches


    Hi Simon,

    The ordering is not really the issue I am having. That, I can do if I just
    get a result back that doesn't necessarily match all the keywords. In the
    query you showed as an example, all the keywords would still have to match
    in order for a row to be returned. The sorting is a separate problem that is
    not really that difficult once I get a smaller dataset. Then I can order it
    manually. The problem is that it only returns a match if every single word
    is present. I would like it to return matches if, say, mor than 2 or 3 of
    the specified keywords are found.

    Kind regards,

    Philip Bennefall
    ----- Original Message -----
    From: "Simon Slavin" <[hidden email]>
    To: "General Discussion of SQLite Database" <[hidden email]>
    Sent: Thursday, June 14, 2012 8:24 PM
    Subject: Re: [sqlite] Full text search without full phrase matches



    On 14 Jun 2012, at 7:13pm, Philip Bennefall <[hidden email]> wrote:

    > That is unfortunate, if it is true that there's no way to accomplish this
    > with SqLite. To do just plain matching I can use an unordered hash map, so
    > I wouldn't need a database for that. The trouble with a string distance
    > function is that I can't really process the entire dataset with it. SqLite
    > technically has all the features I'm after, I just don't want it to
    > necessarily match all the words in a query. If I can get it to match all
    > as well as some, that would be enough. I could then do distancing on a
    > considerably smaller dataset which would be the result of the broader
    > SqLite search.
    >
    > So I guess my main question is, is there absolutely no way to match a
    > subset of the words in a query?

    Well, you could write that string distance function and add it to your copy
    of SQLite as an external function.  Then you could do things like

    SELECT string_distance(theText, 'this new piece of text'), theText FROM
    oldChats WHERE string_distance(theText, 'this new piece of text') < 10 ORDER
    BY string_distance(theText, 'this new piece of text')

    (I don't know whether SQLite will optimise that to avoid executing the same
    function many times, or whether you can name a column and use that name to
    do the same thing yourself.)

    Here's the documentation for external functions:

    <http://www.sqlite.org/c3ref/create_function.html>

    Simon.
    _______________________________________________
    sqlite-users mailing list
    [hidden email]
    http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users 

    _______________________________________________
    sqlite-users mailing list
    [hidden email]
    http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
  _______________________________________________
  sqlite-users mailing list
  [hidden email]
  http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users