[Spellfix] Searching for short words is very slow

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

[Spellfix] Searching for short words is very slow

Philip Bennefall
Hi all,

I have been running some tests with spellfix using a table containing
about 300000 words, extracted from the Moby project's single word list
as well as names and places. Moby can be found at:
http://icon.shef.ac.uk/Moby/

I have noticed that searching for medium length to very long words is
very fast, but when I start searching for short words like "hi" and
"bye", the search time skyrockets. I think the longest search time was
about 1500 milliseconds (the average is somewhere around the 500 ms
mark). My table is set up as follows:

create virtual table if not exists dictionary using
spellfix1(edit_cost_table=editcosts);

When searching, I specify top=5 to get these timings.

Is there anything I can tweak to speed up the search for short words, or
is there anything that can be done by the developers to optimize this
further?

Kind regards,

Philip Bennefall
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [Spellfix] Searching for short words is very slow

Philip Bennefall
I have to amend my last message. The timings I just gave was for looking
up that word 10 times, not 1. So the longest time I've seen would be
about 150 ms. However, if you have a document with a few thousand words
we would still be looking at a significant total searching time. Is this
to be expected?

Kind regards,

Philip Bennefall
On 7/23/2014 11:57 PM, Philip Bennefall wrote:

> Hi all,
>
> I have been running some tests with spellfix using a table containing
> about 300000 words, extracted from the Moby project's single word list
> as well as names and places. Moby can be found at:
> http://icon.shef.ac.uk/Moby/
>
> I have noticed that searching for medium length to very long words is
> very fast, but when I start searching for short words like "hi" and
> "bye", the search time skyrockets. I think the longest search time was
> about 1500 milliseconds (the average is somewhere around the 500 ms
> mark). My table is set up as follows:
>
> create virtual table if not exists dictionary using
> spellfix1(edit_cost_table=editcosts);
>
> When searching, I specify top=5 to get these timings.
>
> Is there anything I can tweak to speed up the search for short words, or
> is there anything that can be done by the developers to optimize this
> further?
>
> Kind regards,
>
> Philip Bennefall
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [Spellfix] Searching for short words is very slow

Richard Hipp-3
On Wed, Jul 23, 2014 at 6:18 PM, Philip Bennefall <[hidden email]>
wrote:

> I have to amend my last message. The timings I just gave was for looking
> up that word 10 times, not 1. So the longest time I've seen would be about
> 150 ms. However, if you have a document with a few thousand words we would
> still be looking at a significant total searching time. Is this to be
> expected?
>

There is no expectation.

Spellfix is an experiment in doing fuzzy matching.  It was designed for a
specific customer who is doing spell-checking in real-time, as the text is
being entered.  Spellfix works way faster than the end user can enter text,
so performance is not an issue in its original purpose.

Perhaps you are using spellfix in a different way?  You are welcomed to do
so.  If you want to contribute ideas on how to improve spellfix for use in
different scenarios, we will welcome your input.

There are comments in the code explaining how spellfix works.  Please
review the principles of operation and then perhaps run a performance
analysis using gprof or cachegrind.  Then describe exactly what you are
doing and why it isn't working out for you and perhaps we can help.



--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: [Spellfix] Searching for short words is very slow

Philip Bennefall
Hi Richard,

My application is basically just to take a text file as a command line
argument and run the spellchecker on it, showing an alert for each word
that is not found in the dictionary and giving the user some options.

After a bit of experimentation I concluded that one way to speed things
up is to store the entire dictionary in memory as a hash map and look
for exact matches. Only when an exact match isn't found do I fall back
to the spellfix table. This allowed me to scan a document with just over
86000 words in less than 500 milliseconds, which is more than acceptable
for my needs. Certainly not ideal if you aren't on a workstation, but
it's a reasonable tradeoff if memory is not an issue.

Perhaps something similar could be done in the spellfix table itself?
Have an indexed integer column containing a crc32 or similar for each
word in the dictionary so that we can look for exact matches very
quickly. We only fall back to the fuzzy search if no match is found. Can
you see any obvious drawbacks with this? If not, I'd like to put this
optimization forth as an initial suggestion. I'll write again if I can
think of anything else after reading the code more thoroughly.

Kind regards,

Philip Bennefall
On 7/24/2014 12:25 AM, Richard Hipp wrote:

>
>
>
> On Wed, Jul 23, 2014 at 6:18 PM, Philip Bennefall <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     I have to amend my last message. The timings I just gave was for
>     looking up that word 10 times, not 1. So the longest time I've
>     seen would be about 150 ms. However, if you have a document with a
>     few thousand words we would still be looking at a significant
>     total searching time. Is this to be expected?
>
>
> There is no expectation.
>
> Spellfix is an experiment in doing fuzzy matching.  It was designed
> for a specific customer who is doing spell-checking in real-time, as
> the text is being entered.  Spellfix works way faster than the end
> user can enter text, so performance is not an issue in its original
> purpose.
>
> Perhaps you are using spellfix in a different way?  You are welcomed
> to do so.  If you want to contribute ideas on how to improve spellfix
> for use in different scenarios, we will welcome your input.
>
> There are comments in the code explaining how spellfix works.  Please
> review the principles of operation and then perhaps run a performance
> analysis using gprof or cachegrind.  Then describe exactly what you
> are doing and why it isn't working out for you and perhaps we can help.
>
>
>
> --
> D. Richard Hipp
> [hidden email] <mailto:[hidden email]>

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users