fts5 giving results for substring searches for Hindi content.


raj Singla
Hi,

-- create fts4 and fts5 tables
create virtual table idx4 using "fts4" (content);
create virtual table idx5 using "fts5" (content);

-- insert 1 sample row into each
insert into idx4 (content) values
  ('नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?');
insert into idx5 (content) values
  ('नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?');

-- query index using complete and partial strings
select * from idx4 where idx4 match 'पाकिस्तान';
-- returns नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?
select * from idx4 where idx4 match 'पाकि';
-- no results returned
select * from idx5 where idx5 match 'पाकिस्तान';
-- returns नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?
select * from idx5 where idx5 match 'पाकि';
-- returns नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?


FTS5 returns results for substring searches on Hindi content, while FTS4
does not. Is this expected behavior?
Could you please provide more insight into this? Maybe this is just an
experimental feature.

Thank You,
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: fts5 giving results for substring searches for Hindi content.

Clemens Ladisch
raj Singla wrote:
> create virtual table idx4 using "fts4" (content);
> create virtual table idx5 using "fts5" (content);
> ...
> select * from idx4 where idx4 match 'पाकि';-- no results returned
> select * from idx5 where idx5 match 'पाकि';-- returns नीरजा भनोट के

FTS4 and FTS5 have different defaults for the tokenizer:
http://www.sqlite.org/fts3.html#tokenizer
http://www.sqlite.org/fts5.html#tokenizers
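
To illustrate the FTS4 side of this difference, here is a rough Python sketch of the default FTS4 "simple" tokenizer as the FTS3/4 documentation describes it: ASCII alphanumerics and every character with a code point of 128 or above count as token characters, and ASCII letters are lowercased. The function name simple_tokenize is made up for this illustration, and the code is a simplified model, not SQLite's implementation.

```python
def simple_tokenize(text):
    """Simplified model of the FTS3/4 "simple" tokenizer (illustration only)."""
    tokens, current = [], []
    for ch in text:
        is_ascii = ord(ch) < 128
        if (is_ascii and ch.isalnum()) or not is_ascii:
            # Token character: ASCII letters are folded to lower case,
            # all non-ASCII characters are kept unchanged.
            current.append(ch.lower() if is_ascii else ch)
        elif current:
            # Any other ASCII character ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


doc = "नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?"
doc_tokens = simple_tokenize(doc)

print("पाकिस्तान" in doc_tokens)  # the full word is a single token
print("पाकि" in doc_tokens)       # the partial string is not a token
```

Under this model every Devanagari word stays in one piece, which would be consistent with idx4 matching only the complete word.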


Regards,
Clemens
Re: fts5 giving results for substring searches for Hindi content.

Dan Kennedy-4
On 02/04/2018 11:39 AM, raj Singla wrote:

> Hi,
>
> -- create fts4 and fts5 tables
> create virtual table idx4 using "fts4" (content);
> create virtual table idx5 using "fts5" (content);
> -- insert 1 sample rows into each
> insert into idx4 (content) values
>   ('नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?');
> insert into idx5 (content) values
>   ('नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?');
> -- query index using complete and partial strings
> select * from idx4 where idx4 match 'पाकिस्तान';
> -- returns नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?
> select * from idx4 where idx4 match 'पाकि';
> -- no results returned
> select * from idx5 where idx5 match 'पाकिस्तान';
> -- returns नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?
> select * from idx5 where idx5 match 'पाकि';
> -- returns नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?
>
>
> fts5 giving results for substring searches for Hindi content.
> Is this expected behavior.
> Please if you can provide more insights on this. Maybe this is just an
> experimental feature.

By default, FTS5 uses a Unicode tokenizer based on data extracted from
the reference file "UnicodeData.txt":

http://www.unicode.org/Public/6.1.0/ucd/UnicodeData.txt

This file assigns each character to a category:

   http://www.fileformat.info/info/unicode/category/index.htm

FTS5 considers categories "Co", "L*" and "N*" to be token characters and
all others to be separator characters (handled in the same way as spaces).

The string "पाकिस्तान" contains 9 characters, 4 of which are from the
"Mn" and "Mc" categories, specifically 0x93E, 0x93F, 0x94D and a second
0x93E. According to UnicodeData.txt, these characters are:

   093E;DEVANAGARI VOWEL SIGN AA;Mc;
   093F;DEVANAGARI VOWEL SIGN I;Mc;
   094D;DEVANAGARI SIGN VIRAMA;Mn;
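
The breakdown above can be checked with Python's unicodedata module. This is only a sketch: Python ships a newer Unicode database than the 6.1.0 file linked above (the categories of these Devanagari characters should be the same), and is_token_char is a hypothetical helper that restates the "Co", "L*", "N*" rule from this message, not an FTS5 API.

```python
import unicodedata

word = "पाकिस्तान"


def is_token_char(ch):
    # Rule as described in this message: categories Co, L* and N* are
    # token characters; everything else is treated as a separator.
    cat = unicodedata.category(ch)
    return cat == "Co" or cat[0] in "LN"


# One row per code point: hex value, character name, category, token flag.
rows = [
    (f"{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch), is_token_char(ch))
    for ch in word
]

for code, name, cat, is_tok in rows:
    print(code, name, cat, "token" if is_tok else "separator")

# Code points that the rule above treats as separators.
separators = [code for code, _, _, is_tok in rows if not is_tok]
print(len(rows), "code points,", len(separators), "separators")
```

The separators list holds one entry per occurrence, so 0x93E appears in it twice.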

And so the string is being split into several different words (5, in
fact, because each of the four combining marks, including both instances
of 0x93E, acts as a separator). Given your report, I'm guessing that is
not what people expect. Can you, or any other Hindi speaker, confirm that
"पाकिस्तान" should be treated as a single word by FTS5, and not broken
into several different words?
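
To make the splitting concrete, here is a Python sketch of the category rule described above, applied to both the indexed word and the partial query string. It assumes, as the FTS5 documentation describes for strings and barewords, that a query term which tokenizes into several tokens is matched as a phrase. The names unicode_tokenize and contains_phrase are invented for this illustration, and case folding and diacritic removal are left out. It is a simplified model of the behavior reported in this thread and may not match every SQLite version.

```python
import unicodedata


def is_token_char(ch):
    # Categories Co, L* and N* are token characters; all others separate.
    cat = unicodedata.category(ch)
    return cat == "Co" or cat[0] in "LN"


def unicode_tokenize(text):
    """Split text into runs of token characters (simplified model)."""
    tokens, current = [], []
    for ch in text:
        if is_token_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def contains_phrase(doc_tokens, query_tokens):
    """True if query_tokens occur as consecutive tokens in doc_tokens."""
    n = len(query_tokens)
    return n > 0 and any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


doc = "नीरजा भनोट के कातिल पाकिस्तान की जेल में थे, फिर वे एफबीआई आए?"
word_tokens = unicode_tokenize("पाकिस्तान")  # five single-letter tokens
query_tokens = unicode_tokenize("पाकि")      # the first two of those letters

print(word_tokens)
print(query_tokens)
print(contains_phrase(unicode_tokenize(doc), query_tokens))
```

Under this model the query 'पाकि' becomes a two-token phrase whose tokens also sit next to each other in the indexed row, which would explain why the idx5 query looks like a substring match.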

Dan.