Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

邱朗
Hi,
I had thought Unicode61 Tokenizer can support CJK -- Chinese Japanese Korean I verify my sqlite supports fts5


sqlite> pragma compile_options;
BUG_COMPATIBLE_20160819
COMPILER=clang-9.0.0
DEFAULT_CACHE_SIZE=2000
DEFAULT_CKPTFULLFSYNC
DEFAULT_JOURNAL_SIZE_LIMIT=32768
DEFAULT_PAGE_SIZE=4096
DEFAULT_SYNCHRONOUS=2
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
ENABLE_COLUMN_METADATA
ENABLE_DBSTAT_VTAB
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS3_TOKENIZER
ENABLE_FTS4
ENABLE_FTS5


sqlite> CREATE VIRTUAL TABLE ft5_test USING fts5(content, tokenize = 'porter unicode61 remove_diacritics 1');
sqlite> INSERT INTO ft5_test values('为什么不支持中文 fts5 does not seem to work for chinese');
sqlite> select * from ft5_test where ft5_test = '中文';
sqlite>
sqlite> select * from ft5_test where ft5_test = 'Chinese';
为什么不支持中文 fts5 does not seem to work for chinese


But to my surprise it can't find any CJK word at all. Why is that ?
Thanks,
Qiulang
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

Scott Robison-2
On Thu, Sep 20, 2018, 8:21 PM 邱朗 <[hidden email]> wrote:

> Hi,
> I had thought Unicode61 Tokenizer can support CJK -- Chinese Japanese
> Korean I verify my sqlite supports fts5
>
> {snipped}
>
> But to my surprise it can't find any CJK word at all. Why is that ?


Based on my experience with such things, I suspect that the tokenizer
requires whitespace between adjacent words, which is not the case with CJK.
Word breaks are implicit, not explicit.

Is the Unicode61 tokenizer based on ICU? I had to implement an algorithm
for software at work that used functionality from ICU to find CJK word
boundaries, so I believe it is possible, just not as straightforward as
whitespace delimited words.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

邱朗
https://www.sqlite.org/fts5.html said " The unicode tokenizer classifies all unicode characters as either "separator" or "token" characters. By default all space and punctuation characters, as defined by Unicode 6.1, are considered separators, and all other characters as token characters... "  I really doubt unicode tokenizer requires white space, that is ascii tokenizer.


That was why I thought it might work for CJK.


Qiulang


At 2018-09-21 13:03:54, "Scott Robison" <[hidden email]> wrote:

>On Thu, Sep 20, 2018, 8:21 PM 邱朗 <[hidden email]> wrote:
>
>> Hi,
>> I had thought Unicode61 Tokenizer can support CJK -- Chinese Japanese
>> Korean I verify my sqlite supports fts5
>>
>> {snipped}
>>
>> But to my surprise it can't find any CJK word at all. Why is that ?
>
>
>Based on my experience with such things, I suspect that the tokenizer
>requires whitespace between adjacent words, which is not the case with CJK.
>Word breaks are implicit, not explicit.
>
>Is the Unicode61 tokenizer based on ICU? I had to implement an algorithm
>for software at work that used functionality from ICU to find CJK word
>boundaries, so I believe it is possible, just not as straightforward as
>whitespace delimited words.
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

Scott Robison-2
On Fri, Sep 21, 2018 at 12:02 AM 邱朗 <[hidden email]> wrote:
>
> https://www.sqlite.org/fts5.html said " The unicode tokenizer classifies all unicode characters as either "separator" or "token" characters. By default all space and punctuation characters, as defined by Unicode 6.1, are considered separators, and all other characters as token characters... "  I really doubt unicode tokenizer requires white space, that is ascii tokenizer.

Forgive my imprecise use of language. I should have said separators
instead of whitespace. Regardless, CJK uses implicit separation
between words, and that description seems to indicate that the unicode
tokenizer expects explicit separators (be they whitespace or
punctuation or something else) between tokens.

> That was why I thought it might work for CJK.

I think it could be made to work, or at least, I have experience
making it work with CJK based on functionality exposed via ICU. I
don't know if the unicode tokenizer uses ICU or if the functionality
in ICU that I used is available in the unicode tables. Not
understanding any of the languages represented by CJK, I can't say
with any confidence how good my solution was, but it seemed to be good
enough for the use case of my management and customers in the impacted
regions.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

邱朗
>
>I think it could be made to work, or at least, I have experience
>making it work with CJK based on functionality exposed via ICU. I
>don't know if the unicode tokenizer uses ICU or if the functionality
>in ICU that I used is available in the unicode tables. Not
>understanding any of the languages represented by CJK, I can't say
>with any confidence how good my solution was, but it seemed to be good
>enough for the use case of my management and customers in the impacted
>regions.
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

I am Chinese and I know a little bit of Korean, I can help to test your product :D  All Jokes aside I also tried to build an ICU SQlite macOS version but I failed. All the document I googled seem outdated. e.g. I used this (and other solutions) but I just can not build a macOS version. Do you have any experience for that ?


./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"


Thanks,
Qiulang
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

Scott Robison-2
On Fri, Sep 21, 2018 at 12:39 AM 邱朗 <[hidden email]> wrote:

>
> >I think it could be made to work, or at least, I have experience
> >making it work with CJK based on functionality exposed via ICU. I
> >don't know if the unicode tokenizer uses ICU or if the functionality
> >in ICU that I used is available in the unicode tables. Not
> >understanding any of the languages represented by CJK, I can't say
> >with any confidence how good my solution was, but it seemed to be good
> >enough for the use case of my management and customers in the impacted
> >regions.
>
> I am Chinese and I know a little bit of Korean, I can help to test your product :D  All Jokes aside I also tried to build an ICU SQlite macOS version but I failed. All the document I googled seem outdated. e.g. I used this (and other solutions) but I just can not build a macOS version. Do you have any experience for that ?
>
>
> ./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"

I do not have macOS experience, and I don't have a lot of free time
either. Mainly I was trying to explain what I had done in the past and
what I used. Hopefully someone else will chime in with more details
about the Unicode61 tokenizer and whether it is standalone or depends
on ICU.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

Dan Kennedy-4
In reply to this post by 邱朗
On 09/21/2018 01:38 PM, 邱朗 wrote:

>>
>> I think it could be made to work, or at least, I have experience
>> making it work with CJK based on functionality exposed via ICU. I
>> don't know if the unicode tokenizer uses ICU or if the functionality
>> in ICU that I used is available in the unicode tables. Not
>> understanding any of the languages represented by CJK, I can't say
>> with any confidence how good my solution was, but it seemed to be good
>> enough for the use case of my management and customers in the impacted
>> regions.
>> _______________________________________________
>> sqlite-users mailing list
>> [hidden email]
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
> I am Chinese and I know a little bit of Korean, I can help to test your product :D  All Jokes aside I also tried to build an ICU SQlite macOS version but I failed. All the document I googled seem outdated. e.g. I used this (and other solutions) but I just can not build a macOS version. Do you have any experience for that ?
>
>
> ./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"




Can you post the complete output of the failed build attempt? Thanks.

Dan.



>
>
> Thanks,
> Qiulang
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

邱朗
Hi,

Thanks for replying my question. Following are the error I got when compiling sqlite-autoconf-3250100.tar.gz . The error looks similar to this old discussion  
http://sqlite.1065341.n5.nabble.com/compiling-Sqlite-with-ICU-td40641.html


I am using macOS 10.13 & xcode 10


Undefined symbols for architecture x86_64:
  "_u_errorName_62", referenced from:
      _icuFunctionError in sqlite3.o
  "_u_foldCase_62", referenced from:
      _icuOpen in sqlite3.o
      _icuLikeCompare in sqlite3.o
  "_u_isspace_62", referenced from:
      _icuNext in sqlite3.o
  "_u_strToLower_62", referenced from:
      _icuCaseFunc16 in sqlite3.o
  "_u_strToUTF8_62", referenced from:
      _icuNext in sqlite3.o
  "_u_strToUpper_62", referenced from:
      _icuCaseFunc16 in sqlite3.o
  "_ubrk_close_62", referenced from:
      _icuClose in sqlite3.o
  "_ubrk_current_62", referenced from:
      _icuNext in sqlite3.o
  "_ubrk_first_62", referenced from:
      _icuOpen in sqlite3.o
  "_ubrk_next_62", referenced from:
      _icuNext in sqlite3.o
  "_ubrk_open_62", referenced from:
      _icuOpen in sqlite3.o
  "_ucol_close_62", referenced from:
      _icuLoadCollation in sqlite3.o
      _icuCollationDel in sqlite3.o
  "_ucol_open_62", referenced from:
      _icuLoadCollation in sqlite3.o
  "_ucol_strcoll_62", referenced from:
      _icuCollationColl in sqlite3.o
  "_uregex_close_62", referenced from:
      _icuRegexpDelete in sqlite3.o
  "_uregex_matches_62", referenced from:
      _icuRegexpFunc in sqlite3.o
  "_uregex_open_62", referenced from:
      _icuRegexpFunc in sqlite3.o
  "_uregex_setText_62", referenced from:
      _icuRegexpFunc in sqlite3.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [libsqlite3.la] Error 1




在 2018-09-21 17:32:38,"Dan Kennedy" <[hidden email]> 写道:

>On 09/21/2018 01:38 PM, 邱朗 wrote:
>>>
>>> I think it could be made to work, or at least, I have experience
>>> making it work with CJK based on functionality exposed via ICU. I
>>> don't know if the unicode tokenizer uses ICU or if the functionality
>>> in ICU that I used is available in the unicode tables. Not
>>> understanding any of the languages represented by CJK, I can't say
>>> with any confidence how good my solution was, but it seemed to be good
>>> enough for the use case of my management and customers in the impacted
>>> regions.
>>> _______________________________________________
>>> sqlite-users mailing list
>>> [hidden email]
>>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>
>> I am Chinese and I know a little bit of Korean, I can help to test your product :D  All Jokes aside I also tried to build an ICU SQlite macOS version but I failed. All the document I googled seem outdated. e.g. I used this (and other solutions) but I just can not build a macOS version. Do you have any experience for that ?
>>
>>
>> ./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"
>
>Can you post the complete output of the failed build attempt? Thanks.
>
>Dan.
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

Dan Kennedy-4
On 09/21/2018 05:21 PM, 邱朗 wrote:
> Hi,
>
> Thanks for replying my question. Following are the error I got when compiling sqlite-autoconf-3250100.tar.gz . The error looks similar to this old discussion
> http://sqlite.1065341.n5.nabble.com/compiling-Sqlite-with-ICU-td40641.html
>
>
> I am using macOS 10.13 & xcode 10

The text below is just the error. If you post the compiler commands that
appear before it in the build log somebody might be able to spot the
problem.

 From the error message, it may be that you have mismatched ICU header
and library files, or it may be that not all required ICU libraries are
being linked. If you remove the -I... and -L... switches from your
command line does it make any difference?

Dan.



>
>
> Undefined symbols for architecture x86_64:
>   "_u_errorName_62", referenced from:
>       _icuFunctionError in sqlite3.o
>   "_u_foldCase_62", referenced from:
>       _icuOpen in sqlite3.o
>       _icuLikeCompare in sqlite3.o
>   "_u_isspace_62", referenced from:
>       _icuNext in sqlite3.o
>   "_u_strToLower_62", referenced from:
>       _icuCaseFunc16 in sqlite3.o
>   "_u_strToUTF8_62", referenced from:
>       _icuNext in sqlite3.o
>   "_u_strToUpper_62", referenced from:
>       _icuCaseFunc16 in sqlite3.o
>   "_ubrk_close_62", referenced from:
>       _icuClose in sqlite3.o
>   "_ubrk_current_62", referenced from:
>       _icuNext in sqlite3.o
>   "_ubrk_first_62", referenced from:
>       _icuOpen in sqlite3.o
>   "_ubrk_next_62", referenced from:
>       _icuNext in sqlite3.o
>   "_ubrk_open_62", referenced from:
>       _icuOpen in sqlite3.o
>   "_ucol_close_62", referenced from:
>       _icuLoadCollation in sqlite3.o
>       _icuCollationDel in sqlite3.o
>   "_ucol_open_62", referenced from:
>       _icuLoadCollation in sqlite3.o
>   "_ucol_strcoll_62", referenced from:
>       _icuCollationColl in sqlite3.o
>   "_uregex_close_62", referenced from:
>       _icuRegexpDelete in sqlite3.o
>   "_uregex_matches_62", referenced from:
>       _icuRegexpFunc in sqlite3.o
>   "_uregex_open_62", referenced from:
>       _icuRegexpFunc in sqlite3.o
>   "_uregex_setText_62", referenced from:
>       _icuRegexpFunc in sqlite3.o
> ld: symbol(s) not found for architecture x86_64
> clang: error: linker command failed with exit code 1 (use -v to see invocation)
> make: *** [libsqlite3.la] Error 1
>
>
>
>
> 在 2018-09-21 17:32:38,"Dan Kennedy" <[hidden email]> 写道:
>> On 09/21/2018 01:38 PM, 邱朗 wrote:
>>>>
>>>> I think it could be made to work, or at least, I have experience
>>>> making it work with CJK based on functionality exposed via ICU. I
>>>> don't know if the unicode tokenizer uses ICU or if the functionality
>>>> in ICU that I used is available in the unicode tables. Not
>>>> understanding any of the languages represented by CJK, I can't say
>>>> with any confidence how good my solution was, but it seemed to be good
>>>> enough for the use case of my management and customers in the impacted
>>>> regions.
>>>> _______________________________________________
>>>> sqlite-users mailing list
>>>> [hidden email]
>>>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>>
>>> I am Chinese and I know a little bit of Korean, I can help to test your product :D  All Jokes aside I also tried to build an ICU SQlite macOS version but I failed. All the document I googled seem outdated. e.g. I used this (and other solutions) but I just can not build a macOS version. Do you have any experience for that ?
>>>
>>>
>>> ./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"
>>
>> Can you post the complete output of the failed build attempt? Thanks.
>>
>> Dan.
>>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

邱朗
I actually first used  ./configure CFLAGS="-DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="`icu-config --ldflags`"  But I got the error


sqlite3.c:184184:10: fatal error: 'unicode/utypes.h' file not found
#include <unicode/utypes.h>


Then I added -I -L switches and if I remembered correct I used brew to install icu4c. The compiler command are  these


qiulangs-MacBook-Pro:sqlite-autoconf-3250100 qiulang$ make
/bin/sh ./libtool  --tag=CC   --mode=compile gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\" -DPACKAGE_VERSION=\"3.25.1\" -DPACKAGE_STRING=\"sqlite\ 3.25.1\" -DPACKAGE_BUGREPORT=\"http://www.sqlite.org\" -DPACKAGE_URL=\"\" -DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 -DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I.    -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB  -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU  -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c -o sqlite3.lo sqlite3.c
libtool: compile:  gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\" -DPACKAGE_VERSION=\"3.25.1\" "-DPACKAGE_STRING=\"sqlite 3.25.1\"" -DPACKAGE_BUGREPORT=\"http://www.sqlite.org\" -DPACKAGE_URL=\"\" -DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 -DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I. -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c sqlite3.c  -fno-common -DPIC -o .libs/sqlite3.o
libtool: compile:  gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\" -DPACKAGE_VERSION=\"3.25.1\" "-DPACKAGE_STRING=\"sqlite 3.25.1\"" -DPACKAGE_BUGREPORT=\"http://www.sqlite.org\" -DPACKAGE_URL=\"\" -DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 -DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I. -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c sqlite3.c -o sqlite3.o >/dev/null 2>&1
mv -f .deps/sqlite3.Tpo .deps/sqlite3.Plo
/bin/sh ./libtool  --tag=CC   --mode=link gcc -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB  -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU  -no-undefined -version-info 8:6:8 -L/usr/local/opt/icu4c/lib  -o libsqlite3.la -rpath /usr/local/lib sqlite3.lo  -lz
libtool: link: gcc -dynamiclib  -o .libs/libsqlite3.0.dylib  .libs/sqlite3.o   -L/usr/local/opt/icu4c/lib -lz    -install_name  /usr/local/lib/libsqlite3.0.dylib -compatibility_version 9 -current_version 9.6 -Wl,-single_module
Undefined symbols for architecture x86_64:
  "_u_errorName_62", referenced from:
      _icuFunctionError in sqlite3.o
  "_u_foldCase_62", referenced from:
...





At 2018-09-21 21:52:30, "Dan Kennedy" <[hidden email]> wrote:

>On 09/21/2018 05:21 PM, 邱朗 wrote:
>> Hi,
>>
>> Thanks for replying my question. Following are the error I got when compiling sqlite-autoconf-3250100.tar.gz . The error looks similar to this old discussion
>> http://sqlite.1065341.n5.nabble.com/compiling-Sqlite-with-ICU-td40641.html
>>
>>
>> I am using macOS 10.13 & xcode 10
>
>The text below is just the error. If you post the compiler commands that
>appear before it in the build log somebody might be able to spot the
>problem.
>
> From the error message, it may be that you have mismatched ICU header
>and library files, or it may be that not all required ICU libraries are
>being linked. If you remove the -I... and -L... switches from your
>command line does it make any difference?
>
>Dan.
>
>
>
>>
>>
>> Undefined symbols for architecture x86_64:
>>   "_u_errorName_62", referenced from:
>>       _icuFunctionError in sqlite3.o
>>   "_u_foldCase_62", referenced from:
>>       _icuOpen in sqlite3.o
>>       _icuLikeCompare in sqlite3.o
>>   "_u_isspace_62", referenced from:
>>       _icuNext in sqlite3.o
>>   "_u_strToLower_62", referenced from:
>>       _icuCaseFunc16 in sqlite3.o
>>   "_u_strToUTF8_62", referenced from:
>>       _icuNext in sqlite3.o
>>   "_u_strToUpper_62", referenced from:
>>       _icuCaseFunc16 in sqlite3.o
>>   "_ubrk_close_62", referenced from:
>>       _icuClose in sqlite3.o
>>   "_ubrk_current_62", referenced from:
>>       _icuNext in sqlite3.o
>>   "_ubrk_first_62", referenced from:
>>       _icuOpen in sqlite3.o
>>   "_ubrk_next_62", referenced from:
>>       _icuNext in sqlite3.o
>>   "_ubrk_open_62", referenced from:
>>       _icuOpen in sqlite3.o
>>   "_ucol_close_62", referenced from:
>>       _icuLoadCollation in sqlite3.o
>>       _icuCollationDel in sqlite3.o
>>   "_ucol_open_62", referenced from:
>>       _icuLoadCollation in sqlite3.o
>>   "_ucol_strcoll_62", referenced from:
>>       _icuCollationColl in sqlite3.o
>>   "_uregex_close_62", referenced from:
>>       _icuRegexpDelete in sqlite3.o
>>   "_uregex_matches_62", referenced from:
>>       _icuRegexpFunc in sqlite3.o
>>   "_uregex_open_62", referenced from:
>>       _icuRegexpFunc in sqlite3.o
>>   "_uregex_setText_62", referenced from:
>>       _icuRegexpFunc in sqlite3.o
>> ld: symbol(s) not found for architecture x86_64
>> clang: error: linker command failed with exit code 1 (use -v to see invocation)
>> make: *** [libsqlite3.la] Error 1
>>
>>
>>
>>
>> 在 2018-09-21 17:32:38,"Dan Kennedy" <[hidden email]> 写道:
>>> On 09/21/2018 01:38 PM, 邱朗 wrote:
>>>>>
>>>>> I think it could be made to work, or at least, I have experience
>>>>> making it work with CJK based on functionality exposed via ICU. I
>>>>> don't know if the unicode tokenizer uses ICU or if the functionality
>>>>> in ICU that I used is available in the unicode tables. Not
>>>>> understanding any of the languages represented by CJK, I can't say
>>>>> with any confidence how good my solution was, but it seemed to be good
>>>>> enough for the use case of my management and customers in the impacted
>>>>> regions.
>>>>> _______________________________________________
>>>>> sqlite-users mailing list
>>>>> [hidden email]
>>>>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>>>
>>>> I am Chinese and I know a little bit of Korean, I can help to test your product :D  All Jokes aside I also tried to build an ICU SQlite macOS version but I failed. All the document I googled seem outdated. e.g. I used this (and other solutions) but I just can not build a macOS version. Do you have any experience for that ?
>>>>
>>>>
>>>> ./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"
>>>
>>> Can you post the complete output of the failed build attempt? Thanks.
>>>
>>> Dan.
>>>
>> _______________________________________________
>> sqlite-users mailing list
>> [hidden email]
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>
>
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

Dan Kennedy-4
On 09/21/2018 09:44 PM, 邱朗 wrote:
> I actually first used  ./configure CFLAGS="-DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="`icu-config --ldflags`"  But I got the error

When you ran this configure command, is the first line out output
something like the following?

   bash: icu-config: command not found

Is [icu-config] actually in your path? And if so, what does the
[icu-config --ldflags] command return?

Dan.




>
>
> sqlite3.c:184184:10: fatal error: 'unicode/utypes.h' file not found
> #include <unicode/utypes.h>
>
>
> Then I added -I -L switches and if I remembered correct I used brew to install icu4c. The compiler command are  these
>
>
> qiulangs-MacBook-Pro:sqlite-autoconf-3250100 qiulang$ make
> /bin/sh ./libtool  --tag=CC   --mode=compile gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\" -DPACKAGE_VERSION=\"3.25.1\" -DPACKAGE_STRING=\"sqlite\ 3.25.1\" -DPACKAGE_BUGREPORT=\"http://www.sqlite.org\" -DPACKAGE_URL=\"\" -DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 -DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I.    -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB  -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU  -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c -o sqlite3.lo sqlite3.c
> libtool: compile:  gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\" -DPACKAGE_VERSION=\"3.25.1\" "-DPACKAGE_STRING=\"sqlite 3.25.1\"" -DPACKAGE_BUGREPORT=\"http://www.sqlite.org\" -DPACKAGE_URL=\"\" -DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 -DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I. -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c sqlite3.c  -fno-common -DPIC -o .libs/sqlite3.o
> libtool: compile:  gcc -DPACKAGE_NAME=\"sqlite\" -DPACKAGE_TARNAME=\"sqlite\" -DPACKAGE_VERSION=\"3.25.1\" "-DPACKAGE_STRING=\"sqlite 3.25.1\"" -DPACKAGE_BUGREPORT=\"http://www.sqlite.org\" -DPACKAGE_URL=\"\" -DPACKAGE=\"sqlite\" -DVERSION=\"3.25.1\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_EDITLINE_READLINE_H=1 -DHAVE_READLINE_READLINE_H=1 -DHAVE_READLINE=1 -DHAVE_ZLIB_H=1 -I. -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c sqlite3.c -o sqlite3.o >/dev/null 2>&1
> mv -f .deps/sqlite3.Tpo .deps/sqlite3.Plo
> /bin/sh ./libtool  --tag=CC   --mode=link gcc -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS4 -DSQLITE_ENABLE_FTS5 -DSQLITE_ENABLE_JSON1 -DSQLITE_ENABLE_RTREE -DSQLITE_HAVE_ZLIB  -I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU  -no-undefined -version-info 8:6:8 -L/usr/local/opt/icu4c/lib  -o libsqlite3.la -rpath /usr/local/lib sqlite3.lo  -lz
> libtool: link: gcc -dynamiclib  -o .libs/libsqlite3.0.dylib  .libs/sqlite3.o   -L/usr/local/opt/icu4c/lib -lz    -install_name  /usr/local/lib/libsqlite3.0.dylib -compatibility_version 9 -current_version 9.6 -Wl,-single_module
> Undefined symbols for architecture x86_64:
>   "_u_errorName_62", referenced from:
>       _icuFunctionError in sqlite3.o
>   "_u_foldCase_62", referenced from:
> ...
>
>
>
>
>
> At 2018-09-21 21:52:30, "Dan Kennedy" <[hidden email]> wrote:
>> On 09/21/2018 05:21 PM, 邱朗 wrote:
>>> Hi,
>>>
>>> Thanks for replying my question. Following are the error I got when compiling sqlite-autoconf-3250100.tar.gz . The error looks similar to this old discussion
>>> http://sqlite.1065341.n5.nabble.com/compiling-Sqlite-with-ICU-td40641.html
>>>
>>>
>>> I am using macOS 10.13 & xcode 10
>>
>> The text below is just the error. If you post the compiler commands that
>> appear before it in the build log somebody might be able to spot the
>> problem.
>>
>> From the error message, it may be that you have mismatched ICU header
>> and library files, or it may be that not all required ICU libraries are
>> being linked. If you remove the -I... and -L... switches from your
>> command line does it make any difference?
>>
>> Dan.
>>
>>
>>
>>>
>>>
>>> Undefined symbols for architecture x86_64:
>>>   "_u_errorName_62", referenced from:
>>>       _icuFunctionError in sqlite3.o
>>>   "_u_foldCase_62", referenced from:
>>>       _icuOpen in sqlite3.o
>>>       _icuLikeCompare in sqlite3.o
>>>   "_u_isspace_62", referenced from:
>>>       _icuNext in sqlite3.o
>>>   "_u_strToLower_62", referenced from:
>>>       _icuCaseFunc16 in sqlite3.o
>>>   "_u_strToUTF8_62", referenced from:
>>>       _icuNext in sqlite3.o
>>>   "_u_strToUpper_62", referenced from:
>>>       _icuCaseFunc16 in sqlite3.o
>>>   "_ubrk_close_62", referenced from:
>>>       _icuClose in sqlite3.o
>>>   "_ubrk_current_62", referenced from:
>>>       _icuNext in sqlite3.o
>>>   "_ubrk_first_62", referenced from:
>>>       _icuOpen in sqlite3.o
>>>   "_ubrk_next_62", referenced from:
>>>       _icuNext in sqlite3.o
>>>   "_ubrk_open_62", referenced from:
>>>       _icuOpen in sqlite3.o
>>>   "_ucol_close_62", referenced from:
>>>       _icuLoadCollation in sqlite3.o
>>>       _icuCollationDel in sqlite3.o
>>>   "_ucol_open_62", referenced from:
>>>       _icuLoadCollation in sqlite3.o
>>>   "_ucol_strcoll_62", referenced from:
>>>       _icuCollationColl in sqlite3.o
>>>   "_uregex_close_62", referenced from:
>>>       _icuRegexpDelete in sqlite3.o
>>>   "_uregex_matches_62", referenced from:
>>>       _icuRegexpFunc in sqlite3.o
>>>   "_uregex_open_62", referenced from:
>>>       _icuRegexpFunc in sqlite3.o
>>>   "_uregex_setText_62", referenced from:
>>>       _icuRegexpFunc in sqlite3.o
>>> ld: symbol(s) not found for architecture x86_64
>>> clang: error: linker command failed with exit code 1 (use -v to see invocation)
>>> make: *** [libsqlite3.la] Error 1
>>>
>>>
>>>
>>>
>>> 在 2018-09-21 17:32:38,"Dan Kennedy" <[hidden email]> 写道:
>>>> On 09/21/2018 01:38 PM, 邱朗 wrote:
>>>>>>
>>>>>> I think it could be made to work, or at least, I have experience
>>>>>> making it work with CJK based on functionality exposed via ICU. I
>>>>>> don't know if the unicode tokenizer uses ICU or if the functionality
>>>>>> in ICU that I used is available in the unicode tables. Not
>>>>>> understanding any of the languages represented by CJK, I can't say
>>>>>> with any confidence how good my solution was, but it seemed to be good
>>>>>> enough for the use case of my management and customers in the impacted
>>>>>> regions.
>>>>>> _______________________________________________
>>>>>> sqlite-users mailing list
>>>>>> [hidden email]
>>>>>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>>>>
>>>>> I am Chinese and I know a little bit of Korean, I can help to test your product :D  All Jokes aside I also tried to build an ICU SQlite macOS version but I failed. All the document I googled seem outdated. e.g. I used this (and other solutions) but I just can not build a macOS version. Do you have any experience for that ?
>>>>>
>>>>>
>>>>> ./configure CFLAGS="-I/usr/local/opt/icu4c/include -DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="-L/usr/local/opt/icu4c/lib `icu-config --ldflags`"
>>>>
>>>> Can you post the complete output of the failed build attempt? Thanks.
>>>>
>>>> Dan.
>>>>
>>> _______________________________________________
>>> sqlite-users mailing list
>>> [hidden email]
>>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>>
>>
>> _______________________________________________
>> sqlite-users mailing list
>> [hidden email]
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

Jens Alfke-2
In reply to this post by 邱朗


> On Sep 20, 2018, at 11:01 PM, 邱朗 <[hidden email]> wrote:
>
> https://www.sqlite.org/fts5.html <https://www.sqlite.org/fts5.html> said " The unicode tokenizer classifies all unicode characters as either "separator" or "token" characters. By default all space and punctuation characters, as defined by Unicode 6.1, are considered separators, and all other characters as token characters... "  I really doubt unicode tokenizer requires white space, that is ascii tokenizer.

Detecting word breaks in many East Asian languages (not just CJK; Thai is another) is a rather difficult task and requires having a non-small database of character sequences to match. I’m sure the SQLite maintainers considered it too large to build into their Unicode tokenizer.

IIRC, ICU can do this, as can special libraries like Mecab.

—Jens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

邱朗
In reply to this post by Dan Kennedy-4
Hi,


It was exactly like you said, my bad, so now I have built an icu version. BUT unfortunately it still does not support CJK, why is that ?


qiulangs-MacBook-Pro:sqlite-autoconf-3250100 qiulang$ ./sqlite3
SQLite version 3.25.1 2018-09-18 20:20:44
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> CREATE VIRTUAL TABLE zh_text USING fts4(text, tokenize=icu zh_CN);
sqlite> INSERT INTO zh_text values('为什么不支持中文 icu does not seem to work for chinese');
sqlite> select * from zh_text where text match 'work';
为什么不支持中文 icu does not seem to work for chinese
sqlite> select * from zh_text where text match '中';
sqlite>


BTW, whoever hit the icu4c error it may be because you make the same mistake as I did. So I first run brew link icu4c, but brew refused, "Warning: Refusing to link macOS-provided software: icu4c", then I forgot to add it to my path :$


If you run brew info icu4c, it will tell you that but actually I didn't set them and compiler still can find them


For compilers to find icu4c you may need to set:
  export LDFLAGS="-L/usr/local/opt/icu4c/lib"
  export CPPFLAGS="-I/usr/local/opt/icu4c/include"


Thanks,
Qiulang
At 2018-09-21 23:43:01, "Dan Kennedy" <[hidden email]> wrote:
>On 09/21/2018 09:44 PM, 邱朗 wrote:
>> I actually first used  ./configure CFLAGS="-DSQLITE_ENABLE_ICU `icu-config --cppflags`" LDFLAGS="`icu-config --ldflags`"  But I got the error
>
>When you ran this configure command, is the first line out output
>something like the following?
>
>   bash: icu-config: command not found
>

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

Hideaki Takahashi
Hello,

full text search index can be used to see how the text is tokenized for
both FTS4 and FTS5.
for FTS4, fts3tokenize can be used too.

sqlite> CREATE VIRTUAL TABLE icu_zh_cn USING fts3tokenize(icu, zh_CN);
sqlite> SELECT token, start, end, position FROM icu_zh_cn WHERE
INPUT='为什么不支持中文 fts5 does not seem to work for chinese';
为什么|0|9|0
不|9|12|1
支持|12|18|2
中文|18|24|3
fts5|25|29|4
does|30|34|5
not|35|38|6
seem|39|43|7
to|44|46|8
work|47|51|9
for|52|55|10
chinese|56|63|11

based on the output above, the query below works and makes sense to me.
sqlite> select * from zh_text where text match '中文';
为什么不支持中文 icu does not seem to work for chinese


FTS5 + unicode61
sqlite> CREATE VIRTUAL TABLE ft5_test USING fts5(content, tokenize =
'porter unicode61 remove_diacritics 1');
sqlite> INSERT INTO ft5_test values('为什么不支持中文 fts5 does not seem to work
for chinese');
sqlite> CREATE VIRTUAL TABLE ft5_test_vocab_i USING fts5vocab(ft5_test,
'instance');
sqlite> SELECT term, doc, col, offset FROM ft5_test_vocab_i;
(snip non-Chinese portion)
为什么不支持中文|1|content|0

FTS4 + ICU(zh_CN)
sqlite> CREATE VIRTUAL TABLE zh_text USING fts4(text, tokenize=icu zh_CN);
sqlite> INSERT INTO zh_text values('为什么不支持中文 icu does not seem to work for
chinese');
sqlite> CREATE VIRTUAL TABLE zh_terms USING fts4aux(zh_text);
sqlite> SELECT term, col, documents FROM zh_terms;
(snip non-Chinese portion)
不|*|1
不|0|1
中文|*|1
中文|0|1
为什么|*|1
为什么|0|1
支持|*|1
支持|0|1

Thanks,
Hideaki

On Sat, Sep 22, 2018 at 12:44 AM 邱朗 <[hidden email]> wrote:

> Hi,
>
>
> It was exactly like you said, my bad, so now I have built an icu version.
> BUT unfortunately it still does not support CJK, why is that ?
>
>
> qiulangs-MacBook-Pro:sqlite-autoconf-3250100 qiulang$ ./sqlite3
> SQLite version 3.25.1 2018-09-18 20:20:44
> Enter ".help" for usage hints.
> Connected to a transient in-memory database.
> Use ".open FILENAME" to reopen on a persistent database.
> sqlite> CREATE VIRTUAL TABLE zh_text USING fts4(text, tokenize=icu zh_CN);
> sqlite> INSERT INTO zh_text values('为什么不支持中文 icu does not seem to work for
> chinese');
> sqlite> select * from zh_text where text match 'work';
> 为什么不支持中文 icu does not seem to work for chinese
> sqlite> select * from zh_text where text match '中';
> sqlite>
>
>
> BTW, whoever hit the icu4c error it may be because you make the same
> mistake as I did. So I first run brew link icu4c, but brew refused,
> "Warning: Refusing to link macOS-provided software: icu4c", then I forgot
> to add it to my path :$
>
>
> If you run brew info icu4c, it will tell you that but actually I didn't
> set them and compiler still can find them
>
>
> For compilers to find icu4c you may need to set:
>   export LDFLAGS="-L/usr/local/opt/icu4c/lib"
>   export CPPFLAGS="-I/usr/local/opt/icu4c/include"
>
>
> Thanks,
> Qiulang
> At 2018-09-21 23:43:01, "Dan Kennedy" <[hidden email]> wrote:
> >On 09/21/2018 09:44 PM, 邱朗 wrote:
> >> I actually first used  ./configure CFLAGS="-DSQLITE_ENABLE_ICU
> `icu-config --cppflags`" LDFLAGS="`icu-config --ldflags`"  But I got the
> error
> >
> >When you ran this configure command, is the first line out output
> >something like the following?
> >
> >   bash: icu-config: command not found
> >
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why sqlite fts5 Unicode61 Tokenizer does not support CJK(Chinese Japanese Krean)?

邱朗
Hi Hideaki,

Thanks for your reply which made me figure out why I said icu version does "not" support Chinese: b/c in Chinese '中文' can be tokenize as either '中文' or '中' or '文' so when query '中文' or '中*' I can get the result but no result when query '文'. The same goes to '为什么', which can be be tokenize as either '为什么' or '为' or '什么' so no result for when query '什么'

And sadly fts5+unicode 61 definitely does not support Chinese.



BTW, it also helps me realize that I had answered this question myself at 2014 here, https://stackoverflow.com/a/31396975/301513. So basically icu does the same as iOS CFStringTokenizer


Qiulang

在 2018-09-22 22:49:24,"Hideaki Takahashi" <[hidden email]> 写道:

>Hello,
>
>full text search index can be used to see how the text is tokenized for
>both FTS4 and FTS5.
>for FTS4, fts3tokenize can be used too.
>
>sqlite> CREATE VIRTUAL TABLE icu_zh_cn USING fts3tokenize(icu, zh_CN);
>sqlite> SELECT token, start, end, position FROM icu_zh_cn WHERE
>INPUT='为什么不支持中文 fts5 does not seem to work for chinese';
>为什么|0|9|0
>不|9|12|1
>支持|12|18|2
>中文|18|24|3
>fts5|25|29|4
>does|30|34|5
>not|35|38|6
>seem|39|43|7
>to|44|46|8
>work|47|51|9
>for|52|55|10
>chinese|56|63|11
>
>based on the output above, the query below works and makes sense to me.
>sqlite> select * from zh_text where text match '中文';
>为什么不支持中文 icu does not seem to work for chinese
>
>
>FTS5 + unicode61
>sqlite> CREATE VIRTUAL TABLE ft5_test USING fts5(content, tokenize =
>'porter unicode61 remove_diacritics 1');
>sqlite> INSERT INTO ft5_test values('为什么不支持中文 fts5 does not seem to work
>for chinese');
>sqlite> CREATE VIRTUAL TABLE ft5_test_vocab_i USING fts5vocab(ft5_test,
>'instance');
>sqlite> SELECT term, doc, col, offset FROM ft5_test_vocab_i;
>(snip non-Chinese portion)
>为什么不支持中文|1|content|0
>
>FTS4 + ICU(zh_CN)
>sqlite> CREATE VIRTUAL TABLE zh_text USING fts4(text, tokenize=icu zh_CN);
>sqlite> INSERT INTO zh_text values('为什么不支持中文 icu does not seem to work for
>chinese');
>sqlite> CREATE VIRTUAL TABLE zh_terms USING fts4aux(zh_text);
>sqlite> SELECT term, col, documents FROM zh_terms;
>(snip non-Chinese portion)
>不|*|1
>不|0|1
>中文|*|1
>中文|0|1
>为什么|*|1
>为什么|0|1
>支持|*|1
>支持|0|1
>
>Thanks,
>Hideaki
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users