Extending Porter Tokenizer


Abhinav Upadhyay-3
Hi,

I'm wondering if it is possible to extend the functionality of the
porter tokenizer. I would like to use the functionality of the Porter
tokenizer but before stemming the token, I want to decide whether the
token should be stemmed or not.

Do I need to copy the Porter tokenizer and modify it to suit my needs,
or is there a better way that minimizes code duplication?

-
Abhinav
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Re: Extending Porter Tokenizer

Matthias-Christian Ott
On 2016-07-05 18:11, Abhinav Upadhyay wrote:
> I'm wondering if it is possible to extend the functionality of the
> porter tokenizer. I would like to use the functionality of the Porter
> tokenizer but before stemming the token, I want to decide whether the
> token should be stemmed or not.
>
> Do I need to copy the Porter tokenizer and modify it to suit my needs,
> or is there a better way that minimizes code duplication?

The first argument of the Porter tokenizer is its parent tokenizer. The
Porter tokenizer calls the parent tokenizer's xTokenize function, passing
it an xToken callback of its own. That callback stems each token and then
forwards it to the xToken function originally passed to the Porter
tokenizer's xTokenize. So create a custom tokenizer to act as that
parent: its own callback can either pass a token on to the Porter
callback, or skip stemming by taking the original xToken function from
the xToken member of the pCtx parameter it was given:

typedef struct PorterContext PorterContext;
struct PorterContext {
  void *pCtx;
  int (*xToken)(void *pCtx, int tflags, const char *pToken, int nToken,
      int iStart, int iEnd);
  char *aBuf;
};

typedef struct CustomTokenizer CustomTokenizer;
struct CustomTokenizer {
  fts5_tokenizer tokenizer;
  Fts5Tokenizer *pTokenizer;
};

typedef struct CustomContext CustomContext;
struct CustomContext {
  void *pCtx;
  int (*xToken)(void *pCtx, int tflags, const char *pToken, int nToken,
      int iStart, int iEnd);
};

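int customToken(
  void *pCtx,
  int tflags,
  const char *pToken,
  int nToken,
  int iStart,
  int iEnd
){
  CustomContext *c = (CustomContext*)pCtx;
  PorterContext *p;
  int stem = 1;  /* placeholder: set from pToken/nToken by your own rule */

  if( stem ){
    /* Hand the token to the Porter callback, which stems it. */
    return c->xToken(c->pCtx, tflags, pToken, nToken, iStart, iEnd);
  }else{
    /* Skip stemming: call the xToken that was passed to the Porter
    ** tokenizer, taken from its internal context struct. */
    p = (PorterContext*)c->pCtx;
    return p->xToken(p->pCtx, tflags, pToken, nToken, iStart, iEnd);
  }
}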

int customTokenize(
  Fts5Tokenizer *pTokenizer,
  void *pCtx,
  int flags,
  const char *pText,
  int nText,
  int (*xToken)(void *, int, const char *, int nToken, int iStart,
      int iEnd)
){
  CustomTokenizer *t = (CustomTokenizer*)pTokenizer;
  CustomContext sCtx;
  sCtx.pCtx = pCtx;
  sCtx.xToken = xToken;
  return t->tokenizer.xTokenize(t->pTokenizer, (void*)&sCtx, flags,
      pText, nText, customToken);
}

Note that you are accessing an internal struct and relying on
implementation details, so with every release you have to check whether
the struct or any other relevant implementation detail has changed.

- Matthias-Christian
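
The callback chain described above can be tried without SQLite at all.
What follows is only a minimal sketch of the pattern, not FTS5 code: the
names Layer, Sink, sinkToken, toyStemToken, splitOnSpaces and demo are
made up for illustration, the "stemmer" just drops a trailing "s" as a
stand-in for the Porter algorithm, and the rule "do not stem capitalized
tokens" is an arbitrary example of a per-token decision.

```c
#include <string.h>

/* Same shape as the xToken callback quoted in the thread. */
typedef int (*TokenFn)(void *pCtx, int tflags, const char *pToken,
                       int nToken, int iStart, int iEnd);

/* Context of a wrapping layer: the next callback and its context. */
typedef struct Layer { void *pCtx; TokenFn xToken; } Layer;

/* Final consumer: collects the tokens it receives, space separated. */
typedef struct Sink { char z[128]; } Sink;

static int sinkToken(void *pCtx, int tflags, const char *pToken,
                     int nToken, int iStart, int iEnd){
  Sink *s = (Sink*)pCtx;
  size_t n = strlen(s->z);
  (void)tflags; (void)iStart; (void)iEnd;
  if( n + (size_t)nToken + 2 > sizeof(s->z) ) return 1; /* buffer full */
  if( n ) s->z[n++] = ' ';
  memcpy(&s->z[n], pToken, (size_t)nToken);
  s->z[n + (size_t)nToken] = '\0';
  return 0;
}

/* Stand-in for the stemming callback (NOT the Porter algorithm):
** drops one trailing 's', then forwards to the wrapped callback. */
static int toyStemToken(void *pCtx, int tflags, const char *pToken,
                        int nToken, int iStart, int iEnd){
  Layer *p = (Layer*)pCtx;
  if( nToken>1 && pToken[nToken-1]=='s' ) nToken--;
  return p->xToken(p->pCtx, tflags, pToken, nToken, iStart, iEnd);
}

/* The custom layer: decide per token whether to go through the
** stemming callback or around it. */
static int customToken(void *pCtx, int tflags, const char *pToken,
                       int nToken, int iStart, int iEnd){
  Layer *c = (Layer*)pCtx;
  /* Example rule only: leave capitalized tokens unstemmed. */
  int stem = !(nToken>0 && pToken[0]>='A' && pToken[0]<='Z');
  if( stem ){
    return c->xToken(c->pCtx, tflags, pToken, nToken, iStart, iEnd);
  }else{
    Layer *p = (Layer*)c->pCtx;  /* look inside the stemming context */
    return p->xToken(p->pCtx, tflags, pToken, nToken, iStart, iEnd);
  }
}

/* Toy base tokenizer: reports every run of non-space bytes. */
static int splitOnSpaces(const char *pText, int nText,
                         void *pCtx, TokenFn xToken){
  int i = 0;
  while( i<nText ){
    int iStart;
    while( i<nText && pText[i]==' ' ) i++;
    iStart = i;
    while( i<nText && pText[i]!=' ' ) i++;
    if( i>iStart ){
      int rc = xToken(pCtx, 0, &pText[iStart], i-iStart, iStart, i);
      if( rc ) return rc;
    }
  }
  return 0;
}

/* Wire the layers together the way the thread describes. */
int demo(const char *zText, Sink *pOut){
  Layer stemLayer;    /* what the stemming tokenizer passes down */
  Layer customLayer;  /* what the custom tokenizer passes down */
  pOut->z[0] = '\0';
  stemLayer.pCtx = pOut;          stemLayer.xToken = sinkToken;
  customLayer.pCtx = &stemLayer;  customLayer.xToken = toyStemToken;
  return splitOnSpaces(zText, (int)strlen(zText),
                       &customLayer, customToken);
}
```

Calling demo("cats Paris dogs", &sink) leaves "cat Paris dog" in sink.z:
the lower-case tokens pass through the stemming stand-in, while "Paris"
goes straight to the final consumer. In this sketch every struct is our
own, so peeking into stemLayer is harmless; in the real FTS5 case the
equivalent struct is internal to SQLite, which is why the warning above
applies.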
Re: Extending Porter Tokenizer

Abhinav Upadhyay-3
On Fri, Jul 8, 2016 at 3:01 AM, Matthias-Christian Ott <[hidden email]> wrote:


Thanks for the detailed response. I think this would work, but we are
currently using FTS4. The ability to call a parent tokenizer is really
what I need, but I don't think this is possible with FTS4, is it?

-
Abhinav
Re: Extending Porter Tokenizer

Dan Kennedy-4
On 07/10/2016 01:33 PM, Abhinav Upadhyay wrote:

> Thanks for the detailed response. I think this would work, but we are
> currently using FTS4. The ability to call a parent tokenizer is really
> what I need, but I don't think this is possible with FTS4, is it?

There is no way to do that with FTS4, unfortunately. I think you'll
either need to switch to FTS5 or make a copy of the porter stemmer code
and modify it to suit your purpose.

Dan.
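
Whichever route is taken, the per-token decision itself can be kept in
one small function that knows nothing about the tokenizer. The sketch
below is only an illustration: the name shouldStem, the azNoStem list
and its entries are hypothetical, and the rule (an exact,
case-insensitive match against a fixed list of ASCII words) is just one
possible policy. With FTS5 its result could supply the value of `stem`
in the callback shown earlier in the thread; in a modified copy of the
porter code it could guard the stemming step.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical list of lower-case tokens to index unstemmed. */
static const char *const azNoStem[] = { "sqlite", "fts", "news" };

/* Return 1 if the token should be stemmed, 0 if it should be kept
** as-is. pToken need not be nul-terminated; nToken is its length in
** bytes. The comparison ignores ASCII case. */
int shouldStem(const char *pToken, int nToken){
  size_t i;
  for(i=0; i<sizeof(azNoStem)/sizeof(azNoStem[0]); i++){
    const char *z = azNoStem[i];
    int j;
    if( (int)strlen(z)!=nToken ) continue;   /* lengths must match */
    for(j=0; j<nToken; j++){
      if( tolower((unsigned char)pToken[j])!=z[j] ) break;
    }
    if( j==nToken ) return 0;                /* listed: do not stem */
  }
  return 1;                                  /* not listed: stem */
}
```

For example, shouldStem("SQLite", 6) returns 0 and
shouldStem("running", 7) returns 1. A linear scan is enough for a short
list; a larger exception list would probably want a sorted array or a
hash table instead.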


