finding pairwise similarity

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

finding pairwise similarity

Puneet Kishor-2
I have two tables like so

    CREATE TABLE c (c_no INTEGER PRIMARY KEY, c_name TEXT);
    CREATE TABLE t (t_no INTEGER PRIMARY KEY, t_name TEXT);
    CREATE TABLE c_t (c_no INTEGER, t_no INTEGER);
   
Every row in `c` has one or more `t`. I want, in a single SQL, the following

    source_c, target_c, similarity
   
where `similarity` is a measure of overlap of `t` between each pair of `c` and is calculated as the number of `t` common between a given pair divided by the total number of unique `t` in the pair. For example, given the following records

    c   t
    --- ---
    1   1
    1   2
    1   3
    2   1
    2   5
   
I want the result to be

    source_c target_c similarity
    -------- -------- ----------
    1           2     0.2 (= 1 common `t` between the 2 `c` / by total 4 `t`)

Would appreciate any nudge toward a solution.

--
Puneet Kishor
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: finding pairwise similarity

Igor Tandetnik
Mr. Puneet Kishor <[hidden email]> wrote:

> I have two tables like so
>
>    CREATE TABLE c (c_no INTEGER PRIMARY KEY, c_name TEXT);
>    CREATE TABLE t (t_no INTEGER PRIMARY KEY, t_name TEXT);
>    CREATE TABLE c_t (c_no INTEGER, t_no INTEGER);
>
> Every row in `c` has one or more `t`. I want, in a single SQL, the following
>
>    source_c, target_c, similarity
>
> where `similarity` is a measure of overlap of `t` between each pair of `c` and is calculated as the number of `t` common between
> a given pair divided by the total number of unique `t` in the pair.

select src.c_no source_c, trg.c_no target_c,
(select count(*) from (
 select t_no from c_t where c_t.c_no = src.c_no
 intersect
 select t_no from c_t where c_t.c_no = trg.c_no)) * 1.0
/
(select count(*) from (
 select t_no from c_t where c_t.c_no = src.c_no
 union
 select t_no from c_t where c_t.c_no = trg.c_no)) similarity
from c src join c trg on (src.c_no < trg.c_no);

--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: finding pairwise similarity

Jean-Christophe Deschamps-3
In reply to this post by Puneet Kishor-2


>where `similarity` is a measure of overlap of `t` between each pair of
>`c` and is calculated as the number of `t` common between a given pair
>divided by the total number of unique `t` in the pair.

Just like Igor, I take "unique `t` in pair" to mean "distinct `t` in pair".


>I want the result to be
>
>     source_c target_c similarity
>     -------- -------- ----------
>     1           2     0.2 (= 1 common `t` between the 2 `c` / by
> total 4 `t`)

Here I take 0.2 to mean 0.25.


If so, here's some more amunition:

select src.c_no "Source c",
        trg.c_no "Target c",
        (select count(*) from (select 1 from c_t where c_t.c_no in
(src.c_no, trg.c_no) group by c_t.t_no having count(*) > 1))
        * 1.0 /
        (select count(*) from (select distinct t_no from c_t where
c_t.c_no in (src.c_no, trg.c_no))) Similarity
        from c src join c trg on (src.c_no < trg.c_no);

If I'm not mistaken, both versions should give identical
results.  Check by yourself on real data and see which is faster for
you (most probably Igor's).

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users