Speed up count(distinct col)

classic Classic list List threaded Threaded
42 messages Options
123
Reply | Threaded
Open this post in threaded view
|

Speed up count(distinct col)

Yuzem
Hi.
I have this tables:
CREATE TABLE movies(movie_id INTEGER, title TEXT, unique(movie_id));
CREATE TABLE tags(movie_id INTEGER, tag TEXT, unique(movie_id,tag));

I have many movies by tag and many tables like "tags" (keywords, countries, languages, genres, etc..)
I can count different movies very fast with:
SELECT count(*) FROM movies;

But the other tables are much slower depending on the size of the table:
SELECT count(distinct tag) FROM tags;

My solution is to create an additional table "count_tags" and then every time a tag is added to table "tags" a trigger adds the tag to "count_tags", I need also another trigger to remove the tag, then I can do:
SELECT count(*) FROM count_tags;

This solution implies one additional table and two triggers by each table.

I wanted to know if there is a simpler solution, maybe by using indexes.
Thanks in advance.
Reply | Threaded
Open this post in threaded view
|

Re: EXT : Speed up count(distinct col)

Black, Michael (IS)
Test#1
create index tags_index on tags(tag);
You should have an index for any fields you query on like this.

Test#2
Normalize your tags into a separate table so you then store the rowid in your tags table instead of the string.  Your compares will be notably faster using integers rather than strings for more fancy queries.

Michael D. Black
Senior Scientist
NG Information Systems
Advanced Analytics Directorate



________________________________________
From: [hidden email] [[hidden email]] on behalf of Yuzem [[hidden email]]
Sent: Monday, February 07, 2011 9:26 AM
To: [hidden email]
Subject: EXT :[sqlite]  Speed up count(distinct col)

Hi.
I have this tables:
CREATE TABLE movies(movie_id INTEGER, title TEXT, unique(movie_id));
CREATE TABLE tags(movie_id INTEGER, tag TEXT, unique(movie_id,tag));

I have many movies by tag and many tables like "tags" (keywords, countries,
languages, genres, etc..)
I can count different movies very fast with:
SELECT count(*) FROM movies;

But the other tables are much slower depending on the size of the table:
SELECT count(distinct tag) FROM tags;

My solution is to create an additional table "count_tags" and then every
time a tag is added to table "tags" a trigger adds the tag to "count_tags",
I need also another trigger to remove the tag, then I can do:
SELECT count(*) FROM count_tags;

This solution implies one additional table and two triggers by each table.

I wanted to know if there is a simpler solution, maybe by using indexes.
Thanks in advance.
--
View this message in context: http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30864622.html
Sent from the SQLite mailing list archive at Nabble.com.

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Speed up count(distinct col)

Petite Abeille-2
In reply to this post by Yuzem

On Feb 7, 2011, at 4:26 PM, Yuzem wrote:

>
> I have many movies by tag and many tables like "tags" (keywords, countries,
> languages, genres, etc..)

Hey... sounds like IMdb :P

> This solution implies one additional table and two triggers by each table.

To paraphrase:

Some people, when confronted with a problem, think “I know, I'll use triggers.”   Now they have two problems.

In any case, as mentioned by Michael Black, you might benefit greatly by normalizing your table and indexing the foreign key:

http://dev.alt.textdrive.com/browser/IMDB/IMDB.ddl#L401

Don't bother with the triggers.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: EXT : Speed up count(distinct col)

Yuzem
In reply to this post by Black, Michael (IS)

Black, Michael (IS) wrote
Test#1
create index tags_index on tags(tag);
You should have an index for any fields you query on like this.
Thanks Michael but I don't see any speed improvement:
create index test on tags(tag);
select count(distinct tag) from tags;

This is much faster:
select count(*) from tags;

Am I doing something wrong?
Reply | Threaded
Open this post in threaded view
|

Re: EXT : Speed up count(distinct col)

Eric Rubin-Smith
Yuzem wrote:

> >
> > Test#1
> > create index tags_index on tags(tag);
> > You should have an index for any fields you query on like this.
> >
>
> Thanks Michael but I don't see any speed improvement:
> create index test on tags(tag);
> select count(distinct tag) from tags;
>
> This is much faster:
> select count(*) from tags;
>
> Am I doing something wrong?

As Michael mentioned, you might be getting killed by string comparisons.

I'm no SQL expert, so gurus are welcome to add to the following without
insulting me.

To beef up Michael's suggestion, try something like this:

CREATE TABLE movies(movie_id INTEGER, title TEXT, unique(movie_id));

(BTW, you should consider making movie_id "INTEGER PRIMARY KEY" and then
remove "unique(movie_id)" -- as long as that otherwise makes sense for
your architecture.  For SQLite-specific reasons that will probably be
faster.)

CREATE TABLE tagNames(tagId INTEGER PRIMARY KEY, tagName TEXT UNIQUE);
CREATE INDEX tagNamesIdxtagName on tagNames(tagName);
CREATE TABLE tags(movie_id INTEGER,
                  tagId INTEGER REFERENCES tagNames,
                  unique(movie_id,tag));

CREATE INDEX tagsIdxTagId on tags(tagId);

Then see how fast it is to ask
SELECT COUNT(*) FROM tagString;

Note: The index tagNamesIdxtagName is there because you'll probably want it
to speed up insertions into 'tags'.

Note: If foreign key checking is on, be careful about deletes on table
"tagNames".  Without an index on tags(tagId) a delete of a single row on
tagNames implies a full table scan on tags (to make sure there are no
rows referencing it).  That bit me in the past.

Eric

--
Eric A. Smith

The concept is interesting and well-formed, but in order to earn
better than a 'C,' the idea must be feasible.
    -- A Yale University management professor in response to Fred Smith's paper
       proposing reliable overnight delivery service.
       (Smith went on to found Federal Express Corp.)
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Speed up count(distinct col)

Yuzem
In reply to this post by Petite Abeille-2
Petite Abeille-2 wrote
Hey... sounds like IMdb :P
Yes, I'm coding a movie manager that grabs the info from imdb.

Petite Abeille-2 wrote
In any case, as mentioned by Michael Black, you might benefit greatly by normalizing your table and indexing the foreign key:

http://dev.alt.textdrive.com/browser/IMDB/IMDB.ddl#L401
I thought sqlite didn't handle foreign keys correctly, I am using triggers to automatically delete tags when a movie gets deleted.
If I use a foreign key will it automatically remove the tag if the movie gets deleted?
Anyway, to use integers in the "tags" table is very complicated because I will have to assign the corresponding number to each tag that I insert and I have to insert lots of keywords for every movie.
Does sqlite has any function to convert a text string into an unique number?
Reply | Threaded
Open this post in threaded view
|

Re: Speed up count(distinct col)

Eric Rubin-Smith
Yuzem wrote:

> I thought sqlite didn't handle foreign keys correctly, I am using
> triggers to automatically delete tags when a movie gets deleted.  

There's a new thing, 'PRAGMA foreign_keys=1', that causes SQLite to
enforce them.  Check out the docs on that.  

> Anyway, to use integers in the "tags" table is very complicated because
> I will have to assign the corresponding number to each tag that I insert
> and I have to insert lots of keywords for every movie.  

Well, technically you can have SQLite do that assignment for you.  But
yes, you'll have to deal with that mapping somewhere.  You can abstract
that away at a very low level in your infrastructure though.

It's not so terrible: again it's something like (in one high level
language)

set tagId [db one {SELECT tagId FROM tagNames WHERE tagName=$tagName}]
db eval {INSERT INTO tags VALUES($movieId, $tagId)}

That one SELECT lookup will probably save you tons of string comparisons
later on.  Depends of course on the nature of your data set and query
stats.

> Does sqlite has any function to convert a text string into an unique
> number?  

If you run (again correcting for your preferred language)

db eval {INSERT INTO tagNames(tagName) VALUES($tagName)}

then SQLite will generate the tagId column value for you if the column
is declared INTEGER PRIMARY KEY.  Again, you can get it back out at some
point later using SELECT as above, or if it's needed immediately (which
it probably is) you can use last_insert_rowid.

So, again in Tcl, your whole insertion path will probably look something
like:

db eval {INSERT OR IGNORE INTO tagNames(tagName) VALUES($tagName)}
if {[db changes]} {
  set tagId [db last_insert_rowid]
} else {
  set tagId [db one {SELECT tagId FROM tagNames WHERE tagName=$tagName}]
}
assert {$tagId!=""}
db eval {INSERT OR IGNORE INTO tags VALUES($movieId, $tagId)}

That could possibly be optimized further.  But you could give it a go as
a first pass and see how far it takes you.  It's not so complicated, and
in 7 lines (plus 1 sanity-check line) we've abstracted the ID<->text
mapping.

Eric

--
Eric A. Smith

Carperpetuation (kar' pur pet u a shun), n.:
    The act, when vacuuming, of running over a string at least a
    dozen times, reaching over and picking it up, examining it, then
    putting it back down to give the vacuum one more chance.
        -- Rich Hall, "Sniglets"
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Speed up count(distinct col)

Josh Marell
In reply to this post by Yuzem
I would create a tagList table (integer tagID, string tagName,
unique(tagName))

Before performing your batch of inserts, query the tagList table (integer
tagID, string tagName), and generate a map (key on tagName, value of tagID).
For each tag you want to insert, see if it exists in the map.  If it
doesn't, insert first into tagList table and get the tagID of that new entry
and update your map.  Perform your insert with the tagID instead now.  If it
does exist, use the value of the tagName key in your map.

I chose a map (or some similar implementation), because the lookup should be
quick, and you won't need to query the DB for each new tag during your batch
of inserts.

On Mon, Feb 7, 2011 at 2:32 PM, Yuzem <[hidden email]> wrote:

>
>
> Petite Abeille-2 wrote:
> >
> > Hey... sounds like IMdb :P
> >
> Yes, I'm coding a  http://yuzem.blogspot.com/p/figuritas-screenshots.html
> movie manager  that grabs the info from imdb.
>
>
> Petite Abeille-2 wrote:
> >
> > In any case, as mentioned by Michael Black, you might benefit greatly by
> > normalizing your table and indexing the foreign key:
> >
> > http://dev.alt.textdrive.com/browser/IMDB/IMDB.ddl#L401
> >
>
> I thought sqlite didn't handle foreign keys correctly, I am using triggers
> to automatically delete tags when a movie gets deleted.
> If I use a foreign key will it automatically remove the tag if the movie
> gets deleted?
> Anyway, to use integers in the "tags" table is very complicated because I
> will have to assign the corresponding number to each tag that I insert and
> I
> have to insert lots of keywords for every movie.
> Does sqlite has any function to convert a text string into an unique
> number?
>
> --
> View this message in context:
> http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30867411.html
> Sent from the SQLite mailing list archive at Nabble.com.
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Speed up count(distinct col)

Petite Abeille-2
In reply to this post by Yuzem

On Feb 7, 2011, at 9:32 PM, Yuzem wrote:

>> Hey... sounds like IMdb :P
>>
> Yes, I'm coding a  http://yuzem.blogspot.com/p/figuritas-screenshots.html
> movie manager  that grabs the info from imdb.

A fine hobby :)

> I thought sqlite didn't handle foreign keys correctly,

I thought wrong :)

http://www.sqlite.org/foreignkeys.html

> I am using triggers
> to automatically delete tags when a movie gets deleted.

Don't bother.

> If I use a foreign key will it automatically remove the tag if the movie
> gets deleted?

See above.

> Anyway, to use integers in the "tags" table is very complicated

Nah... it's called normalization... it's a good thing :)

http://en.wikipedia.org/wiki/Database_normalization

> because I will have to assign the corresponding number to each tag that I insert

Yep. It's a good thing.


> and I
> have to insert lots of keywords for every movie.

IMdb contains ~3.5M movie-keyword pairs (~405K distinct movies, 107K distinct keywords). Not much to write home about.


> Does sqlite has any function to convert a text string into an unique number?

You do it the other way around: create a unique entry in your tag table, then use the tag id in the movie_tag table that join movies to tags. Think of it as compression: instead of storing, say, the literal "accidental-cannibalism" multiple times in your movie_tag, you store the literal only once in the tag table, than use the literal id in the join table. And they lived happily ever after third normal form :))


 
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: EXT : Speed up count(distinct col)

Black, Michael (IS)
In reply to this post by Yuzem
Mabye one of our experts can explain why distinct takes to long.
Seems to me if you have an index you're just returning the values in the index, aren't you?
But it looks like it's comparing all keys to all keys.

Do you just want a count of distinct values?  
select count(distinct(tag)) from tags;
I think that will run faster.

Then again...if you normalize your tags you end up with just count(*) from the new tag table to find all unique keys
sqlite> create index tags_index on tags(tag);
sqlite> explain select distinct(tag) from tags;
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Trace          0     0     0                    00
1     Noop           0     0     0                    00
2     Integer        0     3     0                    00
3     Integer        0     2     0                    00
4     Gosub          5     28    0                    00
5     Goto           0     31    0                    00
6     OpenRead       2     4     0     keyinfo(1,BINARY)  00
7     Rewind         2     18    8     0              00
8     Column         2     0     7                    00
9     Compare        6     7     1     keyinfo(1,BINARY)  00
10    Jump           11    15    11                   00
11    Move           7     6     1                    00
12    Gosub          4     23    0                    00
13    IfPos          3     30    0                    00
14    Gosub          5     28    0                    00
15    Column         2     0     1                    00
16    Integer        1     2     0                    00
17    Next           2     8     0                    00
18    Close          2     0     0                    00
19    Gosub          4     23    0                    00
20    Goto           0     30    0                    00
21    Integer        1     3     0                    00
22    Return         4     0     0                    00
23    IfPos          2     25    0                    00
24    Return         4     0     0                    00
25    Copy           1     9     0                    00
26    ResultRow      9     1     0                    00
27    Return         4     0     0                    00
28    Null           0     1     0                    00
29    Return         5     0     0                    00
30    Halt           0     0     0                    00
31    Transaction    0     0     0                    00
32    VerifyCookie   0     2     0                    00
33    TableLock      0     2     0     tags           00
34    Goto           0     6     0                    00
sqlite> explain select count(*) from tags;
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Trace          0     0     0                    00
1     Goto           0     8     0                    00
2     OpenRead       1     4     0     keyinfo(1,BINARY)  00
3     Count          1     1     0                    00
4     Close          1     0     0                    00
5     Copy           1     2     0                    00
6     ResultRow      2     1     0                    00
7     Halt           0     0     0                    00
8     Transaction    0     0     0                    00
9     VerifyCookie   0     2     0                    00
10    TableLock      0     2     0     tags           00
11    Goto           0     2     0                    00
sqlite> explain select count(distinct(tag)) from tags;
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Trace          0     0     0                    00
1     Null           0     2     0                    00
2     Null           0     1     0                    00
3     OpenEphemeral  1     0     0     keyinfo(1,BINARY)  00
4     Goto           0     18    0                    00
5     OpenRead       0     2     0     2              00
6     Rewind         0     13    0                    00
7     Column         0     1     3                    00
8     Found          1     12    3     1              00
9     MakeRecord     3     1     4                    00
10    IdxInsert      1     4     0                    00
11    AggStep        0     3     1     count(1)       01
12    Next           0     7     0                    01
13    Close          0     0     0                    00
14    AggFinal       1     1     0     count(1)       00
15    Copy           1     5     0                    00
16    ResultRow      5     1     0                    00
17    Halt           0     0     0                    00
18    Transaction    0     0     0                    00
19    VerifyCookie   0     2     0                    00
20    TableLock      0     2     0     tags           00
21    Goto           0     5     0                    00

Michael D. Black
Senior Scientist
NG Information Systems
Advanced Analytics Directorate



________________________________________
From: [hidden email] [[hidden email]] on behalf of Yuzem [[hidden email]]
Sent: Monday, February 07, 2011 2:13 PM
To: [hidden email]
Subject: Re: [sqlite] EXT :  Speed up count(distinct col)

Black, Michael (IS) wrote:
>
> Test#1
> create index tags_index on tags(tag);
> You should have an index for any fields you query on like this.
>

Thanks Michael but I don't see any speed improvement:
create index test on tags(tag);
select count(distinct tag) from tags;

This is much faster:
select count(*) from tags;

Am I doing something wrong?
--
View this message in context: http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30867275.html
Sent from the SQLite mailing list archive at Nabble.com.

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: EXT : Speed up count(distinct col)

Pavel Ivanov-2
> Mabye one of our experts can explain why distinct takes to long.
> Seems to me if you have an index you're just returning the values in the index, aren't you?

If index is unique then you just count number of values in the index
and that's it. If index is not unique then it contains many non-unique
values. So you have to traverse all values and compare each of them
with the previous one to see whether it's different or not. If you
remove word 'distinct' then it will indeed just return values from the
index.


Pavel

On Mon, Feb 7, 2011 at 4:10 PM, Black, Michael (IS)
<[hidden email]> wrote:

> Mabye one of our experts can explain why distinct takes to long.
> Seems to me if you have an index you're just returning the values in the index, aren't you?
> But it looks like it's comparing all keys to all keys.
>
> Do you just want a count of distinct values?
> select count(distinct(tag)) from tags;
> I think that will run faster.
>
> Then again...if you normalize your tags you end up with just count(*) from the new tag table to find all unique keys
> sqlite> create index tags_index on tags(tag);
> sqlite> explain select distinct(tag) from tags;
> addr  opcode         p1    p2    p3    p4             p5  comment
> ----  -------------  ----  ----  ----  -------------  --  -------------
> 0     Trace          0     0     0                    00
> 1     Noop           0     0     0                    00
> 2     Integer        0     3     0                    00
> 3     Integer        0     2     0                    00
> 4     Gosub          5     28    0                    00
> 5     Goto           0     31    0                    00
> 6     OpenRead       2     4     0     keyinfo(1,BINARY)  00
> 7     Rewind         2     18    8     0              00
> 8     Column         2     0     7                    00
> 9     Compare        6     7     1     keyinfo(1,BINARY)  00
> 10    Jump           11    15    11                   00
> 11    Move           7     6     1                    00
> 12    Gosub          4     23    0                    00
> 13    IfPos          3     30    0                    00
> 14    Gosub          5     28    0                    00
> 15    Column         2     0     1                    00
> 16    Integer        1     2     0                    00
> 17    Next           2     8     0                    00
> 18    Close          2     0     0                    00
> 19    Gosub          4     23    0                    00
> 20    Goto           0     30    0                    00
> 21    Integer        1     3     0                    00
> 22    Return         4     0     0                    00
> 23    IfPos          2     25    0                    00
> 24    Return         4     0     0                    00
> 25    Copy           1     9     0                    00
> 26    ResultRow      9     1     0                    00
> 27    Return         4     0     0                    00
> 28    Null           0     1     0                    00
> 29    Return         5     0     0                    00
> 30    Halt           0     0     0                    00
> 31    Transaction    0     0     0                    00
> 32    VerifyCookie   0     2     0                    00
> 33    TableLock      0     2     0     tags           00
> 34    Goto           0     6     0                    00
> sqlite> explain select count(*) from tags;
> addr  opcode         p1    p2    p3    p4             p5  comment
> ----  -------------  ----  ----  ----  -------------  --  -------------
> 0     Trace          0     0     0                    00
> 1     Goto           0     8     0                    00
> 2     OpenRead       1     4     0     keyinfo(1,BINARY)  00
> 3     Count          1     1     0                    00
> 4     Close          1     0     0                    00
> 5     Copy           1     2     0                    00
> 6     ResultRow      2     1     0                    00
> 7     Halt           0     0     0                    00
> 8     Transaction    0     0     0                    00
> 9     VerifyCookie   0     2     0                    00
> 10    TableLock      0     2     0     tags           00
> 11    Goto           0     2     0                    00
> sqlite> explain select count(distinct(tag)) from tags;
> addr  opcode         p1    p2    p3    p4             p5  comment
> ----  -------------  ----  ----  ----  -------------  --  -------------
> 0     Trace          0     0     0                    00
> 1     Null           0     2     0                    00
> 2     Null           0     1     0                    00
> 3     OpenEphemeral  1     0     0     keyinfo(1,BINARY)  00
> 4     Goto           0     18    0                    00
> 5     OpenRead       0     2     0     2              00
> 6     Rewind         0     13    0                    00
> 7     Column         0     1     3                    00
> 8     Found          1     12    3     1              00
> 9     MakeRecord     3     1     4                    00
> 10    IdxInsert      1     4     0                    00
> 11    AggStep        0     3     1     count(1)       01
> 12    Next           0     7     0                    01
> 13    Close          0     0     0                    00
> 14    AggFinal       1     1     0     count(1)       00
> 15    Copy           1     5     0                    00
> 16    ResultRow      5     1     0                    00
> 17    Halt           0     0     0                    00
> 18    Transaction    0     0     0                    00
> 19    VerifyCookie   0     2     0                    00
> 20    TableLock      0     2     0     tags           00
> 21    Goto           0     5     0                    00
>
> Michael D. Black
> Senior Scientist
> NG Information Systems
> Advanced Analytics Directorate
>
>
>
> ________________________________________
> From: [hidden email] [[hidden email]] on behalf of Yuzem [[hidden email]]
> Sent: Monday, February 07, 2011 2:13 PM
> To: [hidden email]
> Subject: Re: [sqlite] EXT :  Speed up count(distinct col)
>
> Black, Michael (IS) wrote:
>>
>> Test#1
>> create index tags_index on tags(tag);
>> You should have an index for any fields you query on like this.
>>
>
> Thanks Michael but I don't see any speed improvement:
> create index test on tags(tag);
> select count(distinct tag) from tags;
>
> This is much faster:
> select count(*) from tags;
>
> Am I doing something wrong?
> --
> View this message in context: http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30867275.html
> Sent from the SQLite mailing list archive at Nabble.com.
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: EXT : Speed up count(distinct col)

Yuzem
Ok, thanks all for your answers, I guest I will have to normalize the database and explore the foreign key feature.

One more thing:
I have the tables "movies" and "people"
Those two tables are related by tables "directors", "writers", etc...
movies: id, title
people: id, name

directors: movieId, peopleId
writers: movieId, peopleId
etc...

How can I normalize that so I can count directors and writers using count(*)?
Should I create another table for "directors" and another for "writers", etc... with only the ids?
I will end up having:
movies > movies_directors < (directors people)

Thanks again!
Reply | Threaded
Open this post in threaded view
|

Re: EXT : Speed up count(distinct col)

Simon Slavin-3

On 7 Feb 2011, at 11:19pm, Yuzem wrote:

> One more thing:
> I have the tables "movies" and "people"
> Those two tables are related by tables "directors", "writers", etc...
> movies: id, title
> people: id, name
>
> directors: movieId, peopleId
> writers: movieId, peopleId
> etc...
>
> How can I normalize that so I can count directors and writers using
> count(*)?
> Should I create another table for "directors" and another for "writers",
> etc... with only the ids?
> I will end up having:
> movies > movies_directors < (directors people)


This is very basic database design.  You need to read up not on SQLite or even on SQL, but some books on how to arrange your data in databases.

One solution would include these three tables:

Movies: id, name, releaseDate, etc..

People: id, name, DOB, biog, etc..

MoviePeople: id, movie, person, capacity

Fill your Movies TABLE.  Fill your People TABLE with actors, directors, writers, etc..

Then fill the MoviePeople table with connections between the two: Clint Eastwood worked in one movie as actor, but he worked in another move as director, etc..

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Speed up count(distinct col)

BareFeetWare-3
In reply to this post by Eric Rubin-Smith
You have:

CREATE TABLE movies(movie_id INTEGER, title TEXT, unique(movie_id));
CREATE TABLE tags(movie_id INTEGER, tag TEXT, unique(movie_id,tag));

You can solve your problem, using pure SQL. No need to resort to the application layer. Just execute the SQL transaction below. It takes care of everything, including normalizing, assigning integer IDs to unique tags, filtering our duplicates and auto indexes. I haven't tested on your tables specifically, but I do similar all the time. Any errors should be simple name errors.

begin immediate;
create temp table "Movies Old" as select * from Movies;
create temp table "Tags Old" from Tags;
drop table Movies;
drop table Tags;
create table Movies
( Movie_ID integer primary key not null
, Title text collate no case not null
);
create table Tags
( Tag_ID integer primary key not null
, Name text not null unique collate nocase
);
create table Movie_Tags
( ID integer primary key not null
, Movie_ID integer not null
                references Movies (Movie_ID) on delete cascade
, Tag_ID integer not null
                references Tags (Tag_ID) on delete cascade
, unique (Movie_ID, Tag_ID)
);
insert into Movies (Movie_ID, Title) select Movie_ID, Title from "Movies Old";

insert or ignore into Tags (Name) select distinct Tag from "Tags Old";

insert into Movie_Tags (Movie_ID, Tag_ID)
select distinct Movie_ID, (select Tag_ID from Tags where Name = "Tags Old".Tag)
from "Tags Old";

drop table "Movies Old";
drop table "Tags Old";

commit or rollback;

Note that the last line "commit or rollback" is not an actual SQLite command. You will need to manually decide at that point whether to issue a commit or rollback. If there are any errors in previous lines, use rollback. Unfortunately, SQLite does not automatically rollback all commands in a transaction if one of the commands fails (specifically create and drop commands).

Then you should be able to count tags very fast by:

select count(*) from Tags;

In order to make foreign keys work, you need to have SQLite version 3.6.19 or later, and use this pragma when you open a connection to SQLite (ie before any commands that require use of foreign keys):

pragma foreign_keys = yes;

A normalized database using integer keys is fast, small and elegant. Viewing integers by users is ugly. So for viewing the Movie-Tag combinations (and data entry), you might want to create a view to make it user friendly. It depends on your user interface. Reply here if you need that. Just something like this:

create view "Movie Tags View"
as select
        Movie_Tags.ID as ID
, Movies.Title as "Movie"
, Tags.Name as "Tag"
from Movie_Tags
        join Movies on Movie_Tags.Movie_ID = Movies.Movie_ID
        join Tags on Movie_Tags.Tag_ID = Tags.Tag_ID
;

Hope this helps,
Tom
BareFeetWare

 --
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: EXT : Speed up count(distinct col)

BareFeetWare-2
In reply to this post by Yuzem
You have:

CREATE TABLE movies(movie_id INTEGER, title TEXT, unique(movie_id));
CREATE TABLE tags(movie_id INTEGER, tag TEXT, unique(movie_id,tag));

You can solve your problem, using pure SQL. No need to resort to the application layer. Just execute the SQL transaction below. It takes care of everything, including normalizing, assigning integer IDs to unique tags, filtering our duplicates and auto indexes. I haven't tested on your tables specifically, but I do similar all the time. Any errors should be simple name errors.

begin immediate;
create temp table "Movies Old" as select * from Movies;
create temp table "Tags Old" from Tags;
drop table Movies;
drop table Tags;
create table Movies
( Movie_ID integer primary key not null
, Title text collate no case not null
);
create table Tags
( Tag_ID integer primary key not null
, Name text not null unique collate nocase
);
create table Movie_Tags
( ID integer primary key not null
, Movie_ID integer not null
                references Movies (Movie_ID) on delete cascade
, Tag_ID integer not null
                references Tags (Tag_ID) on delete cascade
, unique (Movie_ID, Tag_ID)
);
insert into Movies (Movie_ID, Title) select Movie_ID, Title from "Movies Old";

insert or ignore into Tags (Name) select distinct Tag from "Tags Old";

insert into Movie_Tags (Movie_ID, Tag_ID)
select distinct Movie_ID, (select Tag_ID from Tags where Name = "Tags Old".Tag)
from "Tags Old";

drop table "Movies Old";
drop table "Tags Old";

commit or rollback;

Note that the last line "commit or rollback" is not an actual SQLite command. You will need to manually decide at that point whether to issue a commit or rollback. If there are any errors in previous lines, use rollback. Unfortunately, SQLite does not automatically rollback all commands in a transaction if one of the commands fails (specifically create and drop commands).

Then you should be able to count tags very fast by:

select count(*) from Tags;

In order to make foreign keys work, you need to have SQLite version 3.6.19 or later, and use this pragma when you open a connection to SQLite (ie before any commands that require use of foreign keys):

pragma foreign_keys = yes;

A normalized database using integer keys is fast, small and elegant. Viewing integers by users is ugly. So for viewing the Movie-Tag combinations (and data entry), you might want to create a view to make it user friendly. It depends on your user interface. Reply here if you need that. Just something like this:

create view "Movie Tags View"
as select
        Movie_Tags.ID as ID
, Movies.Title as "Movie"
, Tags.Name as "Tag"
from Movie_Tags
        join Movies on Movie_Tags.Movie_ID = Movies.Movie_ID
        join Tags on Movie_Tags.Tag_ID = Tags.Tag_ID
;

Hope this helps,
Tom
BareFeetWare

--
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: EXT : Speed up count(distinct col)

Yuzem
Hi, thanks a lot for the helpful replies.
Sorry to bother again, but there is still something that hasn't been answered.

Simon Slavin-3 has addressed my question but not exactly what I was asking.
Suppose I have two tables "movies" and "people" and other tables to relate both tables: "directors", "writers", etc...

Simon Slavin-3 told me to create one single table to relate "movies" and "people":
MoviePeople: id, movie, person, capacity

But with that solution I can't perform the fast count(*) that I want.
I would have to do:
SELECT count(distinct person) FROM MoviePeople WHERE capacity = "director";

I want to know if there is any standard solution for normalizing this database so I can do count(*) to count all directors or writers...

Thanks in advance for your help and patience.
Reply | Threaded
Open this post in threaded view
|

Re: Speed up count(distinct col)

BareFeetWare-2
In reply to this post by Yuzem
On 08/02/2011, at 10:19 AM, Yuzem wrote:

> I have the tables "movies" and "people"
> Those two tables are related by tables "directors", "writers", etc...
> movies: id, title
> people: id, name
>
> directors: movieId, peopleId
> writers: movieId, peopleId
> etc...
>
> How can I normalize that so I can count directors and writers using
> count(*)?
> Should I create another table for "directors" and another for "writers",
> etc... with only the ids?
> I will end up having:
> movies > movies_directors < (directors people)

After implementing the schema in my previous post, add this:

begin immediate;
create table People
( ID integer primary key not null
, Name text collate no case not null unique
);
create table Capacity
( ID integer primary key not null
, Name text collate no case not null unique
);
create table "Movie People"
( ID integer primary key not null
, Movie_ID integer not null references Movies (Movie_ID) on delete cascade
, Capacity_ID integer not null references Capacity (ID) on delete cascade
, People_ID integer not null references People (ID) on delete cascade
, unique (Movie_ID, Capacity_ID, People_ID)
);
commit;

Then you can count the directors like this:

select count(distinct People_ID) from "Movie People" join Capacity on "Movie People".Capacity_ID = Capacity.ID where Capacity.Name = 'director';

or:

select count(distinct People_ID) from "Movie People" where Capacity_ID = (select ID from Capacity where Name = 'director');

or you can create a view to alphabetically list each director with a CSV list of their movies, like this:

create view "Directors"
as
select People.Name, group_concat (Movies.Title, ', ')
from "Movie People"
        join Movies on "Movie People".Movie_ID = Movies.Movie_ID
        join Capacity on "Movie People".Capacity_ID = Capacity.ID
        join People on "Movie People".People_ID = People.ID
where Capacity.Name = 'director';
group by People_ID
order by People.Name
;

Tom
BareFeetWare

 --
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Speed up count(distinct col)

Yuzem
BareFeetWare-2 wrote
Then you can count the directors like this:

select count(distinct People_ID) from "Movie People" join Capacity on "Movie People".Capacity_ID = Capacity.ID where Capacity.Name = 'director';

or:

select count(distinct People_ID) from "Movie People" where Capacity_ID = (select ID from Capacity where Name = 'director');

or you can create a view [...]
Thanks you but what I want to do is to count without using count(distinct col) because it is much slower than count(*).
In the previous examples about tags I can do it, I have "movies" and "tags" related by "movies_tags"
and I can do:
SELECT count(*) FROM tags;
I want to know if there is any standard approach to do that.
The only thing I can think of is to have additional tables for the ids of all directors, all writers, etc...
Tables "movies", "people", "capacity" and then tables "directors", "writers", etc.. with only the IDs so I can count using count(*) which is super fast.
Reply | Threaded
Open this post in threaded view
|

Re: Speed up count(distinct col)

Simon Slavin-3

On 10 Feb 2011, at 4:00pm, Yuzem wrote:

> Thanks you but what I want to do is to count without using count(distinct
> col) because it is much slower than count(*).

How much slower.  Did you make an index SQLite could use for that query ?  A good index for that might be

People_ID,Capacity_ID

but it might be faster with those two the other way around.  Try them both out and see which gives you faster results.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Speed up count(distinct col)

BareFeetWare-2
In reply to this post by Yuzem
> Thanks you but what I want to do is to count without using count(distinct col) because it is much slower than count(*).

I think you'll find the schema I posted very fast, since it's running everything, including distinct, on primary key columns.

Or you can try this:

select count(*) from People
where ID in
( select People_ID from "Movie People" where Capacity_ID =
                (select ID from Capacity where Name = 'director')
);

But I expect you'll get the same performance.

You don't want to denormalize and have separate director and writer tables etc. That will get ugly.

Tom
BareFeetWare

 --
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
123