Is the first column of a composite primary key, special?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

Is the first column of a composite primary key, special?

Yannick Duchêne
Hi all,

Another mystery to me. Given this test table:

    CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))

… this query:

    SELECT Sum(c) FROM t GROUP BY a

… executes faster than any of these two:

    SELECT Sum(c) FROM t GROUP BY b
    SELECT Sum(c) FROM t GROUP BY c

… which executes in about the same time together, proportionally to the number of returned rows. With `GROUP BY a`, execution times seems to be about half than with the two formers. Adding or not adding a `WITHOUT ROWID` gives the same. I give the number of rows, to show if the first one is faster than the second one, that's not because it would returns less rows, on the opposite, it returns a bit more then with grouping by `b`:

 * Grouping by `a` results into 1360 rows in about 40ms +/-3;
 * Grouping by `b` results into 1170 rows in about 65ms +/-5;
 * Grouping by `c` results into 3154 rows in about 90ms +/-4.

If the primary key declaration is removed, timing when grouping by `b` or `c` does not change, while timing when grouping by `a` become the same as with the two formers.

I feel to witness this with both SQlite3 CLI and SQLiteBrowser (a detail I must mention after another thread).

Is there any thing special with the first column of a composite primary key? From an implementation point of view, this may makes sense, but I still prefer to ask.

I first noticed this another way. This test was just to check on a simpler case.

Initially, I indirectly noticed this with something similar to this:

    SELECT b, Sum(c) AS c
      FROM
       (SELECT b, Sum(c) AS c
        FROM t
        GROUP BY a, b)
      GROUP BY b -- 60 ms on average

… being faster than this second simpler alternative, something I notice with the test table too, just that the difference is less:

    SELECT b, Sum(c) AS c
      FROM t
      GROUP BY b -- 65 to 70 ms on average

… although the first one seems to run more operations, and it's still the same if I add an index on `b` for the second alternative and thus it does not use a temporary B‑tree for grouping.


I also noticed some other cases where queries executes faster on the first column of a composite key (with or without indexes), but I won't expose all cases, as I'm already too lengthy.


Enough testing for now, I will resume the investigations on this unexpected results, later.


--
Yannick Duchêne
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is the first column of a composite primary key, special?

Igor Tandetnik-2
On 1/31/2016 8:54 AM, Yannick Duchêne wrote:

> Another mystery to me. Given this test table:
>
>      CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))
>
> … this query:
>
>      SELECT Sum(c) FROM t GROUP BY a
>
> … executes faster than any of these two:
>
>      SELECT Sum(c) FROM t GROUP BY b
>      SELECT Sum(c) FROM t GROUP BY c

Imagine a phone directory book sorted by last name then first name. It's
easy to find all people named "Smith, John", as well as all people with
the last name of Smith. But the book's organization is of no help in
finding all people with the first name of John - you have no choice but
to scan the whole book.

Same with a database index (whether implicitly created for a primary
key, or otherwise) - it only helps when the condition involves some
prefix of the list of columns. In your example, it would help with
grouping (or sorting, or filtering) on (a), or (a, b), or (a, b, c).
--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is the first column of a composite primary key, special?

Dominique Devienne
In reply to this post by Yannick Duchêne
On Sun, Jan 31, 2016 at 2:54 PM, Yannick Duchêne <[hidden email]>
wrote:

> Another mystery to me. Given this test table:
>
>     CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))
>
> … this query:
>
>     SELECT Sum(c) FROM t GROUP BY a
>
> … executes faster than any of these two:
>
>     SELECT Sum(c) FROM t GROUP BY b
>     SELECT Sum(c) FROM t GROUP BY c
>
> … which executes in about the same time together, proportionally to the
> number of returned rows. With `GROUP BY a`, execution times seems to be
> about half than with the two formers. Adding or not adding a `WITHOUT
> ROWID` gives the same. I give the number of rows, to show if the first one
> is faster than the second one, that's not because it would returns less
> rows, on the opposite, it returns a bit more then with grouping by `b`:
>
>  * Grouping by `a` results into 1360 rows in about 40ms +/-3;
>  * Grouping by `b` results into 1170 rows in about 65ms +/-5;
>  * Grouping by `c` results into 3154 rows in about 90ms +/-4.
>

Run "explain query plan ..." [1] to see which plan SQLite chooses for your
different queries.

As [2] says "indexes are only useful if there are WHERE-clause constraints
on the left-most columns of the index",
with "left-most" being the keyword here. `a` is your left-most column, so
it works best with it.

But grouping by `b` may also benefits from the index, thanks to the
"skip-scan" optimization. (see [2] again).

if there are few distinct `a` values, the benefit of skip-scan improves.

The NGQP [3] is more sensitive than the old planner to good statistics, to
find the optimum plan.
So do compare you plans and performance before and after running ANALYZE on
your table(s).

[1] https://www.sqlite.org/eqp.html
[2] https://www.sqlite.org/optoverview.html#skipscan
[3] https://www.sqlite.org/queryplanner-ng.html
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is the first column of a composite primary key, special?

R Smith
In reply to this post by Yannick Duchêne


On 2016/01/31 3:54 PM, Yannick Duchêne wrote:

> Hi all,
>
> Another mystery to me. Given this test table:
>
>      CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))
>
> … this query:
>
>      SELECT Sum(c) FROM t GROUP BY a
>
> … executes faster than any of these two:
>
>      SELECT Sum(c) FROM t GROUP BY b
>      SELECT Sum(c) FROM t GROUP BY c
>
> … which executes in about the same time together, proportionally to the number of returned rows. With `GROUP BY a`, execution times seems to be about half than with the two formers. Adding or not adding a `WITHOUT ROWID` gives the same. I give the number of rows, to show if the first one is faster than the second one, that's not because it would returns less rows, on the opposite, it returns a bit more then with grouping by `b`:
>
>   * Grouping by `a` results into 1360 rows in about 40ms +/-3;
>   * Grouping by `b` results into 1170 rows in about 65ms +/-5;
>   * Grouping by `c` results into 3154 rows in about 90ms +/-4.
>
> If the primary key declaration is removed, timing when grouping by `b` or `c` does not change, while timing when grouping by `a` become the same as with the two formers.
>
> I feel to witness this with both SQlite3 CLI and SQLiteBrowser (a detail I must mention after another thread).
>
> Is there any thing special with the first column of a composite primary key? From an implementation point of view, this may makes sense, but I still prefer to ask.

Yes, there is something special about it - but not what you think perhaps.

First understand what an Index is and how it works.

Imagine you are asked to find all the people whose surnames are
"Duchêne" from the telephone directory. You would be able to do this
quite fast, because the phonebook is indexed first by Surname, then
name, then address. Perhaps a Telephone directory schema might look like
this:
   CREATE TABLE PhoneBook (Surname TEXT, Name TEXT, Address TEXT,
PhoneNo TEXT, PRIMARY KEY (Surname, Name, Address) );

Your query might look like this:
   SELECT * FROM PhoneBook WHERE Surname='Duchêne';

Imagine now that you are asked to find all people named "Yannick" in the
phone directory, like so:
   SELECT * FROM PhoneBook WHERE Name='Yannick';

Immediately that would go very slow because you have to look at each
Surname and see if there are any Yannicks in there, and the same problem
arise if you are asked to look for a specific address.

You will have a bright idea right the first time you are asked to do
this - you will make a list of names in alphabetical order followed by
surnames and keep it separate, so if ever you are asked again to find
someone by name, you can reference this second list to quickly see the
name and surname, and perhaps use that info to find them in the
PhoneBook and get the rest of the info. This second list is what is
called an Index - but it is not the PRIMARY index.

If you wish for all those searched to go fast, you need 3 Indices, not
simply a 3-sectiion primary Index.

Perhaps this SCHEMA would better suit your needs:

   CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))
   CREATE INDEX t_1 ON t (b);
   CREATE INDEX t_1 ON t (c);

Be careful though, every next Index will require a full datalist plus some overhead worth of space, and will make INSERTs slower because it has to insert more times and re-organize the B-Tree of every index a bit.
Best is to decide which searches you will do, make all Indices you think will be needed, then try the queries (using explain query plan), see which Indices are used and that the speed is good, then remove those who are not used.

HTH,
Ryan

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is the first column of a composite primary key, special?

Yannick Duchêne
On Sun, 31 Jan 2016 16:14:45 +0200
R Smith <[hidden email]> wrote:

I'm replying to Igor too with this message, as both of you had a similar answer.

>
> First understand what an Index is and how it works.
>
> Imagine you are asked to find all the people whose surnames are
> "Duchêne" from the telephone directory. You would be able to do this
> quite fast, because the phonebook is indexed first by Surname, then
> name, then address. Perhaps a Telephone directory schema might look like
> this:
>    CREATE TABLE PhoneBook (Surname TEXT, Name TEXT, Address TEXT,
> PhoneNo TEXT, PRIMARY KEY (Surname, Name, Address) );
>
> Your query might look like this:
>    SELECT * FROM PhoneBook WHERE Surname='Duchêne';
>
> Imagine now that you are asked to find all people named "Yannick" in the
> phone directory, like so:
>    SELECT * FROM PhoneBook WHERE Name='Yannick';
>
> Immediately that would go very slow because you have to look at each
> Surname and see if there are any Yannicks in there, and the same problem
> arise if you are asked to look for a specific address.


That makes sense, I agree, and I had this in mind for a short time, until it was shadowed by a bad bet I made for three reasons. If it's worth the words, let me tell:

I saw a page (can't retrieve the URL) suggesting to order table columns by names. It was strange to me, as I had the idea of a hierarchical access for tables access. But I though “there must be a good reason for them to say this”. Then in an SQLite page [1], there was a suggestion to avoid index containing the same data as a wider index. So after these two things, I tried to imagine ways of setting up an index so that this makes sense: I though a multi‑column key could be accessed by any column, using fragments whose content are ordered.

Precisely with the case of your example, I though the "name" column would be partitioned into individually sorted parts. While it was also contradicted by the fact adding a index on a single column of a multi‑column primary key, could help grouping (although later again, there was another surprise contradicting this too).

[1]: https://www.sqlite.org/queryplanner.html
Which says
> Note that Idx3 contains all the same information as the original Idx1. And so if we have Idx3, we do not really need Idx1 any more.
While reading it again, I overlooked what was next:
> your database schema should never contain two indices where one index is a prefix of the other.
My bad.


> You will have a bright idea right the first time you are asked to do
> this - you will make a list of names in alphabetical order followed by
> surnames and keep it separate, so if ever you are asked again to find
> someone by name, you can reference this second list to quickly see the
> name and surname, and perhaps use that info to find them in the
> PhoneBook and get the rest of the info. This second list is what is
> called an Index - but it is not the PRIMARY index.
>
> If you wish for all those searched to go fast, you need 3 Indices, not
> simply a 3-sectiion primary Index.
>
> Perhaps this SCHEMA would better suit your needs:
>
>    CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))
>    CREATE INDEX t_1 ON t (b);
>    CREATE INDEX t_1 ON t (c);
>
> Be careful though, every next Index will require a full datalist plus some overhead worth of space, and will make INSERTs slower because it has to insert more times and re-organize the B-Tree of every index a bit.

I guessed, and that's why I have no swapped index (although the a,b is a foreign key to another table with only a,b, which helps for some queries, but there is no table for b,a) and prefer an attempt to tweak the query on the table as‑is (for now, as I will have a third refactoring).

> Best is to decide which searches you will do, make all Indices you think will be needed, then try the queries (using explain query plan), see which Indices are used and that the speed is good, then remove those who are not used.

That's what I did, and it shown me on a query with a single `GROUP BY`, an index on "b" helped and was indeed used. Then later, it shown a variant I had the idea to test, with two nested `GROUP BY`, is running faster (an efficiency not far from that of the same query on the first column) while not using this column index at all (which I finally removed). That's how I ended with these tests and the question.



May be that's the opportunity for another question I have: given a foreign key (a,b) where "a" and "b" are more than a few bytes (not small) or are of variable size (still hopefully limited), are the values for "a" and "b" duplicated or do the foreign key creates a kind of references? (may be with an hash or even a short ID for the bigger value) If it's duplicated, then I will use integer keys instead. A bit long ago, I questioned the habit of mechanically using integers PK (and also FK), feeling using the literal values is more readable and simplifies queries' text. If my assumption is wrong (i.e. this is not by reference and there are copies every where), then I will have views for readable consultation and will bother less about more verbose queries.


I enjoyed your answer to all three of you (will look at `ANALYSE` in the documentation, suggested by Dominique).

@Igor: I won't forget to have a look at the two references you gave.


--
Yannick Duchêne
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is the first column of a composite primary key, special?

Keith Medcalf
In reply to this post by Yannick Duchêne

On Sunday, 31 January, 2016 06:54

>
> Another mystery to me. Given this test table:
>
>     CREATE TABLE t (a TEXT, b TEXT, c INTEGER, PRIMARY KEY (a, b, c))
>
> … this query:
>
>     SELECT Sum(c) FROM t GROUP BY a
>
> … executes faster than any of these two:
>
>     SELECT Sum(c) FROM t GROUP BY b
>     SELECT Sum(c) FROM t GROUP BY c
>

Your results are expected.  There is a wonderful invention in SQL called EXPLAIN which is designed to EXPLAIN things to you.  It is very simple to use, one just prepends the phrase "EXPLAIN QUERY PLAN" to the thing you want explained to you.  If you wish to know *how* the solution is implemented, then EXPLAIN by itself will give the VDBE code that needs to be executed to implement your query.

In the present example,

sqlite> explain query plan SELECT Sum(c) FROM t GROUP BY a;
0|0|0|SCAN TABLE t USING COVERING INDEX sqlite_autoindex_t_1

sqlite> explain query plan SELECT Sum(c) FROM t GROUP BY b;
0|0|0|SCAN TABLE t
0|0|0|USE TEMP B-TREE FOR GROUP BY

sqlite> explain query plan SELECT Sum(c) FROM t GROUP BY c;
0|0|0|SCAN TABLE t
0|0|0|USE TEMP B-TREE FOR GROUP BY

So you see, the first query is able to use an index (your primary key) because the data is already sorted in the order required.
Your other two queries do more work because the data is un-ordered and must be sorted in order to compute the results requested.

Or, the first does ONE operation, and the others do TWO operations.  You should therefore expect (in a rudimentary fashion) the second two queries each to take longer than the first.  (Not necessarily true since a large multistep plan can produce results much faster than a brute force method in many cases, provided that the necessary indexes are present).

> … which executes in about the same time together, proportionally to the
> number of returned rows. With `GROUP BY a`, execution times seems to be
> about half than with the two formers. Adding or not adding a `WITHOUT
> ROWID` gives the same. I give the number of rows, to show if the first one
> is faster than the second one, that's not because it would returns less
> rows, on the opposite, it returns a bit more then with grouping by `b`:
>
>  * Grouping by `a` results into 1360 rows in about 40ms +/-3;
>  * Grouping by `b` results into 1170 rows in about 65ms +/-5;
>  * Grouping by `c` results into 3154 rows in about 90ms +/-4.
>
> If the primary key declaration is removed, timing when grouping by `b` or
> `c` does not change, while timing when grouping by `a` become the same as
> with the two formers.
>
> I feel to witness this with both SQlite3 CLI and SQLiteBrowser (a detail I
> must mention after another thread).
>
> Is there any thing special with the first column of a composite primary
> key? From an implementation point of view, this may makes sense, but I
> still prefer to ask.

No, there is nothing "special" about the columns in an index.  An index which sorts by (a,b,c) is still sorted by (a,b,c), (a,b) and (a).
When you ask for a result which requires the data to be sorted by a, the index can be used to retrieve the data in the correct order.
When you ask for a result which requires the data to be sorted by b or c, the index is useless in retrieving the data in the correct order.

If you asked for WHERE a=5 GROUP BY b, the index would be useful because it would return the data in the correct order.  If you asked for WHERE a=5 GROUP BY c, then the index would be useful for finding all the candidate rows (where a=5) but would not be helpful in the GROUP BY c (unless the distribution of b were small so that a skip-scan was reasonable).

> I first noticed this another way. This test was just to check on a simpler
> case.
>
> Initially, I indirectly noticed this with something similar to this:
>
>     SELECT b, Sum(c) AS c
>       FROM
>        (SELECT b, Sum(c) AS c
>         FROM t
>         GROUP BY a, b)
>       GROUP BY b -- 60 ms on average
>
> … being faster than this second simpler alternative, something I notice
> with the test table too, just that the difference is less:
>
>     SELECT b, Sum(c) AS c
>       FROM t
>       GROUP BY b -- 65 to 70 ms on average
>
> … although the first one seems to run more operations, and it's still the
> same if I add an index on `b` for the second alternative and thus it does
> not use a temporary B‑tree for grouping.
>
>
> I also noticed some other cases where queries executes faster on the first
> column of a composite key (with or without indexes), but I won't expose
> all cases, as I'm already too lengthy.

A composite key (as in a declared primary key or unique constaint) is an index.

> Enough testing for now, I will resume the investigations on this
> unexpected results, later.




_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is the first column of a composite primary key, special?

Simon Slavin-3
In reply to this post by Yannick Duchêne

On 31 Jan 2016, at 4:39pm, Yannick Duchêne <[hidden email]> wrote:

> I saw a page (can't retrieve the URL) suggesting to order table columns by names. It was strange to me, as I had the idea of a hierarchical access for tables access. But I though “there must be a good reason for them to say this”. Then in an SQLite page [1], there was a suggestion to avoid index containing the same data as a wider index. So after these two things, I tried to imagine ways of setting up an index so that this makes sense: I though a multi‑column key could be accessed by any column, using fragments whose content are ordered.
>
> Precisely with the case of your example, I though the "name" column would be partitioned into individually sorted parts. While it was also contradicted by the fact adding a index on a single column of a multi‑column primary key, could help grouping (although later again, there was another surprise contradicting this too).

Ignore all the above.  There are rare situations where they're useful but the situation you're in is helped far more by using the phonebook analogy earlier posters used than by trying to use the above.

Think about pure SQL, and about making one ideal index for each SELECT command, and you'll get very good results.  Work out what you want the SELECT (or UPDATE, etc.) to do.  Then work out the command.  Then work out the index which would be ideal to help the command do its thing.  That's the best way to get fast SQL.

Optimizing for SQLite peculiarities (some of which no longer apply because inner workings of SQLite have changed since the article was written) is useful only very rarely.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is the first column of a composite primary key, special?

Keith Medcalf
In reply to this post by Yannick Duchêne

> May be that's the opportunity for another question I have: given a foreign
> key (a,b) where "a" and "b" are more than a few bytes (not small) or are
> of variable size (still hopefully limited), are the values for "a" and "b"
> duplicated or do the foreign key creates a kind of references? (may be
> with an hash or even a short ID for the bigger value) If it's duplicated,
> then I will use integer keys instead. A bit long ago, I questioned the
> habit of mechanically using integers PK (and also FK), feeling using the
> literal values is more readable and simplifies queries' text. If my
> assumption is wrong (i.e. this is not by reference and there are copies
> every where), then I will have views for readable consultation and will
> bother less about more verbose queries.

An index will contain the actual data from the columns being indexed, plus the row number of the table to which the entry corresponds (or, for without rowid tables, the ENTIRE PRIMARY KEY column data to use to locate the underlying record).  This is why it is called a RELATIONAL DATABASE.  The entire thing operates only on the DATA and nothing but the data and does not contain "pointers".

Hierarchical, Network, and Network Extended database models use pointers in sets rather than duplicating the data.  This makes them orders of magnitude faster (when properly designed) than a Relational Model database, but means that there is no recovery possible where a pointer-chain becomes corrupted -- with the relational model everything has a copy (multiple duplicates) of the data so you just drop the corrupted thing and re-create it.

If a column contains a bunch of repeated data (say a phone book).  You can create the structures the following ways:

create table PhoneBookEntry
(
  surname text collate nocase not null,
  firstname text collate nocase not null,
  ...
  primary key (surname, firstname)
);

This will duplicate all the surnames and firstnames in the index.  It also means that you cannot have two people with the same name.  So maybe you add something else to the key, like the address.  This is then duplicated as well in the index and now your constraint is that you cannot have two people with the same names at the same address.

If you analyzed the problem, you would note that the Phone Number is the actual primary key, and the other data are merely attributes of the phone number.  For most efficient operations you would probably end up with something like this:

create table Surnames
(
 surname_id integer primary key,
 surname text not null collate nocase unique
);

create table GivenNames
(
 given_id integer primary key,
 givenname text not null collate nocase unique
);

create table Streets
(
  street_id integer primary key,
  streetname text not null collate nocase unique
);

create table Addresses
(
 address_id integer primary key,
 street_no integer,
 suffix text collate nocase,
 street_id integer not null references Streets
);
create unique index Address_Streetid on Addresses (street_id);
create index Address_Streetno on Addresses (Street_no, Suffix);

create table PhoneDirectory
(
 surname_id integer not null references Surnames,
 given_id integer not null references GivenNames,
 address_id integer not null references Addresses,
 PhoneNumber text collate nocase primary key
);
create index PD_Surnameid on PhoneDirectory (surname_id);
create index PD_Givenid on PhoneDirectory (given_id);
create index PD_addressid on PhoneDirectory (address_id);

create view v_PhoneDirectory
as
      select Surname, Givenname, Street_No, Suffix, Streetname, PhoneNumber
        from PhoneDirectory
natural join Addresses
natural join Streets
natural join GivenNames
natural join Surnames;

Then you use the view for all your queries of the PhoneDirectory.  This works very well if the optimizer properly prunes out any unnecessary joins caused by unreferenced columns in the view.  Even if not, all queries will run very quickly even with billions of records in the PhoneDirectory.  (NATURAL JOIN is syntactic sugar creating a WHERE clause that joins on like named columns).

This database would be in BCNF normal form.  (Although the streetno and suffix ought to be moved out to a separate table(s) if you need 4th or 5th normal).  The model gets very much more complicated if you also have to handle municipal names, city names, etc.

Note that with this structure you can find all the "John" living on "Queen Street" in just nanoseconds.  Using it to produce a printed phone book, however, is not very efficient.  The structure is designed to optimize querying and eliminate update anomolies.  Since printed phone books are only produced once a year, the fact that this operation is not efficient is immaterial.




_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is the first column of a composite primary key, special?

Yannick Duchêne
On Sun, 31 Jan 2016 10:45:59 -0700
"Keith Medcalf" <[hidden email]> wrote:

> create table PhoneDirectory
> (
>  surname_id integer not null references Surnames,
>  given_id integer not null references GivenNames,
>  address_id integer not null references Addresses,
>  PhoneNumber text collate nocase primary key
> );
> create index PD_Surnameid on PhoneDirectory (surname_id);
> create index PD_Givenid on PhoneDirectory (given_id);
> create index PD_addressid on PhoneDirectory (address_id);
>
> create view v_PhoneDirectory
> as
>       select Surname, Givenname, Street_No, Suffix, Streetname, PhoneNumber
>         from PhoneDirectory
> natural join Addresses
> natural join Streets
> natural join GivenNames
> natural join Surnames;

Now I understand why `NATURAL JOIN` is qualified so, and why it has an implicit `USING` clause. That's to help with this.

> This database would be in BCNF normal form.  (Although the streetno and suffix ought to be moved out to a separate table(s) if you need 4th or 5th normal).  The model gets very much more complicated if you also have to handle municipal names, city names, etc.

The normalized forms highest of level, are to be weighted, anyway.


--
Yannick Duchêne
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is the first column of a composite primary key, special?

Yannick Duchêne
In reply to this post by Simon Slavin-3
On Sun, 31 Jan 2016 17:22:37 +0000
Simon Slavin <[hidden email]> wrote:

>
> Ignore all the above.  There are rare situations where they're useful but the situation you're in is helped far more by using the phonebook analogy earlier posters used than by trying to use the above.
>
> Think about pure SQL, and about making one ideal index for each SELECT command, and you'll get very good results.  Work out what you want the SELECT (or UPDATE, etc.) to do.  Then work out the command.  Then work out the index which would be ideal to help the command do its thing.  That's the best way to get fast SQL.
>
> Optimizing for SQLite peculiarities (some of which no longer apply because inner workings of SQLite have changed since the article was written) is useful only very rarely.
>

I agree. I'm, drifting to much far from one of my concern to stick to just SQL (and standard SQL).

In the meantime, I was looking at what SQLite do with the queries, to see if it's intuitive, if it matches enough what one would devise without a DB engine. I mean I sometime think about SQL as a data structure modelling language, and I may try to re‑implement in a classic procedural language, for an experiment. I'm also aware this point of view (SQL as a data structure modelling language) is less meaningful with complex queries, as with these, a DB engine adds values outweighing the hopes one may get from a procedural implementation, … the same if the set of queries is not fixed.

That said, thanks for the recall.

--
Yannick Duchêne
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is the first column of a composite primary key, special?

Simon Slavin-3

On 31 Jan 2016, at 6:51pm, Yannick Duchêne <[hidden email]> wrote:
>
> In the meantime, I was looking at what SQLite do with the queries, to see if it's intuitive, if it matches enough what one would devise without a DB engine. I mean I sometime think about SQL as a data structure modelling language, and I may try to re‑implement in a classic procedural language, for an experiment. I'm also aware this point of view (SQL as a data structure modelling language) is less meaningful with complex queries, as with these, a DB engine adds values outweighing the hopes one may get from a procedural implementation, … the same if the set of queries is not fixed.

It's all good.  SQLite provides two wonderful tools to help you figure out what choices it's making and why: EXPLAIN and EXPLAIN QUERY PLAN.  You can also experiment with ANALYZE which does a job which many people have wasted hours trying to do.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is the first column of a composite primary key, special?

James K. Lowden
In reply to this post by Keith Medcalf
On Sun, 31 Jan 2016 10:45:59 -0700
"Keith Medcalf" <[hidden email]> wrote:

> Hierarchical, Network, and Network Extended database models use
> pointers in sets rather than duplicating the data.  This makes them
> orders of magnitude faster (when properly designed) than a Relational
> Model database,

I was cheering you on ...


> but means that there is no recovery possible where a
> pointer-chain becomes corrupted -- with the relational model
> everything has a copy (multiple duplicates) of the data so you just
> drop the corrupted thing and re-create it.

... but I have to take exception to your characterization of
the theory.  :-(  

The relational model, as you well know, doesn't describe
implementation.  It's math.  It says what relations are, and how
they're manipulated.  One data type, and an algebra closed over that
domain.  Math doesn't have pointers or duplication or corruption; those
are computer concepts, and as such are entirely outside the model.  

For example, nothing in the model prohibits or describes using
compression to minimize I/O, or a hash of interned strings.  It's up to
the implementation to find the best way to support relational
operations, and to define "best".  SQLite, after all, supports in-memory
databases, about as fragile and unrecoverable a thing as imaginable!

In explaining the relational model, it's true Codd does mention
pointers by way of contrasting the relational model with the others you
mentioned.  Very much intentionally, the relational model consists only
of values: every join is "value based", meaning the join of A to B is
expressed strictly in terms of the values in each one.   It does not
matter if A is the "parent" and B is the "child"; the syntax is the
same either way.  

We might say Codd's hand was forced, in that *math* is value-based.  But
he emphasized it as a feature that should be manifested in the query
language, and SQL consequently was (and for the most part, still is)
value-based.  That choice is for the user's sake, because it's easier
for humans to reason about values than about pointers.  

In those old pre-relational systems -- names for which Codd had to come
up with, by the way, because they had no "model" per se, no math --
relationships between "tables" (to use a modern term) were expressed by
pointers of some kind.  The connection was manifested in the database.
If you followed it in your application, you got DBMS support, and it
was simple(ish) and fast. If you wanted an unsupported connection --
count of orders by widget, say -- you were forced to write loops, and
well advised to get coffee while the query ran.  

When we say the relational model "has no pointers", we're referring to
the user's interaction with the data.  All tables are created equal,
you might say, and all joins are the same.  That's the simplification
that permits the advice you so often give: to express the problem
logically.  Pre-relational systems offered no such option.  

We now return you to your regularly scheduled programming.  

--jkl
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is the first column of a composite primary key, special?

James K. Lowden
In reply to this post by Yannick Duchêne
On Sun, 31 Jan 2016 17:39:28 +0100
Yannick Duchêne <[hidden email]> wrote:

> I saw a page (can't retrieve the URL) suggesting to order table
> columns by names. It was strange to me, as I had the idea of a
> hierarchical access for tables access. But I though ?there must be a
> good reason for them to say this?.

On average, the quality of advice on the Internet is average, for SQL
doubly so.  Some of it is truely terrible.  

Because the advice on this list is peer-reviewed (in the sense that
people who wrote the code participate), it tends to be very good.  

My advice, ahem, is to choose chemistry over alchemy.  If you don't
understand why the advice is well founded, keep checking until you do.
If sometimes good foundations are a little mysterious, the contrary is
not true: unfounded assumptions flush out early.  

I want to answer your question a little bit abstractly, and then circle
back to SQLite.  

We know how the table is physically ordered.  But there's no WHERE
clause; the whole table will be scanned.  Building an output table for
the aggregates will be required regardless.  The only difference would
be if the cardinality of a, b, and c were differerent.  That is, if
GROUP BY A produces many fewer rows than GROUP BY B, we would expect it
to run faster. Otherwise it's an artifact of the implementation, not
something inherent in the problem.  

Yet your counts indicate the opposite: GROUP BY B is the smallest
output.  How to explain?  Or, um, EXPLAIN?  

Consider that GROUP BY A will cause each A output row to be summed
using the input (from t) sequentially.  As the input is consumed, we
move to the next output row whenever A changes.  There's no seeking
from one output row to another.  

For GROUP BY B, each input row, read sequentially, means a seek
(logically speaking) to the appropriate output row.  If the cost of
that seek is not O(1), it will add to the time used to create the
output.  Because SQLite tables are based on B+ trees, that seek cost
is O(log2 n).  

I'd say that's the source of the difference you're seeing. EXPLAIN
shows there's an optimization for GROUP BY A, probably because the
output can be constructed sequentially.  

And that's a defensible cboice, because aggregation-seeking isn't
usually a high cost (as in fact it's not in your case, either).

Aggregations by definition are reductions. Outputs are usually small,
because otherwise the GROUP BY produces incomprehensible results.
10,000 rows of aggregated output isn't unheard of, but it's unusually
many, and to find one row in 10,000 in a binary tree requires at most 15
comparisons.  More commonly outputs are in the hundreds of rows, and
need half that many.  It could be faster, but only at the expense of
code complexity and memory footprint.  

HTH, to see the lite.  

--jkl
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is the first column of a composite primary key, special?

Yannick Duchêne
In reply to this post by James K. Lowden
On Sun, 31 Jan 2016 18:15:48 -0500
"James K. Lowden" <[hidden email]> wrote:

>
> The relational model, as you well know, doesn't describe
> implementation.  It's math.  It says what relations are, and how
> they're manipulated.  One data type, and an algebra closed over that
> domain.  Math doesn't have pointers or duplication or corruption; those
> are computer concepts, and as such are entirely outside the model.  

I feel better to read this, as I had a doubt on the assertion. There is even another way to comment, bringing pointer interpretation into the model: an identity, which is fine in a relation model, is a value of some type, to which equality applies; a pointer is an identity, although an identity is not always a pointer, as a pointer is additionally a kind of array index which implicitly implies a relation to one. So “pointing to” is a reasonable interpretation of some relations, when it reads from many to one (which may be for both ways, one way only, or not at all). An identity may still sometimes be interpreted as a pointer, with only the dereferencing method differing: either relation or array index.

That said, I tried to have some meditations on whether or not I should use ID in place of values or not, while testing it in practice.

Here is how I see it (to not say all) … identities and general values, are not the same things, as while same identity implies same value, same value does not always implies same identity. The latter, to be assumed, needs to be asserted. There are also representations. Sometimes there is not really a value, just an identity which is the only thing offering sense/meaning, and what may be erroneously seen as a value is rather a representation. For the latter case, this is better to deal with the identity rather than with the representation interpreted as a “value”, because the representation may vary, so it's cleaner to use the identity rather than the “value”, to make senses. To come back to values, when the only thing which locally matters with a value is equality, it is valid to locally assert the value is the identity; this may be handy with composite values (save complexity folding a complex value) or often repeated values (*may* save storage). But this comes with an issue, at least one SQL construct prevents from referring to the value corresponding to an identity (when not just equality is needed): within indexes. Ex. table T has column "a" (an ID) and "b" (an associated value), table U has a column "c", where "c" references column "a" from T, then there is no (or I know no) way to create an index on U(c) ordered by the associated value "b" from T.


That's abstract, but the practical result is there: the request grouping on what was a second column, is now *exactly* as fast as the request grouping on what was the first column (there is no more technical differences between the two queries), and the DB size is two third of its size before this refactoring. All of this, using ID in place of *some* values … however at the cost of more verbose queries (more verbose, but not untraceable, this is just tedious).


--
Yannick Duchêne
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is the first column of a composite primary key, special?

James K. Lowden
On Tue, 2 Feb 2016 16:19:07 +0100
Yannick Duchêne <[hidden email]> wrote:

> There are also representations. Sometimes there is not really a
> value, just an identity which is the only thing offering
> sense/meaning, and what may be erroneously seen as a value is rather
> a representation.

Representation is all.  

The database -- the whole thing, front to back -- is just a model, an
approximation of the real world.  As designer, you decide how to model
that world, the "enterprise of interest".  You decide what "entities"
there are, what properties they have, how they are known (identified).

The DBMS knows nothing of what is being modelled.  It's "just" a logic
system. That logic operates on values. It doesn't distinguish between
kinds of values, between natural and surrogate keys.  That sort of thing
contributes to the understandability (and hence utility) of the model,
but they're extralogical: outside the theory.  

> All of this, using ID in place of *some* values?? however at the cost
> of more verbose queries

Back in the real world -- the computing world, where there's an actual
system implementing the math -- yes, some things are faster than
others, and some allowances have to be made for the limitations of the
implementation and the computer.  :-)  

--jkl
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users