The performance of indexed select

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

The performance of indexed select

Nick
I am trying to analysis the performance of indexed select.

CREATE TABLE t2(a INTEGER, b INTEGER, c TEXT);
CREATE INDEX t2c ON t2(c);

I think there may be much more leaf index b-tree pages whose header is
'0x0A' if the length of the content of index key 'c' is always 20-25 bytes,
as I notice the format of index inside sqlite consist of the index key and
rowid.

I can establish mapping relation between column 'c' and a new INTEGER column
'd'. Then I am wondering if it is reasonable to create new index t2(d) to
get a better performance, as sqlite stores INTEGER in a variable-length way
which means there will be less index pages.

So if it is correct that the performance of indexed select is up to the
number of index pages which is fetched in getPageNormal() within the select?
I think it has positive correlation but I do not know if it is the major
constraint.

And does sqlite have a profile tool to get call tree or execution time of
each functions? All I know is VDBE_PROFILE.

Thanks for any light you can shed.


I want to profile sqlite



--
Sent from: http://sqlite.1065341.n5.nabble.com/
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Simon Slavin-3
On 6 Jan 2018, at 3:32am, Nick <[hidden email]> wrote:

> I think there may be much more leaf index b-tree pages whose header is
> '0x0A' if the length of the content of index key 'c' is always 20-25 bytes,
> as I notice the format of index inside sqlite consist of the index key and
> rowid.

You’re overthinking it.  Establish your tables and indexes according to however the data works in your head.  Insert some plausible data into the tables. The more this data is like a fully-populated production database, the better.  Then run the SQL command "ANALYZE".

See if the results are fast enough for your intended purposes.  They probably will be.  Only if they’re not, start worrying about optimization.

Try to make SQL serve the way you want to organise your data, not the other way around.

> So if it is correct that the performance of indexed select is up to the
> number of index pages which is fetched in getPageNormal() within the select?
> I think it has positive correlation but I do not know if it is the major
> constraint.

More complicated than that.  For instance, once SQLite has found the right entry in the index it might need to look up that entry in the table to retrieve values which are not in the index.  So it might be better to make an index which contains all those values (called a "covering index").  But it might not, because that will make the index bigger, and will mean that your computer has to do a lot more disk access while searching the index.

> And does sqlite have a profile tool to get call tree or execution time of
> each functions? All I know is VDBE_PROFILE.

No.  But the shell tool has a timer which can closely time the execution of any command.  And that is a far more reliable way of knowing what will take longer or shorter in the real world than timing individual calls.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Clemens Ladisch
In reply to this post by Nick
Nick wrote:
>I am trying to analysis the performance of indexed select.
>
>CREATE TABLE t2(a INTEGER, b INTEGER, c TEXT);
>CREATE INDEX t2c ON t2(c);

Show the query that you are trying to analyze.


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Nick
In reply to this post by Simon Slavin-3
Thank you Simon.

But I am still uncertain if it is a good way to replace column 'c'.

    CREATE TABLE t2(a INTEGER, b INTEGER, d INTEGER);
or:
    CREATE TABLE t2(a INTEGER, b INTEGER, c TEXT, d INTEGER);
and then
    CREATE INDEX t2d ON t2(d);
    SELECT count(*) FROM t2 WHERE d = xx;

I find it is indeed faster than t2(c).

Or in another word, if a TEXT column has similar meaning with an INTEGER
column in my applications,(such as use userID instead of userName, still the
way that the data works in my head:) ) is it recommended to use INTEGER one
in order to get a less index pages?  


One more small question:
> For instance, once SQLite has found the right entry in the index it might
> need to look up that entry in the table to retrieve values which are not
> in the index.

I understand the execution process you said. And in my opinion, sqlite
should fetch pages when looking up the entry both in the index and then in
the table. But I only found pages with '0x0A' and '0x02' when
getPageNormal() is called during the time running select SQL. Could you give
me any advises to find the code when sqlite fetching the '0x0D' pages?

Thanks.




--
Sent from: http://sqlite.1065341.n5.nabble.com/
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Nick
In reply to this post by Clemens Ladisch
Some simple SQLs:
    SELECT count(*) FROM t2 WHERE c = xx; (or d = xx)



--
Sent from: http://sqlite.1065341.n5.nabble.com/
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Simon Slavin-3
In reply to this post by Nick


On 6 Jan 2018, at 6:41am, Nick <[hidden email]> wrote:

> I find it is indeed faster than t2(c).

If you want to know which is the best index, create all the indexes you think might be good, run ANALYZE, then use

        EXPLAIN QUERY PLAN SELECT (rest of SELECT statement here)

and see which index SQLite chooses to use.  Then you can delete the index(es) it didn’t choose.

Note that this gives accurate results only if your tables have convincing data in.  It’s not pointless, but nowhere near as accurate, if you're running it on a test database with 10 entries.

> Or in another word, if a TEXT column has similar meaning with an INTEGER
> column in my applications,(such as use userID instead of userName, still the
> way that the data works in my head:) ) is it recommended to use INTEGER one
> in order to get a less index pages?  

The correct way to arrange your data is to have userID everywhere you don’t need to know the actual name.  UserID can always remain the same but people change their names.  You wouldn’t want to have to go update your invoice table every time someone got married, would you ?

When you need to print out your invoices showing the actual name, that’s what JOIN is for.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Clemens Ladisch
In reply to this post by Nick
Nick wrote:
> Or in another word, if a TEXT column has similar meaning with an INTEGER
> column in my applications,(such as use userID instead of userName, still the
> way that the data works in my head:) ) is it recommended to use INTEGER one
> in order to get a less index pages?

Yes; an index on a smaller column needs less space, so it's faster to load
from disk.

But you should always measure how much difference it actually makes.

>> For instance, once SQLite has found the right entry in the index it might
>> need to look up that entry in the table to retrieve values which are not
>> in the index.
>
> I understand the execution process you said. And in my opinion, sqlite
> should fetch pages when looking up the entry both in the index and then in
> the table. But I only found pages with '0x0A' and '0x02' when
> getPageNormal() is called during the time running select SQL. Could you give
> me any advises to find the code when sqlite fetching the '0x0D' pages?

For count(*), the database does not need the actual table rows.
You'd have to run a query that reads some other column:

  SELECT sum(a) FROM t2 WHERE d = xx;


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Dinu
Clemens Ladisch wrote
> For count(*), the database does not need the actual table rows.

I think this is not true, he has a point here: SELECT COUNT(*) WHERE
<idx_key>=? needs to examine every index key prefix (excluding at least
ROWID) that matches. This may mean reading in the whole index.

I think b-trees can store the counts of descendant nodes for every node to
solve this issue in O(log n), but I don't see anything like it in the SQLite
format.




--
Sent from: http://sqlite.1065341.n5.nabble.com/
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Richard Hipp-3
On 1/6/18, Dinu <[hidden email]> wrote:
>
> I think b-trees can store the counts of descendant nodes for every node to
> solve this issue in O(log n), but I don't see anything like it in the SQLite
> format.

They can do that, but it also means that all the parent b-tree pages
must be updated whenever an entry is added or removed from a leaf
page, which increases the amount of I/O needed for operations like
INSERT, DELETE, or UPDATE.  It seemed to me that INSERT, DELETE, and
UPDATE are rather more common than SELECT count(*),  so I decided not
to keep a child count in the b-tree pages when I first designed the
current b-tree format for SQLite.....  back in 2003.
--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Dinu
Richard Hipp-3 wrote
> all the parent b-tree pages must be updated

Yup, no question about it, at best it could be an opt-in. But as it is a
design decision, I checked to make sure count() really is O(n) as Jonathan's
question implied.



--
Sent from: http://sqlite.1065341.n5.nabble.com/
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Simon Slavin-3
On 6 Jan 2018, at 8:42pm, Dinu <[hidden email]> wrote:

> Richard Hipp-3 wrote
>> all the parent b-tree pages must be updated
>
> Yup, no question about it, at best it could be an opt-in. But as it is a
> design decision, I checked to make sure count() really is O(n) as Jonathan's
> question implied.

It would be possible instead just to keep a count of the total number of rows in each table.   The count could be kept with the table header, or in a new column of the table structure as the current autoincrement counter

<https://sqlite.org/fileformat2.html#seqtab>

Just like the discussion up to this point, it would be very difficult to figure out, for the average user, whether doing this either way would make things significantly faster or slower.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Keith Medcalf
In reply to this post by Nick

If I understand your question correctly you have not normalized your data.  The whole point of a RELATIONAL DATABASE is that the relationships are based ON THE DATA and ONLY ON THE DATA.  If you have not normalized you data to at least BCNF you can expect terrible performance and all sorts of access and update anomalies.

https://en.wikipedia.org/wiki/Database_normalization

Your schema, assuming that c is unique text should be something like (add the COLLATE NOCASE if it is not case sensitive):

create table t3
(
  d integer primary key,
  c text [collate nocase] unique
);
create table t2
(
  a INTEGER,
  b INTEGER,
  d INTEGER REFERENCES t3
);
create index t2d on t2 (d);
+ whatever indexes you need on a and b ...

When you add a new c, you insert it into t3 and then find out what the last_rowid (d) was so you can insert a and b and d in t2.  Or if the c already exists, then you lookup d and use a, b, d to insert into t2.

populate it with data, and run ANALYZE;

select a,b,c from t3, t2
 where t2.d = t3.d
   and (your other conditions on a b and c)

Then let the query planner decide how to compute the answer ...

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.


>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>[hidden email]] On Behalf Of Nick
>Sent: Friday, 5 January, 2018 20:32
>To: [hidden email]
>Subject: [sqlite] The performance of indexed select
>
>I am trying to analysis the performance of indexed select.
>
>CREATE TABLE t2(a INTEGER, b INTEGER, c TEXT);
>CREATE INDEX t2c ON t2(c);
>
>I think there may be much more leaf index b-tree pages whose header
>is
>'0x0A' if the length of the content of index key 'c' is always 20-25
>bytes,
>as I notice the format of index inside sqlite consist of the index
>key and
>rowid.
>
>I can establish mapping relation between column 'c' and a new INTEGER
>column
>'d'. Then I am wondering if it is reasonable to create new index
>t2(d) to
>get a better performance, as sqlite stores INTEGER in a variable-
>length way
>which means there will be less index pages.
>
>So if it is correct that the performance of indexed select is up to
>the
>number of index pages which is fetched in getPageNormal() within the
>select?
>I think it has positive correlation but I do not know if it is the
>major
>constraint.
>
>And does sqlite have a profile tool to get call tree or execution
>time of
>each functions? All I know is VDBE_PROFILE.
>
>Thanks for any light you can shed.
>
>
>I want to profile sqlite
>
>
>
>--
>Sent from: http://sqlite.1065341.n5.nabble.com/
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Dinu
In reply to this post by Simon Slavin-3
I think storing index prefix counts would only make sense in a special kind
of 'statistical' index, where you would store count(x IS NOT NULL), sum(x),
sum(x^2) so that usual statistical functions can be computed optimally.

For a table count, I think it would make sense.



--
Sent from: http://sqlite.1065341.n5.nabble.com/
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Dinu
What I do notice reading https://www.sqlite.org/fileformat.html (if I get it
right) is that index lines are in the form (for an index on a,b,c ie):

<a1> <b1> <c1> <pk1>
<a1> <b1> <c1> <pk2>
<a1> <b1> <c1> <pk3>

Whereas they could be represented as:

<a1> <b1> <c1> [ <pk1>, <pk2>, <pk3> ](3)

whith [pk_list] being a subtree; reverse lookup from table record to index
record would be arguably as fast if not faster. So this would have the
advantage of more compact indexes (less data), having an index line count
(no prefix so there is always just 1 update involved), with the downside of
the complexity of an added level of indirection.



--
Sent from: http://sqlite.1065341.n5.nabble.com/
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Dinu
... and the downside that it's just linear overhead for i.e. an unique index,
it works best for indexes with low cardinality... win some, lose some :)



--
Sent from: http://sqlite.1065341.n5.nabble.com/
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: The performance of indexed select

Nick
In reply to this post by Keith Medcalf
Thank you Keith for your useful advice. I am considering to organize the
columns based on BCNF.

I guess that table t3 is needed to remove functional dependency, which means
I should use table t2 and t3 instead of one table t2 with 4 columns a-d. Is
that right?

I am not familiar with the concept BCNF, and I want to make sure that if it
is recommended to create my tables in the way you wrote.

Thanks



--
Sent from: http://sqlite.1065341.n5.nabble.com/
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users