How to store 128 bit values

How to store 128 bit values

Steve Krulewitz
Hey all --

In the application I am working on (Songbird), we have a simple two
table schema representing the tracks in your music collection and the
properties on those tracks.  The keys used are all UUIDs (128-bit
numbers) which we are currently storing in hex string form in a text
column, so they literally appear in the database as
"ee89b823-e709-40c5-bed7-dcb0b2b791a8".  We do lots of lookups and
joins on these keys.

I was wondering if there is much to be gained by storing these 128 bit
values in binary rather than as strings.  We estimate roughly 20
properties per track, so in a moderate sized database you'd have 10k
rows in the tracks table and 200k rows in the properties table.  We
construct a lot of queries that involve joining the properties table
to the tracks table multiple times.

Since the integer type is limited to 64 bits, we would have to be
creative when storing 128 bit values.  Some possibilities:

- Use a pair of integer columns
- Continue to use a text column and shove binary data into it and hope
for the best
- Use a blob column, but I wonder how efficient indexes and joins are
on this column type?
- Create another table in the schema that maps integers to UUIDs, and
use these integers in the rest of the schema rather than the UUIDs.
This would add an additional join every time we want to do a query
based on UUID.
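For concreteness, the text and blob variants above are easy to prototype with Python's built-in sqlite3 module (the table and column names below are invented for illustration, not Songbird's actual schema):

```python
import sqlite3
import uuid

u = uuid.UUID("ee89b823-e709-40c5-bed7-dcb0b2b791a8")

db = sqlite3.connect(":memory:")
# Hypothetical one-column-per-encoding table, just to compare the two forms
db.execute("CREATE TABLE guid_demo(guid_text TEXT, guid_blob BLOB)")
db.execute("INSERT INTO guid_demo VALUES (?, ?)", (str(u), u.bytes))

text_form, blob_form = db.execute(
    "SELECT guid_text, guid_blob FROM guid_demo").fetchone()
assert len(text_form) == 36   # hex string with dashes
assert len(blob_form) == 16   # raw 128 bits
assert uuid.UUID(text_form) == uuid.UUID(bytes=blob_form) == u
```

The blob form carries the same 128 bits in well under half the bytes of the hex string.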

Is it worth the trouble to make a change here?

The full schema can be found here:
http://publicsvn.songbirdnest.com/browser/trunk/components/library/localdatabase/content/schema.sql

The tracks table is called "media_items" and the properties table is
called "resource_properties".

cheers,
-steve

-----------------------------------------------------------------------------
To unsubscribe, send email to [hidden email]
-----------------------------------------------------------------------------


Re: How to store 128 bit values

Andrew Finkenstadt
On 7/11/07, Steve Krulewitz <[hidden email]> wrote:

>
> I was wondering if there is much to be gained by storing these 128 bit
> values in binary rather than as strings.  We estimate roughly 20
> properties per track, so in a moderate sized database you'd have 10k
> rows in the tracks table and 200k rows in the properties table.  We
> construct a lot of queries that involve joining the properties table
> to the tracks table multiple times.
>
> Since the integer type is limited to 64 bits, we would have to be
> creative when storing 128 bit values.  Some possibilities:
>
> - Use a blob column, but I wonder how efficient indexes and joins are
> on this column type?



My experience says that blobs are compared with memcmp(), and that indexes
based on them appear to be efficient.  *MY* blobs that are treated as unique
keys are uniformly 20-byte structures, and are bound to parameters in my
code using sqlite::execution::bind_blob_reference, which avoids copying the
blobs.  (The routine is a shim for sqlite3_bind_blob(..., SQLITE_STATIC).)
I also make sure that my blobs stay "in scope" for the duration of the
execution context.
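Andy's C++ shim has no direct Python equivalent (the sqlite3 module always copies bound parameters), but the underlying pattern of binding raw bytes as indexed blob keys can be sketched like this (table name and payloads invented for illustration):

```python
import os
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE objects(key BLOB UNIQUE, payload TEXT)")

# 20-byte keys, as in Andy's setup; random here purely for illustration
keys = [os.urandom(20) for _ in range(100)]
db.executemany("INSERT INTO objects VALUES (?, ?)",
               [(k, "payload-%d" % i) for i, k in enumerate(keys)])

# Lookups through the UNIQUE index compare the raw bytes (memcmp-style)
row = db.execute("SELECT payload FROM objects WHERE key = ?",
                 (keys[42],)).fetchone()
assert row[0] == "payload-42"
```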

andy

Re: How to store 128 bit values

Scott Hess
In reply to this post by Steve Krulewitz
On 7/11/07, Steve Krulewitz <[hidden email]> wrote:

> In the application I am working on (Songbird), we have a simple two
> table schema representing the tracks in your music collection and the
> properties on those tracks.  The keys used are all UUIDs (128 bit
> number) which we are currently storing in hex string form in a text
> column, so they literally appear in the database as
> "ee89b823-e709-40c5-bed7-dcb0b2b791a8".  We do lots of lookups and
> joins on these keys.
>
> I was wondering if there is much to be gained by storing these 128 bit
> values in binary rather than as strings.

Probably, because in a database space is the same as speed in many
cases.  Smaller databases will generally be faster because you can
cover more items in the same number of disk reads.

> - Use a pair of integer columns

Probably a bad idea.  SQLite stores integers using a variable encoding
with 7 bits of data stored per byte.  That works well for data where
the high-order bits are mostly 0 (lots of small positive integers).
This data would be expected to have uniform distribution, so you'll
probably lose on both encoding/decoding overhead and space.
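For completeness, here is a sketch of what the pair-of-integers encoding would involve; the fold into signed values is needed because SQLite integers are signed 64-bit quantities:

```python
import uuid

def uuid_to_int_pair(u):
    """Split a 128-bit UUID into two signed 64-bit integers."""
    def to_signed(x):
        return x - (1 << 64) if x >= (1 << 63) else x
    return to_signed(u.int >> 64), to_signed(u.int & ((1 << 64) - 1))

def int_pair_to_uuid(hi, lo):
    """Reassemble a UUID from its two signed 64-bit halves."""
    def to_unsigned(x):
        return x + (1 << 64) if x < 0 else x
    return uuid.UUID(int=(to_unsigned(hi) << 64) | to_unsigned(lo))

u = uuid.UUID("ee89b823-e709-40c5-bed7-dcb0b2b791a8")
hi, lo = uuid_to_int_pair(u)
assert int_pair_to_uuid(hi, lo) == u
```

As Scott notes, uniformly distributed halves defeat SQLite's varint encoding, so this buys little over a 16-byte blob.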

> - Continue to use a text column and shove binary data into it and hope
> for the best
> - Use a blob column, but I wonder how efficient indexes and joins are
> on this column type?

The blob column should be strictly more performant than the same-size
text column, because SQLite treats blob data as literal bytes, while
the text column may involve other interesting operations (though
generally doesn't).  Since in this case the blob column would be able
to store data more compactly, it is likely to win across the board.

> - Create another table in the schema that maps integers to UUIDs, and
> use these integers in the rest of the schema rather than the UUIDs.
> This would add an additional join every time we want to do a query
> based on UUID.

It really depends on how often this is happening.  SQLite tables are
organized by rowid.  An index on a table maps the indexed column to
rowid, so querying through an index involves traversing two
(generally) uncorrelated btrees.  If you have many tables with indices
on these UUID values, it might make sense to consolidate into a single
index and use rowids everywhere else.

You might even be able to get away without that index.  If you hash
the UUID and take the low-order 64 bits of the result, you could use
that as the rowid directly.  Even the low-order 32 bits might suffice.
 [Don't just take the low-order bits of the UUID directly: since you
have no idea whether they were generated appropriately, you might run into
an unacceptable level of collisions.]
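Scott's hash-to-rowid idea might look like this (MD5 is used here purely as a convenient mixing function; any decent hash would do, and the fold to a signed value is needed because rowids are signed 64-bit integers):

```python
import hashlib
import sqlite3
import uuid

def uuid_to_rowid(u):
    """Hash a UUID down to a signed 64-bit SQLite rowid."""
    x = int.from_bytes(hashlib.md5(u.bytes).digest()[:8], "big")
    return x - (1 << 64) if x >= (1 << 63) else x

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE media_items(id INTEGER PRIMARY KEY, guid BLOB)")

u = uuid.UUID("ee89b823-e709-40c5-bed7-dcb0b2b791a8")
db.execute("INSERT INTO media_items VALUES (?, ?)", (uuid_to_rowid(u), u.bytes))

# Lookup by UUID goes straight through the table's own btree, no extra index
row = db.execute("SELECT guid FROM media_items WHERE id = ?",
                 (uuid_to_rowid(u),)).fetchone()
assert uuid.UUID(bytes=row[0]) == u
```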

-scott



Re: How to store 128 bit values

D. Richard Hipp
In reply to this post by Steve Krulewitz
"Steve Krulewitz" <[hidden email]> wrote:

> Hey all --
>
> In the application I am working on (Songbird), we have a simple two
> table schema representing the tracks in your music collection and the
> properties on those tracks.  The keys used are all UUIDs (128 bit
> number) which we are currently storing in hex string form in a text
> column, so they literally appear in the database as
> "ee89b823-e709-40c5-bed7-dcb0b2b791a8".  We do lots of lookups and
> joins on these keys.
>
> I was wondering if there is much to be gained by storing these 128 bit
> values in binary rather than as strings.

Storing the UUIDs as BLOBs rather than as hex-encoded strings
will require 21 bytes per key instead of 43.  So your indices
will have about twice their current fanout.  That means that
you can expect to roughly double your lookup performance on
a cold cache.  (If the database, or large parts of it, is
already in your OS cache, the difference will be much less
and will likely not be noticeable.)

The downside of using BLOBs is that you cannot see them easily
during debugging when you do a "SELECT * FROM...".  You have
to use constructs like, "SELECT hex(uuid), ... FROM" which is
more typing.  You can fix that with a VIEW, I suppose.
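The debugging inconvenience, and the VIEW workaround, might look like this in practice (table and view names invented for illustration):

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE track(uuid BLOB, title TEXT)")
u = uuid.uuid4()  # hypothetical example row
db.execute("INSERT INTO track VALUES (?, ?)", (u.bytes, "Some Song"))

# A plain SELECT * shows the blob as opaque bytes; hex() makes it readable
readable, = db.execute("SELECT hex(uuid) FROM track").fetchone()
assert readable == u.bytes.hex().upper()

# A VIEW bakes the hex() call in once, so debugging queries stay short
db.execute(
    "CREATE VIEW track_debug AS SELECT hex(uuid) AS uuid, title FROM track")
row = db.execute("SELECT * FROM track_debug").fetchone()
assert row == (u.bytes.hex().upper(), "Some Song")
```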

The thing to consider is why you are using 128-bit UUIDs in
the first place.  Presumably you are doing this so that you
can sync and/or merge independently created databases without
having to worry about key collisions.  If you are not doing
syncs or merges or independently generated keys, then I can't
think of a good reason to use 128-bit UUIDs in the first
place.  Just use small integer autogenerated keys from SQLite.

Assuming you are doing syncing and merging, what I tend to
do in these kinds of situations is to create a mapping between
the universally unique ID (UUID) and a small integer ID that
is unique in the local database - call the latter the RID.
The RID is just an integer and can be your INTEGER PRIMARY KEY
for very fast lookups.  Looking up a record by INTEGER PRIMARY KEY
is always twice as fast as looking up the same record by any
other key, and because of high fanout can potentially be much
faster if you have a cold cache.  So I use the RID for joins
or other internal operations and only use the UUID for lookups
from external data.

For example, your schema might look something like this:

   CREATE TABLE album(
     aid INTEGER PRIMARY KEY,      -- Internally unique album ID
     uuid TEXT UNIQUE,             -- Universally unique album ID
     ...
   );
   CREATE TABLE track(
     tid INTEGER PRIMARY KEY,      -- Internally unique track ID
     uuid TEXT UNIQUE,             -- Universally unique track ID
     aid INTEGER REFERENCES album, -- Album containing this track
     ...
   );
   CREATE INDEX track_album ON track(aid);

A typical query might be to find all tracks of an album:

   SELECT * FROM track
   WHERE aid=(SELECT aid FROM album WHERE title=?)

And queries like this will run much faster using INTEGER
PRIMARY KEYS rather than UUIDs.
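A runnable sketch of that schema and query (the album and track data here are invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE album(
    aid INTEGER PRIMARY KEY,      -- Internally unique album ID (the RID)
    uuid TEXT UNIQUE,             -- Universally unique album ID
    title TEXT
  );
  CREATE TABLE track(
    tid INTEGER PRIMARY KEY,
    uuid TEXT UNIQUE,
    aid INTEGER REFERENCES album,
    name TEXT
  );
  CREATE INDEX track_album ON track(aid);
""")

db.execute("INSERT INTO album(uuid, title) VALUES ('a-1', 'Demo Album')")
aid = db.execute("SELECT aid FROM album WHERE uuid = 'a-1'").fetchone()[0]
db.executemany("INSERT INTO track(uuid, aid, name) VALUES (?, ?, ?)",
               [("t-1", aid, "Track One"), ("t-2", aid, "Track Two")])

# The join runs entirely on small integer keys; the UUIDs are only
# touched at the edges, when talking to the outside world.
names = [r[0] for r in db.execute(
    "SELECT name FROM track WHERE aid=(SELECT aid FROM album WHERE title=?) "
    "ORDER BY tid", ("Demo Album",))]
assert names == ["Track One", "Track Two"]
```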

All that said, for even the largest music collection, your
database is unlikely to have more than a few thousand albums
and a few tens of thousands of tracks, and with such a small
database and running on a modern workstation, probably just
about anything you do is going to be fast enough.  The
optimizations described above are useful if you have tables
with millions of entries or if you are doing thousands
of queries per second or if you are running on some woefully
underpowered portable music player.  But for a personal
music library running on a workstation at human-interaction
speed, use whatever schema makes it easiest for you to type
in correct and working code.

--
D. Richard Hipp <[hidden email]>





RE: How to store 128 bit values

RB Smissaert
> Looking up a record by INTEGER PRIMARY KEY is always twice as
> fast as looking up the same record by any other key

Didn't realize that, but I have a question in connection with this.
It seems that inserts on a table are faster if you have no INTEGER
PRIMARY KEY on that table and then later create an integer index on that
same field.  But that would mean that selects and joins may be slower.
Am I seeing this right, or is there any way round this, so that I can have
fast inserts and still have the INTEGER PRIMARY KEY?
Thanks for any advice.

RBS




Re: How to store 128 bit values

D. Richard Hipp
"RB Smissaert" <[hidden email]> wrote:
> > Looking up a record by INTEGER PRIMARY KEY is always twice as
> > fast as looking up the same record by any other key
>
> Didn't realize that, but I have a question in connection with this.
> It seems if you do inserts on a table it is faster if you have no INTEGER
> PRIMARY KEY on that table and then later create an integer key on that same
> field. But that would mean that selects and joins may be slower.
> Am I seeing this right or is there any way round this, so have fast inserts
> and still have the INTEGER PRIMARY KEY?

The b-trees used for tables are optimized for inserting new
entries at the end of the table, because this is the common
case.  If you have an INTEGER PRIMARY KEY, then inserts will
therefore be fastest if you insert in primary key order.  If
you do not have an INTEGER PRIMARY KEY, then SQLite makes one
up for you automatically, and the made-up primary key (the
rowid) is almost always larger than all previous rowids so
the net effect is that you are always appending and always
going quickly.
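The auto-assigned rowid behavior is easy to observe from Python's sqlite3 module (monotone growth is the usual case DRH describes here, not a hard guarantee once rows have been deleted):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t(v TEXT)")  # no INTEGER PRIMARY KEY declared

cur = db.cursor()
rowids = []
for i in range(5):
    cur.execute("INSERT INTO t VALUES (?)", ("row %d" % i,))
    rowids.append(cur.lastrowid)

# SQLite made up the rowids, and each one is larger than the last,
# so every insert appended to the end of the table's btree.
assert rowids == [1, 2, 3, 4, 5]
```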

I have not measured it, but I'm guessing that the speed
reduction caused by inserting INTEGER PRIMARY KEY out of
order is much less than the speed penalty of building a
separate index.

--
D. Richard Hipp <[hidden email]>




Re: How to store 128 bit values

Igor Tandetnik
In reply to this post by RB Smissaert
RB Smissaert <[hidden email]>
wrote:
> It seems if you do inserts on a table it is faster if you have no
> INTEGER PRIMARY KEY on that table

You _always_ have an INTEGER PRIMARY KEY on every table. It's part of
SQLite's storage mechanism.

If you don't explicitly declare one, it's still there under the name
ROWID or OID (and a few other synonyms). By explicitly declaring it, you
simply give the same column yet another name.

So no, you won't gain anything by trying to avoid this column - it is
always there whether you declare it or not.
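This is easy to verify: rowid, oid and _rowid_ all name the same column, and a declared INTEGER PRIMARY KEY simply becomes yet another alias for it:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t(id INTEGER PRIMARY KEY, v TEXT)")
db.execute("INSERT INTO t(v) VALUES ('hello')")

# All four names resolve to the same underlying 64-bit rowid
row = db.execute("SELECT rowid, oid, _rowid_, id FROM t").fetchone()
assert row == (1, 1, 1, 1)
```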

Igor Tandetnik




RE: How to store 128 bit values

RB Smissaert
In reply to this post by D. Richard Hipp
> I have not measured it, but I'm guessing that the speed reduction
> caused by inserting INTEGER PRIMARY KEY out of order is much less
> than the speed penalty of building a separate index.

Thanks.
I did measure that, and thought that the speed penalty of creating the new
index was less than the penalty of inserting with an INTEGER PRIMARY KEY,
but I will have a look at this again.  I suppose it depends a lot on the
data.  Of course I will now also need to look at the speed penalty in
selecting and joining when there is no INTEGER PRIMARY KEY.

RBS




RE: Re: How to store 128 bit values

RB Smissaert
In reply to this post by Igor Tandetnik
> So no, you won't gain anything by trying to avoid this column -
> it is always there whether you declare it or not.

But I found that inserts were faster if I didn't create the table with
an INTEGER PRIMARY KEY, so it looked like I gained there, although I
understand I might lose out somewhere else.

RBS




Re: Re: How to store 128 bit values

Steve Krulewitz
Thanks for the advice all!

cheers,
-steve
