Reducing index size

13 messages

Reducing index size

Eric Grange-3
Hi,

Is there a way to reduce the size of an index on strings/blobs?

I have key + value tables: the key is an "integer primary key autoincrement",
which is used for references in all other tables of the schema.
The values are cryptographic GUIDs (256 to 512 bits in size) with a
"compact" encoding (either base64 or a blob rather than hexadecimal strings),
but they still represent gigabytes of data.

Those tables have an index on the value, and my problem is that the size of
this index (as reported by dbstat or sqlite3_analyzer) is about the same
as the table itself.

As these are cryptographic GUIDs, the first few bytes of a value are in
practice unique, so in theory I can index just the first few bytes (using
substr()). This indeed results in a much smaller index, but it also requires
adapting all queries that search by value.

Before starting down that route, is there another way?

My searches on those indexes are either by exact value or by value prefix
(human search & auto-completion).
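For reference, the substr() route I am considering would look like this (a minimal sketch via Python's sqlite3 module; the table and column names, and the 8-byte prefix length, are illustrative, not my actual schema):

```python
import os
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE kv (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,
        value BLOB NOT NULL        -- 256..512-bit cryptographic GUID
    )
""")
# Index only the first 8 bytes instead of the whole value.
con.execute("CREATE INDEX kv_value_prefix ON kv(substr(value, 1, 8))")

rows = [(os.urandom(32),) for _ in range(1000)]
con.executemany("INSERT INTO kv(value) VALUES (?)", rows)

# Queries must repeat the indexed expression for the index to apply,
# then re-check the full value to reject prefix collisions.
target = rows[0][0]
(found_id,) = con.execute(
    "SELECT id FROM kv WHERE substr(value, 1, 8) = substr(?, 1, 8)"
    " AND value = ?",
    (target, target),
).fetchone()
print(found_id)  # 1
```

The final "AND value = ?" keeps results exact even when two values share the same 8-byte prefix.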

Eric
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: Reducing index size

Simon Slavin-3
On 30 Jul 2018, at 9:32am, Eric Grange <[hidden email]> wrote:

> As these are cryptographic GUIDs, the first few bytes of a value are in
> practice unique, so in theory I can index just the first few bytes (using
> substr()). This indeed results in a much smaller index, but it also
> requires adapting all queries that search by value.

Don't index using substr().  That would be slow because it has to keep working out substr().  Instead create another column in the table called "hash" which contains the first few bytes, and index that column instead of the full-length one.  If you define it

    hash BLOB UNIQUE

then SQLite will make up and maintain its own index on the column, which means you don't have to.  And it will check for uniqueness in case your assumption is wrong.

As for how you set that new column's value: it could be done by modifying the INSERT, or with a TRIGGER.
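A minimal sketch of that layout, with a trigger filling the prefix column (via Python's sqlite3 module; names other than "hash" are illustrative, and 8 bytes is an arbitrary prefix length):

```python
import os
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE kv (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,
        value BLOB NOT NULL,
        hash  BLOB UNIQUE   -- first 8 bytes of value; UNIQUE creates the index
    );
    -- Fill the prefix automatically so INSERT statements need not change.
    CREATE TRIGGER kv_hash AFTER INSERT ON kv BEGIN
        UPDATE kv SET hash = substr(NEW.value, 1, 8) WHERE id = NEW.id;
    END;
""")
value = os.urandom(32)
con.execute("INSERT INTO kv(value) VALUES (?)", (value,))
(row_id,) = con.execute(
    "SELECT id FROM kv WHERE hash = ?", (value[:8],)
).fetchone()
print(row_id)  # 1
```

The trigger keeps existing INSERT statements working; the cost is that the prefix is stored twice, once in the table and once in the index.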

Simon.

Re: Reducing index size

Paul Sanderson
In reply to this post by Eric Grange-3
If I understand correctly, then changing from a base64 index to a blob
containing the raw bytes would save 25%.
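For reference, base64 turns every 3 bytes into 4 characters, which is where the 25% comes from; a quick check (Python; the 256-bit key size is illustrative):

```python
import base64
import os

raw = os.urandom(32)                       # a 256-bit key
b64 = base64.b64encode(raw).rstrip(b"=")   # compact, unpadded base64
print(len(b64), len(raw))                  # 43 vs 32 bytes
print(round(1 - len(raw) / len(b64), 3))   # fraction saved by storing the blob
```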

Paul
www.sandersonforensics.com
SQLite Forensics Book <https://www.amazon.co.uk/dp/ASIN/1980293074>


Re: Reducing index size

Dominique Devienne
In reply to this post by Simon Slavin-3

But that's the rub IMHO. You're still storing that substring info twice:
once in the table, once in the index. SQLite supports function-based
indexes, but unfortunately it does not support "function-based columns"
(a.k.a. virtual columns in Oracle, or computed columns in SQL Server, I
believe. Not (yet) in Postgres).

The former allows you to get what you want, but as you wrote, you must
rewrite your queries. The latter, if supported, would allow moving the
"function definition" to the column, and indexing the vcolumn directly.
Of course, if you have a new column, you have to change your queries anyway,
but at least you'd store the substring only once, in the index, with no need
to insert that substring manually (or via a trigger, which is best avoided,
especially when used like this to store denormalized data).

Many argue views can do the same as virtual columns, but the latter are in
fact much more convenient IMHO, and allow having a single table instead of
a table/view combo. FWIW. --DD
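A sketch of that table/view combo (Python's sqlite3 module; illustrative names; whether the planner folds the view's expression into the expression index is worth confirming with EXPLAIN QUERY PLAN):

```python
import os
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE kv (
        id    INTEGER PRIMARY KEY AUTOINCREMENT,
        value BLOB NOT NULL
    );
    CREATE INDEX kv_prefix ON kv(substr(value, 1, 8));
    -- The view plays the role of the missing "computed column".
    CREATE VIEW kv_v AS
        SELECT id, value, substr(value, 1, 8) AS prefix FROM kv;
""")
value = os.urandom(32)
con.execute("INSERT INTO kv(value) VALUES (?)", (value,))
(row_id,) = con.execute(
    "SELECT id FROM kv_v WHERE prefix = ? AND value = ?",
    (value[:8], value),
).fetchone()
print(row_id)  # 1
```

If the view is flattened as hoped, queries only ever mention "prefix", while the prefix bytes themselves live in the index alone.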

Re: Reducing index size

Eric Grange
In reply to this post by Paul Sanderson
@Simon Slavin
> Don't index using substr().  That would be slow because it has to keep
> working out substr().

I gave it a try, but that grows the size of the tables, and would require a
full update of the key/value table, so it is not something that can be
deployed without a lot of I/O.

The substr() should not be a problem; I am ready to trade CPU time for disk
space, and I was planning to use a check() constraint to ensure uniqueness
without an explicit unique index (I can probably live with the odd
duplicates in the substr value, if the space gained is worth it).

However, the crux of the issue is that all queries doing a "value = xxx"
will have to be rewritten to something like
"(substr(value, 1, 8) = substr(xxx, 1, 8) and value = xxx)", and then
checked in the query plan to verify that the substr index is actually used.
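Checking that the substr index is actually used can at least be scripted; a sketch (Python's sqlite3 module, illustrative schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE kv (id INTEGER PRIMARY KEY, value BLOB NOT NULL);
    CREATE INDEX kv_prefix ON kv(substr(value, 1, 8));
""")
# EXPLAIN QUERY PLAN rows end with a human-readable "detail" column.
plan = con.execute(
    "EXPLAIN QUERY PLAN"
    " SELECT id FROM kv WHERE substr(value, 1, 8) = substr(?, 1, 8)"
    " AND value = ?",
    (b"x" * 32, b"x" * 32),
).fetchall()
uses_index = any("kv_prefix" in row[-1] for row in plan)
print(uses_index)
```

A query whose plan stops mentioning the prefix index after a schema or query edit can then fail a regression test instead of silently falling back to a full scan.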

So, if there is a "better" way... :)

@Paul Sanderson
> If I understand correctly then changing from a base64 index to a blob
> containing the raw bytes would save 25%

Yes it would, but the problem is with human searches / auto-completion:
some keys are displayed as base64, so humans type in the first base64
characters. It might be possible to do a partial base64 decoding, and then
filter on re-encoded base64, but that would be quite complex for "just" a
25% size gain :/

Indexing on the string start still allows the filtering to occur with
"substr(value, 1, 8) between x and y", for instance.

Eric



Re: Reducing index size

Rowan Worth-2
In reply to this post by Dominique Devienne
What if you could create a "lite" index, which stores just the rowids in a
particular order and refers back to the table for the rest of the column
data? The ordering it provides would still allow you to binary search as
usual. But the worst case might be too expensive (i.e. if pages are thrown
out of cache before being re-used), especially as it would have to be paid
every time you look up the index...

-Rowan

Re: Reducing index size

Simon Slavin-3
In reply to this post by Dominique Devienne
On 30 Jul 2018, at 10:25am, Dominique Devienne <[hidden email]> wrote:

> The former allows you to get what you want, but as you wrote, you must
> rewrite your queries. The latter,
> if supported, would allow to move the "function definition" to the column,
> and index the vcolumn directly.

It's the usual speed vs. filespace issue.  And as usual it comes down to:

    how speed-critical searches are,
    how often you do searches, and
    the relative frequency of making changes vs. searching.

A later post from OP suggests that the critical point here really is filespace, not search time, so my previous recommendation is perhaps not appropriate for this situation.

Simon.

Re: Reducing index size

Eric Grange
@Dominique Devienne
> SQLite supports function-based indexes, but unfortunately it does not
> support "function-based columns".

Far-fetched maybe, but could a virtual table or table-valued functions be
used to provide that?

I.e. use the virtual table to pass data directly to an index, and then
expose the data stored in the index as a column. (I am not knowledgeable
enough about SQLite indexes to know how far-fetched the above is.)

@Rowan Worth
> What if you could create a "lite" index, which stores just the rowids in
> a particular order and refers back to the table for the rest of the
> column data?

As I have millions of rows, and data could get inserted anywhere in that
index (given the values are essentially random), maintaining such an index
would not be light work.

@Simon Slavin
> A later post from OP suggests that the critical point here really is
> filespace, not search time

Yes, this is because in the rest of the schema I am always using the
(integer) key, so the value -> key lookup only happens for API parameters,
never in joins or other internal processing.

The key -> value lookups need to be faster, as they happen in SQL outputs,
but at worst each query will be doing thousands of those. So a little loss
of performance could be acceptable there as well.

Eric



Re: Reducing index size

Rowan Worth-2
On 30 July 2018 at 17:53, Eric Grange <[hidden email]> wrote:

> @Rowan Worth
> > What if you could create a "lite" index, which stores just the rowids in
> a particular order and
> > refers back to the table for the rest of the column data?
>
> As I have millions of rows, and data could get inserted anywhere in that
> index (given the values are
> essentially random), maintaining such an index would not be light work.
>

Doesn't that problem already exist with the current index? Except worse,
because it's storing the cryptographic hash *and* the rowid.

I wasn't suggesting that you manage such an index yourself, FWIW, but
proposing it as a potential future feature for SQLite: a simple mechanism
to control the space used by an index.

-Rowan

Re: Reducing index size

Eric Grange
@Rowan Worth
> Doesn't that problem already exist with the current index? Except worse
> because it's storing the cryptographic hash *and* the rowid.

No, because SQLite is using a B-Tree (and with cryptographic hashes, it
should even take less effort to balance).




Re: Reducing index size

Rowan Worth-2
On 30 July 2018 at 18:10, Eric Grange <[hidden email]> wrote:

> @Rowan Worth
> > Doesn't that problem already exist with the current index? Except worse
> > because it's storing the cryptographic hash *and* the rowid.
>
> No, because SQLite is using a B-Tree (and with cryptographic hashes, it
> should even take less effort to balance)
>

Of course. My suggestion is to retain the same B-Tree structure, but only
store rowids in there. The B-Tree would still be ordered according to the
hash data - it would just take an indirect lookup to get at that data.

I'm not super familiar with B-Trees, e.g. how many nodes need to be touched
when rebalancing, so it could well be a completely infeasible approach
performance-wise :) I feel like it just depends on whether you can keep
enough pages in cache to avoid constant I/O.

I wonder how FTS implements prefix search...
-Rowan

Re: Reducing index size

Donald Griggs
In reply to this post by Eric Grange-3
There's a good chance this comment won't be useful to you, Eric.
Nevertheless,

Any chance of relaxing your space requirement? I.e., what bad things happen
if the space is not reduced?

Maybe you're writing for a fixed-space embedded device, which nonetheless
has space for the gigabytes required, and you can't expand without a chip
change for all your customers? If not, would adding, say, $3 worth of
additional disk space perhaps solve things?

Also: if you implement your index on the first several bytes of your
cryptographic hash, then since you have millions of them per gigabyte, am I
right that in addition to modifying your queries to extract the substrings,
you'd have the work of handling substring collisions?

One of the main reasons for using a database is to automate the creation
and use of efficient search algorithms, with the tradeoff of more
expenditure on disk space.

I hope you find an acceptable solution,
   Donald






Re: Reducing index size

Eric Grange
> Maybe you're writing for a fixed-space embedded device, which nonetheless
> has space for the gigabytes required

Actually I am at the other end of the spectrum: I am running out of SSD
space on the server side, with the higher SSD storage tiers being quite a
bit more expensive. I am also leaving the realm of "baseline" bandwidth
requirements for timely backup & restore.

Cutting down on redundancy could give a couple of years of breathing room,
hopefully enough for terabyte SSDs and 10 Gbps networks to become more
widespread ;)

> Am I right that in addition to modifying your queries to extract the
> substrings, you'd have the work of handling substring collisions?

Yes, that means replacing a single "=" test in SQL with two (one for
filtering, one for ensuring the match).

Cutting down to 64 bits is enough in practice to make substring collisions
very unlikely, while dividing the index size by 4 to 8. I ran some tests,
and even 32-bit substrings have few enough collisions not to be a problem
performance-wise (the lookup is just the first step of typically far more
complex queries, so even if it is 10 times slower, it does not matter much).
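The collision rates back this up. With uniformly random values, the expected number of colliding pairs among n rows on a k-bit prefix is roughly n*(n-1)/2^(k+1) (the birthday bound); a quick estimate, assuming for instance 10 million rows:

```python
# Birthday-bound estimate of colliding prefix pairs.
n = 10_000_000  # illustrative row count

for bits in (32, 64):
    expected = n * (n - 1) / 2 ** (bits + 1)
    print(f"{bits}-bit prefix: ~{expected:.2g} colliding pairs")
```

So at this scale a 64-bit prefix virtually never collides, while 32 bits merely adds the occasional extra row for the full-value re-check to reject.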

