Sqlite Sharding HOWTO

Sqlite Sharding HOWTO

Gerlando Falauto
Hi,

I'm totally new to sqlite and I'd like to use it for some logging
application on an embedded
linux-based device.  Data comes from multiple (~10), similar sources at a
steady rate.
The rolling data set would be around 20 GB; that much storage would
suffice to retain data from the previous week or so.

Reading the documentation at https://www.sqlite.org/whentouse.html seems
to suggest the use of sharding:

> Concurrency is also improved by "database sharding": using separate
> database files for different subdomains. For example, the server might
> have a separate SQLite database for each user, so that the server can
> handle hundreds or thousands of simultaneous connections, but each
> SQLite database is only used by one connection.

In my case I would be doing sharding on the data source and/or the day of
the timestamp, so as to have individual files of a few hundred MB each.
This way, deleting the oldest data would be as easy as deleting the
corresponding file.

However, I did not find any reference whatsoever on sharding being
available _within_ sqlite.
Ideally, I would like to have a way of "seeing" the whole dataset with a
single query spanning all  available databases.

Would that be at all feasible? I saw the "attach database" statement,
which seems closely related, but whose use case I honestly don't get.
If not, is there any state-of-the-art adapter layer that would perform
(and hide) the underlying sharding? I don't really care about query
performance (e.g. if such a global query spanning 20 different databases
is indeed performed serially, thereby taking 20 times longer); I just
need a way of hiding this detail.

I saw some reference to SPHiveDB
https://www.mail-archive.com/sqlite-users@.../msg43575.html
but the project looks stale (9 years since the last commit).

I also looked into AtomDB but it looks overly complicated for my use-case
(single, embedded server), plus it somehow requires the underlying sharding
to be totally exposed.

Any ideas?
Gerlando

Re: Sqlite Sharding HOWTO

R Smith-2

On 2018/07/29 10:34 AM, Gerlando Falauto wrote:
> Hi,
>
> I'm totally new to sqlite and I'd like to use it for some logging

Welcome Gerlando. :)

> In my case I would be doing sharding on the data source and/or the day of
> the timestamp, so as to have individual files of a few hundred MB each.
> This way, deleting the oldest data would be as easy as deleting the
> corresponding file.

I think you are perhaps missing a core idea here - the only use-case
that requires sharding is where you have very high write-concurrency
from multiple sources, and even then, the sharding, in order to have any
helpful effect, needs to distinguish "write sources", not events or
time-frames or such.

SQLite will very happily run a 20GB (or much larger) database written to
from many sources, and very happily delete old data from it and pump new
data in without much in the way of "help" needed, AND then produce fast
queries without much fanfare.

The question that needs to be answered specifically is: How many data
input sources are there? As in, how many Processes will attempt to write
to the database at the same time? Two processes can obviously NOT write
at the same time, so if a concurrent write happens, one process has to
wait a few milliseconds. This gets compounded as more and more write
sources are added or as write-frequency increases.

If a single process is writing data to a single DB from many different
sources, there is zero reason for sharding. If many processes are
running all with their own connection to the DB, AND they have high
concurrency (i.e. high frequency updates from many DB connections which
heightens the incidence of simultaneous write attempts to a single DB
file) then it starts becoming a good idea to allocate two or more DB
files so that we split the connections between those files, effectively
lowering the write-collision frequency for a single file.
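
As an aside, the waiting itself is easy to automate: set a busy timeout
so a colliding writer retries instead of failing with SQLITE_BUSY, and
use WAL mode so readers do not block the writer. A minimal sketch of the
per-connection setup:

  PRAGMA journal_mode=WAL;   -- readers and the single writer no longer block each other
  PRAGMA busy_timeout=5000;  -- on a write collision, retry for up to 5000 ms before giving up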

Incidentally, all DBs offer some form of concurrency alleviation (load
balancing, partitioning, etc.) which often also serves other purposes.

To get to the point: With the above in mind, do you still feel like you
need to go the sharding route? Could you perhaps quote figures such as
how many bytes would a typical data update be? How many updates per
second, from how many different Processes?  Do you have a maintenance
Window (as in, is there a time of day or on a weekend or such that you
can go a few minutes without logging so one can clean up the old logs,
perhaps Vacuum and re-Analyze)?
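
By way of illustration only, such a cleanup pass could be as simple as
this (a sketch - the log table and its integer Unix-epoch ts column are
made-up names):

  DELETE FROM log WHERE ts < CAST(strftime('%s','now','-7 days') AS INTEGER);
  ANALYZE;   -- refresh the query planner's statistics after the big delete
  VACUUM;    -- rebuild the file to reclaim the freed pages (needs a quiet window)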

This will allow much better advice, and someone on here is bound to
already have something just like that running and will be able to
quickly give some experience hints.



Re: Sqlite Sharding HOWTO

Abroży Nieprzełoży
In reply to this post by Gerlando Falauto
> Ideally, I would like to have a way of "seeing" the whole dataset with a
> single query spanning all  available databases.

I think swarmvtab may be helpful.
https://www.sqlite.org/swarmvtab.html
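
Roughly like this, going by that page (a sketch only - swarmvtab ships as
a loadable extension in ext/misc/unionvtab.c, the virtual table must be
created in the temp schema, all shard tables must share one schema, and
the SELECT you pass it returns, for each shard, the file name, the table
name and the rowid range it holds; check the page for the exact
interface, and note all names below are invented):

  -- an index of the shard files and the rowid range each one covers
  CREATE TABLE shard_index(file TEXT, tbl TEXT, minrow INTEGER, maxrow INTEGER);
  INSERT INTO shard_index VALUES
    ('log_20180723.db', 'log',       1, 1000000),
    ('log_20180724.db', 'log', 1000001, 2000000);

  CREATE VIRTUAL TABLE temp.all_logs
    USING swarmvtab('SELECT file, tbl, minrow, maxrow FROM shard_index');

  -- one query spanning all shards; files are only opened as needed
  SELECT * FROM all_logs WHERE rowid BETWEEN 999990 AND 1000010;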



Re: Sqlite Sharding HOWTO

Gerlando Falauto
In reply to this post by Gerlando Falauto
Hi Ryan,

thank you for your reply.

>I think you are perhaps missing a core idea here - the only use-case
>that requires sharding is where you have very high write-concurrency
>from multiple sources, and even then, the sharding, in order to have any
>helpful effect, needs to distinguish "write sources", not events or
>time-frames or such.

"Write sources" *could* be different in my case (not in the first
version though).

>SQLite will very happily run a 20GB (or much larger) database written to
>from many sources, and very happily delete old data from it and pump new
>data in without much in the way of "help" needed, AND then produce fast
>queries without much fanfare.

I sort of realized that in the meantime (at least for the fast queries part) ;-)

>The question that needs to be answered specifically is: How many data
>> input sources are there? As in, how many Processes will attempt to write
>to the database at the same time? Two processes can obviously NOT write
>at the same time, so if a concurrent write happens, one process has to
>wait a few milliseconds. This gets compounded as more and more write
>sources are added or as write-frequency increases.

When you say "a few milliseconds", how exactly do you measure that?

What is the timeframe where the second process has to wait if the
first one is doing a batch of a thousand writes?


>If a single process is writing data to a single DB from many different
>sources, there is zero reason for sharding. If many processes are
>running all with their own connection to the DB, AND they have high
>concurrency (i.e. high frequency updates from many DB connections which
>heightens the incidence of simultaneous write attempts to a single DB
>file) then it starts becoming a good idea to allocate two or more DB
>files so that we split the connections between those files, effectively
>lowering the write-collision frequency for a single file.

>Incidentally, all DBs offer some form of concurrency alleviation (load
>balancing, partitioning, etc.) which often also serves other purposes.

>To get to the point: With the above in mind, do you still feel like you
>need to go the sharding route? Could you perhaps quote figures such as
>how many bytes would a typical data update be? How many updates per
>second, from how many different Processes?  Do you have a maintenance
>Window (as in, is there a time of day or on a weekend or such that you
>can go a few minutes without logging so one can clean up the old logs,
>perhaps Vacuum and re-Analyze)?

In the current use case there's a single process. The way I see it, in
the near future it would probably increase to 3-4 processes,
each doing 10-100 writes per second or so. Each write would be around
1KB-20KB (one single text field, I guess).
I wonder if writing data in batches would be helpful though.

But yeah, I guess sharding makes much less sense right now. :-)

>This will allow much better advice, and someone on here is bound to
>already have something just like that running and will be able to
>quickly give some experience hints.

Thank you again!

Gerlando

Re: Sqlite Sharding HOWTO

Keith Medcalf

>In the current use case there's a single process. The way I see it, in
>the near future it would probably increase to 3-4 processes,
>each doing 10-100 writes per second or so. Each write would be around
>1KB-20KB (one single text field, I guess).
>I wonder if writing data in batches would be helpful though.

Why do you think you want more than one process writing to the database?

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.





Re: Sqlite Sharding HOWTO

R Smith-2
In reply to this post by Gerlando Falauto
On 2018/07/30 12:39 AM, Gerlando Falauto wrote:

>
>> The question that needs to be answered specifically is: How many data
>> input sources are there? As in, how many Processes will attempt to write
>> to the database at the same time? Two processes can obviously NOT write
>> at the same time, so if a concurrent write happens, one process has to
>> wait a few milliseconds. This gets compounded as more and more write
>> sources are added or as write-frequency increases.
> When you say "a few milliseconds", how exactly do you measure that?
>
> What is the timeframe where the second process has to wait if the
> first one is doing a batch of a thousand writes?

This is very hard to say, hence my other questions in this regard. The
time needed depends firstly on how much data needs to be inserted; then,
how many inserts are required (if batched); then, how many Indexes you
have on the table (each Index needs to be modified for each insert);
then, whether the inserts contain only data or perhaps functions that
need to be calculated (especially UDFs); then, whether there are any
triggers that need to run upon Inserting; and, most importantly of all,
how quick is the hardware that has to drink all this data?  (The
hardware IO is the typical bottleneck).

A query doing a single insert of a few bytes with no Indexes, no
triggers, no functions will be stupendously fast, whereas any increase
in one or more of the above will slow things down. How much exactly is
something you need to test; any guesswork will not be useful. What I can
say is that if you don't have any Unique Indexes or triggers, the insert
speed /should/ not change much with size.
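
A quick way to run such a test is the sqlite3 shell's built-in timer - a
sketch (compare runs with and without your real indexes and triggers in
place):

  .timer on
  CREATE TABLE t(payload TEXT);
  BEGIN;
  INSERT INTO t(payload)
    WITH RECURSIVE n(i) AS (SELECT 1 UNION ALL SELECT i+1 FROM n WHERE i < 10000)
    SELECT printf('row %d of dummy payload', i) FROM n;
  COMMIT;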

>
>> To get to the point: With the above in mind, do you still feel like you
>> need to go the sharding route? Could you perhaps quote figures such as
>> how many bytes would a typical data update be? How many updates per
>> second, from how many different Processes?  Do you have a maintenance
>> Window (as in, is there a time of day or on a weekend or such that you
>> can go a few minutes without logging so one can clean up the old logs,
>>> perhaps Vacuum and re-Analyze)?
> In the current use case there's a single process. The way I see it, in
> the near future it would probably increase to 3-4 processes,

Why? Are you adding more logging sources? Or is this just a
feeling that more processes would handle things better in some way?

> each doing 10-100 writes per second or so. Each write would be around
> 1KB-20KB (one single text field, I guess).
100 x 20KB inserts = 2MB Inserted x 4 processes = 8MB per second
inserted issuing around 400 file-locks... Let's assume only 1 Index
(apart from the rowid) and no triggers, then this should be easily
handled, even on slower hardware. This is significant enough though that
it should be tested before-hand with real-world operating parameters in
place.

> I wonder if writing data in batches would be helpful though.
Most certainly. Less file-lock operations is better. Plan anything you
do to collect logs together and fire them into the DB at say 1 to 5
second intervals in single transactions.
How to decide the interval: How much data-time can you afford to lose if
the system crashes? How fast is the data needed downstream by something
reading the DB?
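
In sketch form (table and column names invented), each flush is just:

  BEGIN;
  INSERT INTO log(source, ts, payload) VALUES ('src1', 1532904000, '...');
  INSERT INTO log(source, ts, payload) VALUES ('src2', 1532904000, '...');
  -- ...every row collected during the interval...
  COMMIT;  -- one lock/journal/sync cycle covers the whole batch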

> But yeah, I guess sharding makes much less sense right now. :-)

Agreed.

Cheers,
Ryan


Re: Sqlite Sharding HOWTO

Simon Slavin-3
In reply to this post by Gerlando Falauto
On 29 Jul 2018, at 11:39pm, Gerlando Falauto <[hidden email]> wrote:

> In the current use case there's a single process. The way I see it, in
> the near future it would probably increase to 3-4 processes,
> each doing 10-100 writes per second or so. Each write would be around
> 1KB-20KB (one single text field, I guess).
> I wonder if writing data in batches would be helpful though.

The critical path in SQLite write operations is your storage device.  Your entire database is on one drive, your CPU has one path to it, and the processing code in SQLite is very efficient.  Consequently most of the waiting time during API execution is spent waiting for the hardware to say "Okay, I've written that."  The number of indexes can make a big difference, since each one increases the number of database pages which must be written.

So you can make a good guess whether sharding, multi-processing, or other parallel shenanigans will help.  Just imagine it's 10% processing and 90% hardware access.

Simon.

Re: [EXTERNAL] Sqlite Sharding HOWTO

Hick Gunter
In reply to this post by Gerlando Falauto
We almost exclusively use virtual tables in our application, and this includes virtual table code to access Faircom CTree files and in-memory data dictionaries. The structure (fields, indexes) of these tables is fixed (and identical for corresponding CTree and DD tables), with sharding achieved by "cloning" the common structure into separate tables, with the "clone parameters" that describe the value(s) of certain key field(s) being present in the name. E.g. the table named customer_PA would contain only customers from Pennsylvania. Creating a simple view (SELECT * FROM customer_PA UNION ALL SELECT * FROM customer_NY ...) has the drawback of accessing all member tables, even if the constraints would require searching only one table. It also requires that all tables be contained in the same database.

Our solution is a "partition" provider that knows about "member tables" and "clone parameters" and can handle "partition constraints" as well as ordered (merge) and unordered (sequential) retrieval. The name of the "partition" table does not include any "clone parameters" (e.g. customer).

So "SELECT * FROM customer;" will internally do "SELECT * FROM customer_NY;" followed by "SELECT * FROM customer_PA;" because the member table has 2 entries ('customer','customer_NY'), ('customer','customer_PA').

But "SELECT * FROM customer WHERE ... state = 'NY';" would determine that the "clone parameter" state only matches table customer_NY and therefore only query that table.

And "SELECT * FROM customer ... ORDER BY name;" would prepare identical statements against both tables, fetch a record from each and return the "smaller" one (because the virtual table supports indexing by name, the xBestIndex method can tell SQLite that it can handle this kind of query and sets the "orderByConsumed" flag; if the ODER BY expression cannot be handled via an index, it goes back to sequential execution and lets SQLite do the sorting). An n-way merge is implemented as a binary tree to minimize comparisons.

A similar approach may be possible with native tables that reside in different native database files (limited by the maximum number of concurrently attached databases).
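
In sketch form, sticking with the customer example (file names invented):
attach the member databases on one connection and wrap them in a temp
view, which is allowed to reference any attached schema:

  ATTACH DATABASE 'customer_ny.db' AS ny;
  ATTACH DATABASE 'customer_pa.db' AS pa;

  CREATE TEMP VIEW customer_all AS
    SELECT * FROM ny.customer
    UNION ALL
    SELECT * FROM pa.customer;

  -- works, but note the drawback above: both member tables are scanned
  SELECT * FROM customer_all WHERE state = 'NY';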



Re: Sqlite Sharding HOWTO

Gerlando Falauto
In reply to this post by R Smith-2
On Mon, Jul 30, 2018 at 1:58 AM, R Smith <[hidden email]> wrote:

> [...] What I can say is that if you
> don't have any Unique Indexes or triggers, the insert speed /should/ not
> change much with size.

I see, thanks for the heads up. What's with Unique Indexes though? Why are
they any worse than a regular index?



>
>> In the current use case there's a single process. The way I see it, in
>> the near future it would probably increase to 3-4 processes,
>
> Why? Are you adding more logging sources? Or is this just a feeling
> that more processes would handle things better in some way?
>

Could be adding more logging sources in the near future.



>
> each doing 10-100 writes per second or so. Each write would be around
>> 1KB-20KB (one single text field, I guess).
>>
> 100 x 20KB inserts = 2MB Inserted x 4 processes = 8MB per second inserted
> issuing around 400 file-locks... Let's assume only 1 Index (apart from the
> rowid) and no triggers, then this should be easily handled, even on slower
> hardware. This is significant enough though that it should be tested
> before-hand with real-world operating parameters in place.
>

I see.


> I wonder if writing data in batches would be helpful though.
>>
> Most certainly. Less file-lock operations is better. Plan anything you do
> to collect logs together and fire them into the DB at say 1 to 5 second
> intervals in single transactions.
> How to decide the interval: How much data-time can you afford to lose if
> the system crashes? How fast is the data needed downstream by something
> reading the DB?
>
>
1 second would be my educated guess.
While on the topic, I'm still not 100% sure I got the table schema right,
perhaps you can shed some more light.
The data I'd like to collect would be some sampled sensor data (sampling
rate being 4000Hz in the worst case).
One approach would be to timestamp each and every sample. This would be
very convenient for subsequent analysis, but I'm afraid it would kill
performance and waste a lot of disk space.
A different approach would be to produce way fewer rows (let's say, again,
one per second) and storing a whole bunch of samples in a single record.
But in this case I should find some way of
a) storing this data (JSON, BLOB...)
b) flattening (or pivoting?) it back so that consecutive rows could be
analyzed in sequence as a huge set of samples (see the sketch below)
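
For b) I was imagining something like the JSON1 extension, assuming it is
compiled in (a rough sketch, all names made up):

  -- one row per second, the whole second's samples packed as a JSON array
  CREATE TABLE readings(ts INTEGER, samples TEXT);
  INSERT INTO readings VALUES (1532904000, '[0.11, 0.12, 0.13]');

  -- flatten back to one row per sample; for arrays, json_each's key is the element index
  SELECT r.ts, j.key AS sample_no, j.value AS sample
  FROM readings r, json_each(r.samples) j
  ORDER BY r.ts, sample_no;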

Any suggestion about that?

Thanks again!
Gerlando




Re: Sqlite Sharding HOWTO

Keith Medcalf

>> A query doing a single insert of a few bytes with no Indexes, no
>> triggers, no functions will be stupendously fast, whereas any
>> increase in one or more of the above will slow things down.
>> How much exactly is something you need to test, any guesswork
>> will not be useful. What I can say is that if you don't have
>> any Unique Indexes or triggers, the insert speed /should/ not
>> change much with size.

>I see, thanks for the heads up. What's with Unique Indexes though?
>Why are they any worse than a regular index?

All indexes are unique indexes, because the rowid is appended to the end of the key (so the original row can be found), thus making every entry unique.

For an index that is "unique" in the payload (ie, exclusive of the rowid) a seek of the index must be done on insertion to check for duplicate payloads so that a uniqueness violation can be thrown.  This is not required for a non-unique index since the addition of the rowid to the payload ensures uniqueness of the key -- unless the datastore is corrupt, of course, but we will ignore that for now.

That is, for

  CREATE INDEX A ON B (C,D)

the row can be inserted in the index without checking for a uniqueness
violation, since the actual key is (C, D, rowid), whereas

  CREATE UNIQUE INDEX A ON B (C,D)

requires that the key (C, D) be looked up to ensure there is not a
uniqueness violation.

so for an index declared to be unique an extra b-tree search is required and that takes a wee bit of time ...

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.







Re: Sqlite Sharding HOWTO

Gerlando Falauto
On Mon, Jul 30, 2018 at 9:19 PM, Keith Medcalf <[hidden email]> wrote:

>
> [...]
> so for an index declared to be unique an extra b-tree search is required
> and that takes a wee bit of time ...
>

I see, thanks for the explanation.
Does that apply to the primary key as well?
I'm assuming setting some columns as primary key will implicitly create a
unique index, but please correct me if I'm wrong.
Long time since my database class...

Thanks again!

Re: Sqlite Sharding HOWTO

David Raymond
In reply to this post by Keith Medcalf
Doesn't sound quite right to me.

No matter the index you have to search through it to find the spot to do the insert. Both are going to do that search only once. An insert on a unique index isn't going to search through it for existence, then promptly forget what it just did and do it all over again to do the insert. It's going to start the insert and find the spot where the new item would go. If the spot's free it succeeds, if it's taken then it fails. There is no need for a second search.



Re: Sqlite Sharding HOWTO

Gerlando Falauto
On Mon, Jul 30, 2018 at 9:42 PM, David Raymond <[hidden email]>
wrote:

> Doesn't sound quite right to me.
>
> No matter the index you have to search through it to find the spot to do
> the insert. Both are going to do that search only once. An insert on a
> unique index isn't going to search through it for existence, then promptly
> forget what it just did and do it all over again to do the insert. It's
> going to start the insert and find the spot where the new item would go. If
> the spot's free it succeeds, if it's taken then it fails. There is no need
> for a second search.
>
>
Hence my original question: why would a Unique Index be any worse?

Re: Sqlite Sharding HOWTO

David Raymond
I don't believe it is any worse.


Question for devs: Would it be considered an optimization opportunity to push UNIQUE index inserts to the front, so that if something's going to fail then it fails sooner rather than later?

In this oversimplified example the explain output shows it does the unique check on the rowid first, which is good, but then does the normal index insert before the unique index insert. If there's going to be a uniqueness violation then the time spent on the non-unique index inserts would be "wasted" time, albeit a very small amount of it.


SQLite version 3.24.0 2018-06-04 19:24:41
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

sqlite> create table tbl (id integer primary key, foo unique, bar);
QUERY PLAN
`--SEARCH TABLE sqlite_master USING INTEGER PRIMARY KEY (rowid=?)

sqlite> create index bar_idx on tbl (bar);

sqlite> explain insert into tbl values (1, 2, 3);
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     30    0                    00  Start at 30
1     OpenWrite      0     2     0     3              00  root=2 iDb=0; tbl
2     OpenWrite      1     4     0     k(2,,)         00  root=4 iDb=0; bar_idx
3     OpenWrite      2     3     0     k(2,,)         00  root=3 iDb=0; sqlite_autoindex_tbl_1
4     Integer        1     1     0                    00  r[1]=1
5     NotNull        1     7     0                    00  if r[1]!=NULL goto 7
6     NewRowid       0     1     0                    00  r[1]=rowid
7     MustBeInt      1     0     0                    00
8     SoftNull       2     0     0                    00  r[2]=NULL
9     Integer        2     3     0                    00  r[3]=2
10    Integer        3     4     0                    00  r[4]=3
11    Noop           0     0     0                    00  uniqueness check for ROWID
12    NotExists      0     14    1                    00  intkey=r[1]
13    Halt           1555  2     0     tbl.id         02
14    Noop           0     0     0                    00  uniqueness check for bar_idx
15    Affinity       2     1     0     D              00  affinity(r[2])
16    SCopy          4     6     0                    00  r[6]=r[4]; bar
17    IntCopy        1     7     0                    00  r[7]=r[1]; rowid
18    MakeRecord     6     2     5                    00  r[5]=mkrec(r[6..7]); for bar_idx
19    Noop           0     0     0                    00  uniqueness check for sqlite_autoindex_tbl_1
20    SCopy          3     9     0                    00  r[9]=r[3]; foo
21    IntCopy        1     10    0                    00  r[10]=r[1]; rowid
22    MakeRecord     9     2     8                    00  r[8]=mkrec(r[9..10]); for sqlite_autoindex_tbl_1
23    NoConflict     2     25    9     1              00  key=r[9]
24    Halt           2067  2     0     tbl.foo        02
25    IdxInsert      1     5     6     2              10  key=r[5]
26    IdxInsert      2     8     9     2              10  key=r[8]
27    MakeRecord     2     3     11                   00  r[11]=mkrec(r[2..4])
28    Insert         0     11    1     tbl            31  intkey=r[1] data=r[11]
29    Halt           0     0     0                    00
30    Transaction    0     1     2     0              01  usesStmtJournal=0
31    Goto           0     1     0                    00

sqlite>






Re: Sqlite Sharding HOWTO

Simon Slavin-3
In reply to this post by Gerlando Falauto
On 30 Jul 2018, at 8:38pm, Gerlando Falauto <[hidden email]> wrote:

> Does that apply to the primary key as well?

Primary key indexes are unique indexes, since SQLite has to enforce the primary key being unique.
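
You can see this in the shell; the one exception is INTEGER PRIMARY KEY,
which is an alias for the rowid and needs no separate index:

sqlite> CREATE TABLE t1(a INTEGER PRIMARY KEY, b TEXT);  -- 'a' is the rowid; no index created
sqlite> CREATE TABLE t2(a TEXT PRIMARY KEY, b TEXT);     -- gets an automatic unique index
sqlite> SELECT name, tbl_name FROM sqlite_master WHERE type='index';
sqlite_autoindex_t2_1|t2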

However, I do not think there can be such a strong penalty for indexes being UNIQUE.  I side with David Raymond's opinion that SQLite would not search them twice when inserting a new row.  That's merely opinion, though, since I have not read the source code.

Simon.

Re: Sqlite Sharding HOWTO

R Smith-2
In reply to this post by Gerlando Falauto
On 2018/07/30 9:45 PM, Gerlando Falauto wrote:

> On Mon, Jul 30, 2018 at 9:42 PM, David Raymond <[hidden email]>
> wrote:
>
>> Doesn't sound quite right to me.
>>
>> No matter the index you have to search through it to find the spot to do
>> the insert. Both are going to do that search only once. An insert on a
>> unique index isn't going to search through it for existence, then promptly
>> forget what it just did and do it all over again to do the insert. It's
>> going to start the insert and find the spot where the new item would go. If
>> the spot's free it succeeds, if it's taken then it fails. There is no need
>> for a second search.
>>
>>
> Hence my original question: why would a Unique Index be any worse?

I didn't argue that the difference was remarkable, and neither did Keith;
I believe he even used the words "wee bit". But let me try a simpler way
with a more convoluted example...

INSERT method 1 - Inserting a new item into a Table that has 2 Unique
Indexes (this may not be SQLite's exact way, but simply an optimal way
to do it):
First, search the BTree of Index 1 to see if a duplicate exists,
then search the BTree of Index 2 to see if that already exists;
if both do not exist, proceed with adding the data to the table,
get the new row-id reference (a B-Tree insert by itself),
then insert Unique Index 1 and Unique Index 2, each with its row-id
reference, into their BTrees.
(Quite possibly optimized by building a pointer-array of unique indexes
and remembering where all of them had been after the initial traversal.)

INSERT method 2 - Inserting a new item into a Table with 2 non-unique
Indexes:
Proceed immediately with adding the data to the table (no fallible
constraint),
get the new row-id reference and append it to the Index values,
then insert Index 1 and Index 2, with the appended row-id reference, into
their Index Trees.

Not only that, but if the cardinality of the non-Unique indexes is very
low, then the work needed to find an indexed position drops
significantly[*].

My point when I mentioned it originally was not to claim that Unique
Indexes are much worse (they are not), but to point out to the OP that
even with significant size gains, insert speed should not be affected
much and should not be feared; I then had to consider that that statement
is less accurate when we add Unique Indexes, especially multiples of
them. That is the only reason for the added distinction.

Hope that's more clear and accurate,
Ryan

[*] - Here I was under the impression that the index entry is simply
appended once the key-without-rowid finds its first match. If the row-id
itself still has to be sorted into place, that reduction is lessened,
since sorting a rowid is still faster than sorting a text field, but not
significantly so.





