Multiple Independent Database Instances

14 messages
Multiple Independent Database Instances

Lee, Jason
Hi. Are there any gotchas when opening multiple independent databases from within one process using the C API? I am opening one database per thread in my code, and noticed that sqlite3_open_v2 and sqlite3_close slow down as the number of threads increases, indicating there might be some resource contention somewhere, even though the databases should be independent of each other.
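(For reference, the pattern being described — one private database per thread, opened and closed inside that thread — can be sketched with Python's stdlib sqlite3 module standing in for the C API. The file names and schema here are hypothetical placeholders, not from the original code.)

```python
import os
import sqlite3
import tempfile
import threading

def worker(path):
    # Each thread opens its own, independent database file:
    # the analogue of sqlite3_open_v2 in the C API.
    con = sqlite3.connect(path)
    try:
        con.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
        con.execute("INSERT INTO t VALUES (1)")
        con.commit()
    finally:
        con.close()  # the analogue of sqlite3_close

tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, f"db_{i}.sqlite") for i in range(4)]
threads = [threading.Thread(target=worker, args=(p,)) for p in paths]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"created {len(paths)} independent databases")
```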


Jason Lee
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: Multiple Independent Database Instances

Simon Slavin-3
On 22 Apr 2019, at 7:39pm, Lee, Jason <[hidden email]> wrote:

> Hi. Are there any gotchas when opening multiple independent databases from within one process using the C API? I am opening one database per thread in my code, and noticed that sqlite3_open_v2 and sqlite3_close slow down as the number of threads increase, indicating there might be some resource contention somewhere, even though the databases should be independent of each other.

SQLite is designed to cope with the opening of many different databases at the same time.  SQLite does not maintain its own list of open connections, so it doesn't have to iterate through the list each time you execute another command.

Those routines will get slower, since your operating system keeps a list of open files and has to walk through it.  But that overhead should be tiny and unnoticeable on any modern OS.

Can you give us any numbers as examples?  I presume your setup gives you one thread per database.  Are you sure that this slowdown is not just the result of some OS function, i.e. you've run out of real memory and the OS has to keep swapping every time you switch to another thread?

Re: Multiple Independent Database Instances

Jens Alfke-2
In reply to this post by Lee, Jason


> On Apr 22, 2019, at 11:39 AM, Lee, Jason <[hidden email]> wrote:
>
> Hi. Are there any gotchas when opening multiple independent databases from within one process using the C API?

Do you mean different database files, or multiple connections to the same file?

> I am opening one database per thread in my code, and noticed that sqlite3_open_v2 and sqlite3_close slow down as the number of threads increase, indicating there might be some resource contention somewhere, even though the databases should be independent of each other.

How many databases/threads? With huge numbers I’d expect slowdowns, since you’ll be bottlenecking on I/O.

But otherwise there shouldn’t be any gotchas. I would troubleshoot this by profiling the running code to see where the time is being spent. Even without knowledge of the SQLite internals, looking at the call stacks of the hot-spots can help identify what the problem is.
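(Jens's suggestion — profile first and read the hot call stacks — needs no SQLite internals knowledge. A minimal sketch with Python's stdlib profiler; the workload function is a stand-in for the real per-thread work:)

```python
import cProfile
import io
import pstats
import sqlite3

def workload():
    # Stand-in for the real work: open, query, close.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (x INTEGER)")
    con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
    con.execute("SELECT SUM(x) FROM t").fetchone()
    con.close()

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Print the hottest functions by cumulative time; the top entries
# show where the time is actually being spent.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```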

—Jens

Re: Multiple Independent Database Instances

Lee, Jason
Thanks for the quick responses!


I am on a machine with many, many cores, 500GB RAM, and lots of NVMe drives RAIDed together, so the system should not be the issue. I have been testing with 16, 32, and 48 threads/databases at once, and the cumulative time it takes for all of the threads to just open all (millions) of the databases goes from 1200 seconds to 2200 seconds to 3300 seconds.


As mentioned, this is likely to be something else, but I was hoping that I was somehow using SQLite wrong.


Jason Lee


Re: Multiple Independent Database Instances

Simon Slavin-3
On 22 Apr 2019, at 9:08pm, Lee, Jason <[hidden email]> wrote:

> the cumulative time it takes for all of the threads to just open all (millions) of the databases goes from 1200 seconds to 2200 seconds to 3300 seconds.

I'm guessing that it's the number of file handles which increases.  Most OSes maintain a linked list of file buffer metadata, so opening the thousandth file means walking past a thousand entries of metadata.  Use 'lsof' and check the output.

You might try to explore this problem by writing a program to open the same number of text files at once.  See whether that program gets slower similarly to the SQLite one.
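(That control experiment — time plain file opens with no SQLite involved, and see whether the same slowdown appears — might look like this sketch; the file count is reduced here to keep it quick:)

```python
import os
import tempfile
import time

def time_open_close(paths):
    """Open and close each path once; return total elapsed seconds."""
    start = time.perf_counter()
    for p in paths:
        f = open(p, "rb")
        f.close()
    return time.perf_counter() - start

with tempfile.TemporaryDirectory() as d:
    # Create some throwaway text files to act as the stand-ins.
    paths = []
    for i in range(500):
        p = os.path.join(d, f"f{i}.txt")
        with open(p, "w") as f:
            f.write("x")
        paths.append(p)
    elapsed = time_open_close(paths)
    print(f"opened/closed {len(paths)} files in {elapsed:.4f}s")
```

Running this from several threads at varying thread counts, the same way the SQLite code does, would show whether the growth is in the OS file layer rather than in SQLite.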

The exception would be if you're calling the C API for SQLite directly, without a wrapper library for your programming language: the C interface to SQLite does not actually open a file just because you told it to.  The file is actually opened the first time SQLite needs to do something with it.  So a million uses of sqlite3_open_v2 might still do no file access.

Re: Multiple Independent Database Instances

Keith Medcalf
In reply to this post by Lee, Jason

This is somewhat unclear.  You make two conflicting statements:

"I have been testing with 16, 32, and 48 threads/databases at once ..."
and
"time it takes for all of the threads to just open all (millions) of the databases"

So, are you:
(a) opening one independently and uniquely named database per thread as would be apparent from the first conflicting statement above; or,
(b) opening the same "millions" of databases per thread as indicated by the second conflicting statement above

?

Per my testing, the time taken to spin up a thread and open a database with a unique database name is constant per thread, scaling linearly up to the thread limit of a process (about 800 threads).  This is even true if you execute SQL against the connection within that thread and also count that execution time in the "Time Taken".

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.


Re: Multiple Independent Database Instances

Lee, Jason
I have a set of several million database files sitting on my filesystem. Each thread will open a previously unprocessed database file, do some queries, close the database, and move on to the next unprocessed database file.


Jason Lee


Re: Multiple Independent Database Instances

Simon Slavin-3
On 22 Apr 2019, at 10:25pm, Lee, Jason <[hidden email]> wrote:

> I have a set of several million database files sitting on my filesystem. Each thread will open a previously unprocessed database file, do some queries, close the database, and move on to the next unprocessed database file.

If this process is getting slower and slower, you have a resource leak somewhere in your program.  It's possible to make SQLite do this using faulty programming.

For instance, you may have a statement that reads a table.  Statements must be finalized or reset using sqlite3_finalize() or sqlite3_reset(), and you still need to do this even if SQLite returned SQLITE_DONE to tell you there are no more rows to return.

If you do not do this, even though sqlite3_close() will run and return SQLITE_OK, it cannot release the resources used by the file because a statement is still pending.  It waits until the statement is terminated and then automatically closes the file.  So you get a resource leak until that's done.
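(In C terms the fix is to pair every prepared statement with sqlite3_finalize(), or at least sqlite3_reset(), before sqlite3_close(). Python's sqlite3 module finalizes statements for you, but the same discipline — drain and release statement resources before closing the connection — can be made explicit. A sketch of the analogy, not the C API itself:)

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

cur = con.cursor()
cur.execute("SELECT x FROM t")
rows = cur.fetchall()  # drain the statement: like stepping until SQLITE_DONE
cur.close()            # release it: the analogue of sqlite3_finalize()

con.close()            # with no pending statements, close fully releases the db
print(rows)
```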

You can start analysis by using a monitoring tool to watch the memory usage of the process.  Does it gradually use more and more memory the longer it runs?  If so, it shouldn't be too difficult to figure out what memory isn't being released.

Re: Multiple Independent Database Instances

Keith Medcalf
In reply to this post by Lee, Jason

Interesting ... How does each thread know whether the file has been "previously processed" or not?

In other words, if you "get rid of" all the sqlite3 processing and replace it with a 5 ms sleep, does increasing the number of threads exhibit the same symptom?


That is if your thread code does this:

ThreadCode:
  while filename = getnextunprocessedfile()
     open the database filename
     do some stuff
     finalize all the statements
     close the database
     mark filename as being processed
  terminate cuz there is naught more to do

then replace it with this:

ThreadCode:
   while filename = getnextunprocessedfile()
     sleep 5 milliseconds
     mark filename as being processed
   terminate cuz there is naught more to do
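(Keith's two pseudocode variants can be made runnable for an A/B test — same queue and thread machinery, with the SQLite work swappable for the 5 ms sleep. The paths and thread count below are placeholders:)

```python
import os
import queue
import sqlite3
import tempfile
import threading
import time

def run_workers(paths, nthreads, use_sqlite=True):
    """Process each path exactly once across nthreads worker threads."""
    q = queue.Queue()
    for p in paths:
        q.put(p)
    done, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                path = q.get_nowait()       # getnextunprocessedfile()
            except queue.Empty:
                return                      # naught more to do
            if use_sqlite:
                con = sqlite3.connect(path) # open the database
                con.execute("SELECT 1")     # "do some stuff"
                con.close()                 # close the database
            else:
                time.sleep(0.005)           # the 5 ms control
            with lock:
                done.append(path)           # mark filename as processed

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

demo_dir = tempfile.mkdtemp()
demo_paths = [os.path.join(demo_dir, f"{i}.db") for i in range(8)]
processed = run_workers(demo_paths, nthreads=4)
print(f"processed {len(processed)} databases")
```

Timing `run_workers(paths, n)` against `run_workers(paths, n, use_sqlite=False)` for increasing n separates growth caused by the database work from growth caused by the queue and thread machinery.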

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.



Re: Multiple Independent Database Instances

James K. Lowden
In reply to this post by Lee, Jason
On Mon, 22 Apr 2019 21:25:31 +0000
"Lee, Jason" <[hidden email]> wrote:

> I have a set of several million database files sitting on my
> filesystem. Each thread will open a previously unprocessed database
> file, do some queries, close the database, and move on to the next
> unprocessed database file.

Fascinating.  One wonders what Feynman would have said.  

Even with gobs of RAM and solid-state storage, I/O will quickly
bottleneck because the processor is 3 orders of magnitude faster than
RAM and 6 orders faster than the disk.  Once you exhaust the I/O bus,
it's exhausted.  

I would build a pipeline, and let processes do the work. Write a program
to process a single database: open, query, output, close.  Then define
a make(1) rule to convert one database into one output file.  Then run
"make -j dd" where "dd" is the number of simultaneous processes (or
"jobs").  I think you'll find ~10 processes is all you can sustain.  

You could use the sqlite3 utility as your "program", but it's not very
good at detecting errors and returning a nonzero return status to the
OS. Hence a bespoke program.  Also, you can get the data into binary
form, suitable for concatenation into one big file for input into your
numerical process.  That will go a lot faster.  

Although there's some overhead to invoking a million processes, it's
dwarfed by the I/O time.  

The advantage of doing the work under make is that it's reusable and
restartable.  If you bury the machine, you can kill make and restart it
with a lower number of jobs.  If you find some databases are corrupt or
incomplete, you can replace them, and make will reprocess only the new
ones.  If you add other databases at a later time, make will process
only those.  You can add subsequent steps, too; make won't start from
square 1 unless it has to.  

With millions of inputs, the odds are you will find problems.
Perfectly good input over a dataset that size has probably occurred
before in recorded history, but not frequently.  

I assume your millions of databases are not in a single directory; I'd
guess you have 1000s of directories.  They offer convenient work
partitions, which you might need; I have no idea how make will respond
to a dependency tree with millions of nodes.  

--jkl


Re: Multiple Independent Database Instances

Lee, Jason
> How does each thread know whether the file has been "previously processed" or not?


The paths are pushed onto a queue and each thread pops the top off. I am also looking into the queuing code to see if there are issues.


> In other words, if you "get rid of" all the sqlite3 processing and replace it with a 5 ms sleep, does increasing the number of threads exhibit the same symptom?


The timings were for sqlite3_open_v2, not for the whole process. The current code is effectively just an sqlite3_open_v2 followed by an sqlite3_close, and yet the time it takes to complete sqlite3_open_v2 still increases with the number of threads.


> Even with gobs of RAM and solid-state storage, I/O will quickly
> bottleneck because the processor is 3 orders of magnitude faster than
> RAM and 6 orders faster than the disk.  Once you exhaust the I/O bus,
> it's exhausted.


I/O is not the bottleneck. I have 8 NVMe drives in RAID0. I have not been able to load the disks in the slightest, because the threads spend the majority of their time in sqlite3_open_v2, sqlite3_close, and sqlite3_prepare_v2.


Jason Lee


Re: Multiple Independent Database Instances

Simon Slavin-3
On 23 Apr 2019, at 12:32am, Lee, Jason <[hidden email]> wrote:

> The current code is effectively just an sqlite3_open_v2 followed by an sqlite3_close

Then either your code is faulty, and doesn't actually do this, or your problem has nothing to do with SQLite.

SQLite doesn't open a database file when you use sqlite3_open_v2().  It doesn't even check whether the file, or even the path, exists.  The file is opened only when you use an API function which needs to read or write the file.  _open() followed by _close() just uses up a little memory to store the file path and some other settings, then releases it again.

Re: Multiple Independent Database Instances

Keith Medcalf
In reply to this post by Lee, Jason

Interesting.  If you can guarantee that each database will only ever be accessed from one single thread, give it a try with SQLITE_OPEN_NOMUTEX | SQLITE_OPEN_READONLY in the flags parameter of sqlite3_open_v2 ...

Don't know if it will make a difference, but it might.
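(In the C API that is sqlite3_open_v2(path, &db, SQLITE_OPEN_NOMUTEX | SQLITE_OPEN_READONLY, NULL). Python's sqlite3 module exposes no NOMUTEX switch, but the read-only half has a rough analogue in a URI open with mode=ro — a sketch of the idea, not the C call itself:)

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "ro.db")

    # Create a small database to read back.
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE t (x INTEGER)")
    con.execute("INSERT INTO t VALUES (42)")
    con.commit()
    con.close()

    # Reopen read-only: the analogue of SQLITE_OPEN_READONLY.
    ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    value = ro.execute("SELECT x FROM t").fetchone()[0]
    try:
        ro.execute("INSERT INTO t VALUES (1)")
        write_failed = False
    except sqlite3.OperationalError:
        write_failed = True  # writes are rejected on a read-only connection
    ro.close()

print(value, write_failed)
```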

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.


>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>[hidden email]] On Behalf Of Lee, Jason
>Sent: Monday, 22 April, 2019 17:33
>To: [hidden email]
>Subject: Re: [sqlite] Multiple Independent Database Instances
>
>> How does each thread know whether the file has been "previously
>processed" or not?
>
>
>The paths are pushed onto a queue and each thread pops the top off. I
>am also looking into the queuing code to see if there are issues
>
>
>> In other words, if you "get rid of" all the sqlite3 processing and
>replace it with a 5 ms sleep, does increasing the number of threads
>exhibit the same symptom?
>
>
>The timings were for sqlite3_open_v2, not for the whole process. The
>current code is effectively just an sqlite3_open_v2 followed by an
>sqlite3_close, and yet the time it takes to complete sqlite3_open_v2
>still increases with the number of threads.
>
>
>> Even with gobs of RAM and solid-state storage, I/O will quickly
>> bottleneck because the processor is 3 orders of magnitude faster
>than
>> RAM and 6 orders faster than the disk.  Once you exhaust the I/O
>bus,
>> it's exhausted.
>
>
>I/O is not the bottleneck. I have 8 NVMe drives in RAID0. I have not
>been able to drive the disks in the slightest because the threads
>spend the majority of their time in sqlite3_open_v2, sqlite3_close,
>and sqlite3_prepare_v2.
>
>
>Jason Lee
>
>________________________________
>From: sqlite-users <[hidden email]> on
>behalf of James K. Lowden <[hidden email]>
>Sent: Monday, April 22, 2019 4:53:42 PM
>To: [hidden email]
>Subject: Re: [sqlite] Multiple Independent Database Instances
>
>On Mon, 22 Apr 2019 21:25:31 +0000
>"Lee, Jason" <[hidden email]> wrote:
>
>> I have a set of several million database files sitting on my
>> filesystem. Each thread will open a previously unprocessed database
>> file, do some queries, close the database, and move on to the next
>> unprocessed database file.
>
>Fascinating.  One wonders what Feynman would have said.
>
>Even with gobs of RAM and solid-state storage, I/O will quickly
>bottleneck because the processor is 3 orders of magnitude faster than
>RAM and 6 orders faster than the disk.  Once you exhaust the I/O bus,
>it's exhausted.
>
>I would build a pipeline, and let processes do the work.  Write a
>program to process a single database: open, query, output, close.
>Then define a make(1) rule to convert one database into one output
>file.  Then run "make -j dd" where "dd" is the number of simultaneous
>processes (or "jobs").  I think you'll find ~10 processes is all you
>can sustain.
>
>You could use the sqlite3 utility as your "program", but it's not
>very good at detecting errors and returning a nonzero return status
>to the OS.  Hence a bespoke program.  Also, you can get the data into
>binary form, suitable for concatenation into one big file for input
>into your numerical process.  That will go a lot faster.
>
>Although there's some overhead to invoking a million processes, it's
>dwarfed by the I/O time.
>
>The advantage of doing the work under make is that it's reusable and
>restartable.  If you bury the machine, you can kill make and restart
>it with a lower number of jobs.  If you find some databases are
>corrupt or incomplete, you can replace them, and make will reprocess
>only the new ones.  If you add other databases at a later time, make
>will process only those.  You can add subsequent steps, too; make
>won't start from square 1 unless it has to.
>
>With millions of inputs, the odds are you will find problems.
>Perfectly good input over a dataset that size has probably occurred
>before in recorded history, but not frequently.
>
>I assume your millions of databases are not in a single directory;
>I'd guess you have 1000s of directories.  They offer convenient work
>partitions, which you might need; I have no idea how make will
>respond to a dependency tree with millions of nodes.
>
>--jkl
>



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Multiple Independent Database Instances

Tommy Lane-2
Have you tried utilizing a RAM disk in your pipeline? Before adding a database to the work queue, move or copy the file to somewhere in /tmp/, then point the worker thread at it. The accesses should be much faster.

Hope you’re making progress.
-Tommy

On Apr 22, 2019, at 6:44 PM, Keith Medcalf <[hidden email]> wrote:


Interesting.  If you can guarantee that each database will only ever be accessed from one single thread, give it a try with SQLITE_OPEN_NOMUTEX | SQLITE_OPEN_READONLY in the flags parameter of sqlite3_open_v2 ...

Don't know if it will make a difference, but it might.

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.


> -----Original Message-----
> From: sqlite-users [mailto:sqlite-users-
> [hidden email]] On Behalf Of Lee, Jason
> Sent: Monday, 22 April, 2019 17:33
> To: [hidden email]
> Subject: Re: [sqlite] Multiple Independent Database Instances
>
>> How does each thread know whether the file has been "previously
>> processed" or not?
>
>
> The paths are pushed onto a queue and each thread pops the top off.
> I am also looking into the queuing code to see if there are issues.
>
>
> [Remainder of quoted message snipped; the quoted exchange with
> James K. Lowden appears in full earlier in the thread.]



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users