Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Joseph Fernandes
Hi There,

1) We are trying to use sqlite3 in Glusterfs as a Change Time Recording Data Store,
i.e. using a sqlite3 db to record any modification attempts that happen on glusterfs,
per file inode, so that it can be used to indicate the hotness of a file in the cluster.

2) We have developed a prototype that inserts/updates inode records in a sqlite3 db (one instance of the
db file per glusterfs storage unit, or as we call it, a brick) to record the modifications
that happen on an inode. These inserts/updates are in the IO path of the file, i.e. any data or
metadata change time is recorded inline, w.r.t. the IO, in the DB.

3) We only save the modification time to the db, so that
we can easily query for "What files have changed during a specific period of time?" or
"What files have not changed during a specific period of time?". The query is done by a scanner
that runs on each storage unit or brick at a regular interval and chooses HOT or COLD files, as suggested
by the DB, for data maintenance operations.
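To make the idea concrete, here is a minimal sketch of the kind of schema and scanner query we have in mind (table and column names are illustrative, not the actual glusterfs schema):

```python
import sqlite3
import time

# Hypothetical schema: one row per inode, keeping only the latest change
# time. Table/column names are illustrative, not the real glusterfs schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inode_ctime (gfid TEXT PRIMARY KEY, mtime REAL)")

def record_change(gfid, ts):
    # Inline recording on the IO path: one upsert per tracked modification.
    conn.execute("INSERT OR REPLACE INTO inode_ctime VALUES (?, ?)",
                 (gfid, ts))

now = time.time()
record_change("inode-a", now - 7200)   # last changed two hours ago
record_change("inode-b", now - 60)     # last changed a minute ago

# Scanner side: HOT = changed within the last hour, COLD = the rest.
hot = [r[0] for r in conn.execute(
    "SELECT gfid FROM inode_ctime WHERE mtime >= ?", (now - 3600,))]
cold = [r[0] for r in conn.execute(
    "SELECT gfid FROM inode_ctime WHERE mtime < ?", (now - 3600,))]
print(hot, cold)   # ['inode-b'] ['inode-a']
```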

4) Therefore, we are looking at a datastore that can give us a very quick write (almost zero latency,
as the recording is done inline w.r.t. file IO) and that has good data querying
facilities (slight latency in the read is fine, but the freshness of the recorded data should be spot on).

5) Please find our DB settings:
Journal mode : WAL
Sync mode : NORMAL
Cache size : 1000 - 4096 pages (default page size)
Auto-checkpoint : 1000 - 1000000 pages
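For reference, these settings correspond to SQLite PRAGMAs; a minimal sketch of applying them (the concrete values are picked from the ranges above):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "brick.db")  # hypothetical per-brick db
conn = sqlite3.connect(path)

# Journal mode WAL: readers and the writer don't block each other.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
# NORMAL sync: in WAL mode, fsync happens at checkpoints, not per commit.
conn.execute("PRAGMA synchronous=NORMAL")
# Page cache of 4096 pages (a negative value would mean KiB instead).
conn.execute("PRAGMA cache_size=4096")
# Checkpoint automatically once the WAL reaches 1000 pages.
conn.execute("PRAGMA wal_autocheckpoint=1000")

print(mode)  # 'wal'
```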

We wanted to know the following:
1) How could we improve the performance on the write side so that we have minimal latency?
2) Will there be any write performance hit when the number of records in the DB increases?

Thanks in advance

~Joe
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Keith Medcalf

Your Glusterfs does not store modification times?  Have you considered adding the modification time attribute to the inode directly?

---
Theory is when you know everything but nothing works.  Practice is when everything works but no one knows why.  Sometimes theory and practice are combined:  nothing works and no one knows why.


Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Joseph Fernandes
Like any filesystem, glusterfs has atime, mtime and ctime. But we are trying to optimize the search for files that are "HOT" or "COLD" using sqlite3, as a full filesystem walk would kill the scanners (which are doing data maintenance).

~Joe

----- Original Message -----
From: "Keith Medcalf" <[hidden email]>
To: "General Discussion of SQLite Database" <[hidden email]>
Sent: Saturday, November 22, 2014 9:02:30 PM
Subject: Re: [sqlite] Using Sqlite3 as a Change Time Recording Data Store in Glusterfs



Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Eduardo Morras-2
In reply to this post by Joseph Fernandes
On Fri, 21 Nov 2014 14:01:39 -0500 (EST)
Joseph Fernandes <[hidden email]> wrote:

> We wanted to known the following
> 1) How could we improve the performance on the write side so that we
> have minimal latency?

I'll wait for the "ota" extension Mr. Hipp is developing. It will (if I understood correctly) reorder the commit operations to get faster IO throughput.

Some tricks:
a) If you use SQLite single-threaded, compile it without thread support.
b) Adjust the page size to fit the hard disk cluster size (8KB IIRC), or a multiple of it that an integer number of rows fits in.
c) Disable autovacuum.
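Tricks (b) and (c) can be applied per database with PRAGMAs; a minimal sketch (the 8KB page size follows the suggestion above; note page_size only takes effect if set before the database file is initialized):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "brick.db")  # fresh, uninitialized db
conn = sqlite3.connect(path)

# (b) 8 KiB pages: only takes effect if set before the db file is initialized.
conn.execute("PRAGMA page_size=8192")
# (c) autovacuum off (0 = NONE): freed pages stay on the freelist.
conn.execute("PRAGMA auto_vacuum=0")
conn.execute("CREATE TABLE t (x)")   # first write materializes the settings
conn.commit()

page = conn.execute("PRAGMA page_size").fetchone()[0]
vac = conn.execute("PRAGMA auto_vacuum").fetchone()[0]
print(page, vac)  # 8192 0
```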





---   ---
Eduardo Morras <[hidden email]>

Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Joseph Fernandes
Thanks Eduardo.

Answers inline JOE>>

----- Original Message -----
From: "Eduardo Morras" <[hidden email]>
To: [hidden email]
Sent: Saturday, November 22, 2014 9:20:46 PM
Subject: Re: [sqlite] Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Some tricks:
a) If you use SQLite single-threaded, compile it without thread support.
JOE>> We are running it in multithread mode, as the database will be fed by multiple file IO threads.
b) Adjust the page size to fit the hard disk cluster size (8KB IIRC), or a multiple of it that an integer number of rows fits in.
c) Disable autovacuum.
JOE>> Sure, will try these out in my tests.




Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

James K. Lowden
In reply to this post by Joseph Fernandes
On Fri, 21 Nov 2014 14:01:39 -0500 (EST)
Joseph Fernandes <[hidden email]> wrote:

> 4) Therefore, we are looking at a datastore that can give us a very
> quick write(almost zero latency, as the recording is done inline
> w.r.t file IO) and that as good data querying facilities(Slight
> latency in the read is fine but the fresh of that record data should
> be spot on).

This strikes me as a classic case for "record, then analyze".  I would
capture the data more cheaply and use SQLite to decide what to do.  

You don't really care if the recorded times are exactly right; a few
missed updates wouldn't affect the cold/hot status very much.  You
should be willing to lose a few if you improve write latency.  OTOH
the maintenance operation isn't *very* time critical; you just can't
afford to walk the whole tree first.  

That suggests two possibilities for capture:

1.  Keep a sequential file of {name,time} or {inode,time} pairs
(whichever is more convenient to use).  By using O_APPEND you get
atomic writes and perfect captures across threads.  fsync(2) as
desired.  

2.  If in practice the above file grows too large, use a primitive
hashing store such as BerkeleyDB to capture counts by name/inode.  It's
not even obvious you need an external store; you might be able to get
away with std::hash_map in C++ and periodically serialize that.  ISTM
you don't need to worry about concurrency because a few missed updates
here and there won't change much.  

At maintenance time, scoop the file into a SQLite table, and you're
back where you started, except you already have zero
write-time latency.  
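Possibility 1 can be sketched end to end: append {inode,time} records with O_APPEND, then scoop the log into SQLite at maintenance time (the file layout and names here are illustrative):

```python
import os
import sqlite3
import tempfile

log = os.path.join(tempfile.mkdtemp(), "capture.log")

# Capture side: O_APPEND makes each small write land atomically at the end
# of the file, so concurrent threads can share the log without interleaving.
fd = os.open(log, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
def record(inode, ts):
    os.write(fd, f"{inode} {ts}\n".encode())

record(101, 1416600000.0)
record(202, 1416600100.0)
record(101, 1416600200.0)   # later touch of the same inode
os.close(fd)

# Maintenance side: scoop the log into SQLite, keeping the latest time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE changes (inode INTEGER PRIMARY KEY, mtime REAL)")
with open(log) as f:
    for line in f:
        inode, ts = line.split()
        conn.execute("INSERT OR REPLACE INTO changes VALUES (?, ?)",
                     (int(inode), float(ts)))

rows = conn.execute(
    "SELECT inode, mtime FROM changes ORDER BY inode").fetchall()
print(rows)  # [(101, 1416600200.0), (202, 1416600100.0)]
```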

HTH.  

--jkl
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Joseph Fernandes
Thanks James

Please find the answers inline JOE>>

----- Original Message -----
From: "James K. Lowden" <[hidden email]>
To: [hidden email]
Sent: Saturday, November 22, 2014 10:29:01 PM
Subject: Re: [sqlite] Using Sqlite3 as a Change Time Recording Data Store in Glusterfs


You don't really care if the recorded times are exactly right; a few
missed updates wouldn't affect the cold/hot status very much.  You
should be willing to lose a few if you improve write latency.  OTOH
the maintenance operation isn't *very* time critical; you just can't
afford to walk the whole tree first.  

JOE>> Agree! We can afford to miss some updates w.r.t. subsequent IO, i.e.
if a file is being written with data, we don't update the db for every
write that comes in. We have a timer- and counter-based approach that guides
when the db should be updated for data IO. For metadata IO we don't want
to miss the updates, as we also keep a record of the hardlinks for an inode in
the database (a gluster-internal requirement for the data maintenance scanners).


JOE>>
1) Well, we already have a journal log called "changelog" that does that for
us (records each modification as a log entry).
We have plans to use it to feed the db. The major challenge with
changelog is that it's built to be crash consistent, as it's used for the geo-replication
feature in glusterfs, i.e. it should not miss any update, since it needs to relay each one later to the geo-replica;
this brings performance issues into the IO path.
When compared with changelog, sqlite3 (with WAL {write-ahead logging}, which is similar to the
logging you suggest, but provided by sqlite itself) shows better performance
in the IO path.
There will be setups where geo-rep is not required and hence we need not enable the changelog,
so we are planning to provide two options here:
      a) Feed the DB (with WAL) directly via the IO path (if geo-rep is off)
      b) Feed the DB via changelog (if geo-rep is on), because we don't want two kinds of latency
         to hit the IO path, one from the DB update and another from the changelog.
2) Using the changelog to feed the db has another issue, i.e. the freshness of data in the DB w.r.t. the IO.
A few of our data maintenance scanners would require the freshness of the feed to be close to
real time. To achieve that, we are planning to use an in-memory view data structure (like an LRU)
that gets updated through the IO path (in parallel to changelog, which is updated independently),
and then a separate, frequently scheduled notification thread that would update the DB
(the notifier is not blocked while broadcasting updates). Your thoughts on this?
3) Now that we would use sqlite3 (with WAL) fed directly by the IO path (in the absence of changelog),
we are looking to get the best performance from it. Crash consistency wouldn't be a major requirement
for now. But performance and freshness of data in the DB should be spot on.
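A rough sketch of the in-memory view plus notifier-thread scheme described in (2) above (structure and names are illustrative, not the actual glusterfs code):

```python
import os
import sqlite3
import tempfile
import threading

class ChangeRecorder:
    """Illustrative sketch, not the glusterfs code: IO threads call note(),
    which only updates an in-memory dict; a scheduled notifier thread
    periodically drains the dict into SQLite in one transaction."""

    def __init__(self, db_path, interval=0.05):
        self.db_path = db_path
        self.interval = interval
        self.pending = {}                     # inode -> latest change time
        self.lock = threading.Lock()
        self.stop = threading.Event()
        self.thread = threading.Thread(target=self._notifier, daemon=True)
        self.thread.start()

    def note(self, inode, ts):                # IO path: dict update only
        with self.lock:
            self.pending[inode] = ts

    def _drain(self, conn):
        with self.lock:
            batch, self.pending = self.pending, {}
        if batch:
            with conn:                        # one transaction per batch
                conn.executemany(
                    "INSERT OR REPLACE INTO changes VALUES (?, ?)",
                    list(batch.items()))

    def _notifier(self):                      # the only thread touching the DB
        conn = sqlite3.connect(self.db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS changes"
                     " (inode INTEGER PRIMARY KEY, mtime REAL)")
        while not self.stop.is_set():
            self.stop.wait(self.interval)
            self._drain(conn)
        self._drain(conn)                     # final drain on shutdown
        conn.close()

    def close(self):
        self.stop.set()
        self.thread.join()

path = os.path.join(tempfile.mkdtemp(), "changes.db")
rec = ChangeRecorder(path)
rec.note(7, 100.0)
rec.note(7, 200.0)       # collapses with the previous note: latest time wins
rec.note(8, 150.0)
rec.close()
rows = sqlite3.connect(path).execute(
    "SELECT inode, mtime FROM changes ORDER BY inode").fetchall()
print(rows)  # [(7, 200.0), (8, 150.0)]
```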



Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Eduardo Morras-2
In reply to this post by Joseph Fernandes
On Sat, 22 Nov 2014 11:01:45 -0500 (EST)
Joseph Fernandes <[hidden email]> wrote:


> a)If you use Sqlite in single thread, compile it without thread
> support.
> JOE>> We are running it in multithread mode, as the database will be
> JOE>> fed by multiple File IO thread.

Yes and no: your app can be multithreaded, but if only one thread holds the sqlite3 pointer, you can use single-thread sqlite3. This way you have multiple worker threads and one dedicated sqlite3 thread that receives the data to write and the data queries.
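A minimal sketch of that arrangement: worker threads only enqueue records, and a single dedicated thread owns the sqlite3 connection (names are illustrative):

```python
import queue
import sqlite3
import threading

# Worker (file IO) threads never touch SQLite directly; they only enqueue
# records, so the library itself could be built with SQLITE_THREADSAFE=0.
work = queue.Queue()

def db_thread(path, result):
    conn = sqlite3.connect(path)        # the pointer lives only in this thread
    conn.execute("CREATE TABLE IF NOT EXISTS changes"
                 " (inode INTEGER PRIMARY KEY, mtime REAL)")
    while True:
        item = work.get()
        if item is None:                # sentinel: shut down after draining
            break
        conn.execute("INSERT OR REPLACE INTO changes VALUES (?, ?)", item)
    conn.commit()
    result.append(conn.execute("SELECT COUNT(*) FROM changes").fetchone()[0])
    conn.close()

done = []
t = threading.Thread(target=db_thread, args=(":memory:", done))
t.start()

# Simulated IO threads handing off change records.
for item in [(1, 10.0), (2, 20.0), (1, 30.0)]:
    work.put(item)
work.put(None)
t.join()
print(done[0])  # 2 distinct inodes recorded
```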

---   ---
Eduardo Morras <[hidden email]>

Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Joseph Fernandes
Ok, I get it now. Yes, we are using a single db connection object. But a few questions:
1) How would making sqlite3 single-threaded improve performance? Is there a special advantage in this mode over the multithread one?
2) Will it not block (wait or give SQLITE_BUSY) the multiple threads of the app (in our case glusterfs IO threads),
as now they have to wait for the single thread to complete the task?

~Joe



Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

James K. Lowden
In reply to this post by Joseph Fernandes
On Sat, 22 Nov 2014 23:25:16 -0500 (EST)
Joseph Fernandes <[hidden email]> wrote:

> 2) Using the changelog to feed the db has another issue i.e freshness
> of data in the DB w.r.t the IO. Few of our data maintainer scanners
> would require the freshness of the feed to be close to real. [...]
> Your thoughts on this.

If your in-memory LRU structure suffices to describe all "hot" files,
you're in good shape.  Rather than dumping periodically, I would
consider placing it in shared memory and write a virtual table
function for it in SQLite, such that it can be queried directly as
needed.  

To me, based on your description, the choice isn't how best to use
SQLite in line with I/O, but how best to capture the data such that
they can be aggregated with SQLite at time of update.  That choice is
one of two: 1) capture each I/O event in a sequential file, always
appending, or 2) maintain per-file counts in a hash or map.  Which is
better depends on how much you're willing to pay for each I/O.  By
paying the lookup cost of #2 each time, the total space is smaller and
the maintenance-time computation less.  

> 3) Now that we would use Sqlite3(with WAL) to be direcly feed by the
> IO path(in the absence of changelog) we are looking to get the best
> performance from it.

Metadata updates to Posix filesystems are seen as so costly that
fsync(2) on the datafile descriptor doesn't update them.  A separate
sync on the directory is required.  Compared to an in-memory update
(of metadata, in kernel space) and a single fsync call, the price of a
SQLite transaction is enormous, at a guess an order of magnitude more.
Bear in mind that WAL buys you not efficiency but consistency, the very
thing you don't really need.  The data are written sequentially to the
log and then inserted into the table.  You can expect no better than
O(n log n) performance.  Filesystems generally would never tolerate
that, but for your application you'd be the judge.

--jkl

Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Nico Williams
On Sun, Nov 23, 2014 at 4:26 PM, James K. Lowden
<[hidden email]> wrote:
> Metadata updates to Posix filesystems are seen as so costly that
> fsync(2) on the datafile descriptor doesn't update them.  A separate
> sync on the directory is required.  Compared to an in-memory update
> (of metadata, in kernel space) and a single fsync call, the price of a
> SQLite transaction is enormous, at a guess an order of magnitude more.

This.  Updates of mtime and atime in particular are expensive.

Another problem (for Lustre-style clusters) is stat(), since it
returns information that only a metadata service might know (e.g.,
file type, file ID (st_dev and st_ino), and link count) but also
information that it might not (file size, various timestamps), which
then presents enormous headaches for implementors.  There are also
write visibility rules as to stat() that some applications depend
on...  This is why "ls -l" can be slow on some such clusters.

Nico
--

Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Valentin Davydov-2
In reply to this post by Joseph Fernandes
On Fri, Nov 21, 2014 at 02:01:39PM -0500, Joseph Fernandes wrote:

> We wanted to known the following
> 1) How could we improve the performance on the write side so that we have minimal latency?
> 2) Will ther be any write performance hit when the number of records in the DB increase?

Generally speaking, you have to do some work to arrange your data (modification
times) in some ordered way. This work can be done either in advance, as you
suggest, or on demand, as some people have already told you. But either way this
will eat up the necessary resources, regardless of whose code does it,
yours or SQLite's. In practice (given magnetic disks as the underlying storage),
the most scarce of the mentioned resources is rotational/seek latency, which
detrimentally affects all disk operations of any scheduled priority. SQLite
performs extensive random disk access (mostly reads) in most operational
scenarios - selects, inserts, indexing etc. - with the possible exception of
small updates of non-indexed data (which are accessed in a similar fashion
by later selects). The only way to cope with a slow disk is keeping all
necessary data somewhere else, for example in the RAM cache. Of course,
the cache itself should be populated in advance to give this benefit, and, given
current RAM prices, it seems not very feasible to steal available memory from
smart applications in favour of a dumb cache.

Hope these considerations will help you in tuning your code.

Valentin Davydov.

Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Nico Williams
In reply to this post by Joseph Fernandes
My advice is to borrow from other clustered filesystems' experience.

If you want to adhere to POSIX semantics then st_mtime and st_size
visibility will be a particular headache, especially since you don't
know when it's OK to lie (i.e., which callers of stat() are using
st_mtime/st_size for synchronization).

Ideally we'd split stat() into metastat() and datastat()...

Nico
--

Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Nico Williams
In reply to this post by Joseph Fernandes
BTW, the experience with dedup is that doing things off-line means
never catching up, while doing them online means going slow.

You might cache as much as you can in memory then go slow when you
miss the cache...

In practice I think it's best to separate data and metadata devices so
that you can make metadata as fast as you can, writing COW-style, like
ZFS does, and caching in memory on clients and servers as much as
possible for all data writes during the writes until they are
committed.

Nico
--

Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Joseph Fernandes
In reply to this post by James K. Lowden


----- Original Message -----
From: "James K. Lowden" <[hidden email]>
To: [hidden email]
Sent: Monday, November 24, 2014 3:56:14 AM
Subject: Re: [sqlite] Using Sqlite3 as a Change Time Recording Data Store in Glusterfs


OK, a few questions on the WAL journal mechanism:
1) As far as I understand (I may be wrong), during an insert or update WAL
 just records it sequentially in the WAL file, and during a checkpoint (manual or automatic) it
 a. flushes the in-memory appends to the WAL file
 b. and then merges them with the actual tables.
 Checkpointing therefore takes a toll on performance. But suppose I don't
 checkpoint often, i.e. set the auto-checkpoint to say 1 GB or 2 GB.
 The implications of doing so would be:
 a. Reads will be slow, because now there will be a large amount of data in the log
    and it would take time to read and collate data from the log. We are fine with
    this in our use case, as we DON'T read (select queries) from the database in the file IO path ever.
    Those reads are done by data-maintenance scanners, which are scheduled. When tested with 1 million records,
    joining two tables and letting the WAL log file grow to 10 GB,
    it takes almost 1 min to retrieve 1 million records, which is fine for the data scanners as they
    can wait for hours.
 b. The WAL log file will occupy a huge amount of space.

 Here is the question: during an insert/update, does WAL do any internal search/sort
 before writing to the log, or does it just append the incoming insert/update entry to the WAL log?

2) Glusterfs is not replacing POSIX atime, mtime and ctime with this db (SQLite is not a metadata db), i.e.
   stat will always read the actual atime/mtime/ctime of the inode. Therefore, as mentioned in [1],
   we don't do any db read operations in the file IO path.

--jkl
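For reference, the settings from the original post map onto standard SQLite pragmas like this (illustrative Python; the `ctr` table and its schema are invented, and GlusterFS would use the C API):

```python
import sqlite3, tempfile, os

# The poster's settings expressed as standard SQLite pragmas
# (WAL needs a file-backed db, hence the temp file).
path = os.path.join(tempfile.mkdtemp(), "brick_ctr.db")
conn = sqlite3.connect(path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]  # → 'wal'
conn.execute("PRAGMA synchronous=NORMAL")       # fsync mainly at checkpoint time
conn.execute("PRAGMA wal_autocheckpoint=1000")  # checkpoint every ~1000 pages

conn.execute("CREATE TABLE ctr(gfid TEXT PRIMARY KEY, mtime INTEGER)")
# Each commit appends the modified pages sequentially to the -wal file;
# the random B-tree writes happen later, when a checkpoint merges the
# WAL back into the main database file.
with conn:
    conn.execute("INSERT OR REPLACE INTO ctr VALUES(?, ?)",
                 ("some-gfid", 1416700000))
```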

Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Joseph Fernandes
In reply to this post by Nico Williams


----- Original Message -----
From: "Nico Williams" <[hidden email]>
To: "General Discussion of SQLite Database" <[hidden email]>
Sent: Monday, November 24, 2014 4:35:59 PM
Subject: Re: [sqlite] Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

On Sun, Nov 23, 2014 at 4:26 PM, James K. Lowden
<[hidden email]> wrote:
> Metadata updates to Posix filesystems are seen as so costly that
> fsync(2) on the datafile descriptor doesn't update them.  A separate
> sync on the directory is required.  Compared to an in-memory update
> (of metadata, in kernel space) and a single fsync call, the price of a
> SQLite transaction is enormous, at a guess an order of magnitude more.

This.  Updates of mtime and atime in particular are expensive.

Another problem (for Lustre-style clusters) is stat(), since it
returns information that only a metadata service might know (e.g.,
file type, file ID (st_dev and st_ino), and link count) but also
information that it might not (file size, various timestamps), which
then presents enormous headaches for implementors.  There are also
write visibility rules as to stat() that some applications depend
on...  This is why "ls -l" can be slow on some such clusters.

JOE>>
Glusterfs is not replacing POSIX atime, mtime and ctime with this db (SQLite is not a metadata db), i.e.
stat will always read the actual atime/mtime/ctime of the inode. Therefore, as mentioned in [1],
we don't do any db read operations in the file IO path. The read consumers of the db are the scheduled
data-maintenance scanners, which want a smarter way to find the "HOT" and "COLD" files in the cluster.
In fact, there will be one db per storage unit (brick) of glusterfs, therefore no centralized db.

Nico
--

Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Joseph Fernandes
In reply to this post by Valentin Davydov-2
Again asking the same question:
 a. During an insert/update, does WAL do any internal search/sort
    and then insert/update to the log, or
 b. does it just append the incoming insert/update entry to the WAL log, thus keeping the
    writes sequential, and during a checkpoint (manual or automatic) do the merging with the actual tables,
    which, as you pointed out, will have to search/sort, causing a lot of random disk access?
 c. Also, we don't read the db in the file IO path. The reads are for scheduled data-maintenance scanners,
    thus making random reads of the disk an occasional event.
 Just trying to assess whether my understanding of the SQLite WAL is correct.

~Joe

----- Original Message -----
From: "Valentin Davydov" <[hidden email]>
To: "General Discussion of SQLite Database" <[hidden email]>
Sent: Monday, November 24, 2014 4:37:12 PM
Subject: Re: [sqlite] Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

On Fri, Nov 21, 2014 at 02:01:39PM -0500, Joseph Fernandes wrote:

> We wanted to know the following:
> 1) How could we improve the performance on the write side so that we have minimal latency?
> 2) Will there be any write performance hit when the number of records in the DB increases?

Generally speaking, you have to do some work to arrange your data (modification
times) in some ordered way. This work can be done either in advance, as you
suggest, or on demand, as some people have already told you. But either way it
will consume the necessary resources, regardless of whose code does it,
yours or SQLite's. In practice (given magnetic disks as the underlying storage),
the most scarce of those resources is rotational/seek latency, which
detrimentally affects all disk operations at any scheduling priority. SQLite
performs extensive random disk access (mostly reads) in most operation
scenarios - selects, inserts, indexing etc. - with the possible exception of
small updates of non-indexed data (which are accessed in a similar fashion
by later selects). The only way to cope with a slow disk is to keep all
necessary data somewhere else, for example in a RAM cache. Of course, the
cache itself has to be populated in advance to give this benefit, and, given
current RAM prices, it seems not very feasible to steal available memory from
smart applications in favour of a dumb cache.
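Giving SQLite more RAM to work with is a one-pragma change (illustrative Python; the 64 MiB figure is an arbitrary example, not a recommendation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a per-brick file db in practice
# Enlarge SQLite's page cache so hot B-tree pages stay in RAM instead of
# paying rotational latency on every lookup (negative value = size in KiB).
conn.execute("PRAGMA cache_size=-65536")  # ~64 MiB of page cache
print(conn.execute("PRAGMA cache_size").fetchone()[0])  # → -65536
```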

Hope these considerations help you in tuning your code.

Valentin Davydov.

Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Simon Slavin-3

On 25 Nov 2014, at 1:34am, Joseph Fernandes <[hidden email]> wrote:

> a. Does WAL during an insert/update in the log file do any internal search/sort
> and then insert/update to the log or
> b. Just appends the WAL log with the incoming insert/update entry, thus keeping the
>    writes sequential and during a checkpoint (manual or automatic) does the merging with the actual tables,
>    which as you pointed out will have to search/sort causing alot of random disk access?

An INSERT in WAL mode does far more than just add something to the end of the journal file.  It performs numerous operations including, for example, allocating pages and updating the primary index.  That will slow operations quite a bit.

I'm with some other posters on this: when logging modifications, don't use SQLite; just append to a text/octet file in a simple short format.  Only when it comes time to do your maintenance, read that data from the text files into SQLite.

Simon.
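Simon's flat-file scheme might look like this (illustrative Python; the tab-separated log format and the `ctr` table are invented, and a real implementation would keep the log fd open rather than reopening per event):

```python
import sqlite3, os, tempfile

logdir = tempfile.mkdtemp()
logpath = os.path.join(logdir, "changes.log")

# In the I/O path: a plain append, no SQLite involved.
def log_change(path, mtime):
    with open(logpath, "a") as f:
        f.write(f"{mtime}\t{path}\n")

# At maintenance time: bulk-load the log into SQLite in one transaction.
def load_log(conn):
    with open(logpath) as f:
        rows = (line.rstrip("\n").split("\t", 1) for line in f)
        with conn:
            conn.executemany(
                "INSERT OR REPLACE INTO ctr(mtime, path) VALUES(?, ?)", rows)
    os.remove(logpath)  # or rotate the log for the next interval

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ctr(path TEXT PRIMARY KEY, mtime INTEGER)")
log_change("/brick/a", 100)
log_change("/brick/a", 200)   # the later write wins via INSERT OR REPLACE
load_log(conn)
print(conn.execute("SELECT path, mtime FROM ctr").fetchall())
# → [('/brick/a', 200)]
```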

Re: Using Sqlite3 as a Change Time Recording Data Store in Glusterfs

Eduardo Morras-2
In reply to this post by Joseph Fernandes
On Sun, 23 Nov 2014 14:34:23 -0500 (EST)
Joseph Fernandes <[hidden email]> wrote:

> Ok I get it now. Yes, we are using a single db connection object, but a
> few questions:
> 1) How would making sqlite3 single-threaded improve the
> performance? Is there a special advantage in this mode over the
> multithread one?

The locking code isn't compiled in, so sqlite3 doesn't spend time checking locks. The overhead of that code when compiled in multithread mode is minimal, but it's there, wasting time.

> 2) Will it not block (wait or give a SQLITE_BUSY) the multiple threads
> of the app (in our case glusterfs IO threads), as now they have to
> wait for the single thread to complete the task?

Sqlite3 can manage only one writer (INSERT/UPDATE/DELETE) and multiple readers; the whole db is locked while a writer writes. It doesn't matter whether sqlite3 is in multithread mode or not.

You must implement the control logic between writing threads in the thread that owns the sqlite3 connection. Your threads send data to the sqlite3 thread, and it decides, following your rules, which writes to apply to the db and in what order.

When I did this, I found particular cases in my data-management logic that made it faster, e.g. batched writes sent by several threads updating the same row multiple times, so that only the last one was applied, or applying them all in one transaction, which is faster than applying them one by one.

> ~Joe


---   ---
Eduardo Morras <[hidden email]>