Tracking item history using SQLite

Tracking item history using SQLite

Random Coder
First off, if this sort of "code review" style question is
inappropriate for this list, please feel free to reply to me directly
and tell me to stop, and I'll know to not do this again.

That said, I have a question on the use of SQLite.  At the risk of
falling into the XY problem, I'll give some details on what I'm doing,
and then how I'm doing it.  My basic question is whether what I'm doing
is valid, and whether I'm doing anything needlessly wasteful.

I have a system monitoring events (I know I'm being vague on the exact
nature of these events; I can't provide too many details, sorry.
Please try to accept what I say about them as given).  The events have
IDs that are generated externally; they're 30-40 ASCII characters
long, appear random, and are known by external means to be unique for
a given event.  For the purposes of this particular problem, the only
thing I care about tracking is when I first saw an event and the last
time I saw it.  For better or worse, this ecosystem already thinks
about timestamps as the number of minutes since a specific epoch and
is used to treating all time values as integers in that space, so I'm
doing the same here.

So, I have a RESTful server written in Python, using APSW to create a
simple SQLite database:

    CREATE TABLE IF NOT EXISTS
        event(
            event_id TEXT PRIMARY KEY,
            first_seen INTEGER,
            last_seen INTEGER
        ) WITHOUT ROWID;

Every time a new event comes in (they might be very out of order), I do a:

    INSERT INTO event(event_id, first_seen, last_seen) VALUES(?, ?, ?)
    ON CONFLICT(event_id) DO UPDATE SET
        first_seen = MIN(excluded.first_seen, event.first_seen),
        last_seen = MAX(excluded.last_seen, event.last_seen);

To create the record for the event if it's new, or to update an
existing one with new values.  To give a sense of scale, I have around
5 billion events stored right now, covering the past couple of years,
in a 250 GB database, and I see around 20 million events per day; some
small percentage of those are new.
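
To make the intended behaviour concrete, here is a toy illustration
(the ID and timestamps are made up; timestamps are minutes since our
epoch):

    -- First sighting of a new event at minute 1000:
    INSERT INTO event(event_id, first_seen, last_seen) VALUES('abc123', 1000, 1000)
    ON CONFLICT(event_id) DO UPDATE SET
        first_seen = MIN(excluded.first_seen, event.first_seen),
        last_seen = MAX(excluded.last_seen, event.last_seen);
    -- Row is now ('abc123', 1000, 1000).

    -- An older sighting (minute 900) arrives late; only first_seen moves:
    INSERT INTO event(event_id, first_seen, last_seen) VALUES('abc123', 900, 900)
    ON CONFLICT(event_id) DO UPDATE SET
        first_seen = MIN(excluded.first_seen, event.first_seen),
        last_seen = MAX(excluded.last_seen, event.last_seen);
    -- Row is now ('abc123', 900, 1000).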

The important thing I do for users is pull up reports.  A report is
roughly a summary of how old events are (when they were first seen,
and how long they've been seen for).  Outliers are highlighted, as are
events that haven't been seen at all.  The user will provide around
ten thousand event IDs; the majority of them, approaching 99%, will
exist in my database.  When the user requests a report, I create an
in-memory database:

   ATTACH ':memory:' AS mem_db;
   CREATE TABLE mem_db.valid(event_id TEXT PRIMARY KEY);

And populate that table with the events the user is interested in.
I'm doing this since I won't get the list of items all at once; it's
built up over some minutes.  If the system dies in the middle of a
request, it's OK to start over.  Then I run:

    SELECT
        mem_db.valid.event_id,
        event.first_seen,
        event.last_seen
    FROM
        mem_db.valid
    LEFT JOIN event ON
        event.event_id = mem_db.valid.event_id;

And gather up the results and pretty them up for the user.
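
In case it helps to see the whole shape of it, here is the population
step plus a variant of the join that flags never-seen rows (a sketch;
the IDs are made up, and in reality the inserts happen incrementally):

    -- Populated bit by bit as the user's IDs trickle in:
    INSERT OR IGNORE INTO mem_db.valid(event_id) VALUES ('abc123'), ('zzz999');

    -- Same join as above; never-seen events come back with NULL
    -- first_seen/last_seen, which is what the report highlights:
    SELECT
        mem_db.valid.event_id,
        event.first_seen,
        event.last_seen,
        event.event_id IS NULL AS never_seen
    FROM
        mem_db.valid
    LEFT JOIN event ON
        event.event_id = mem_db.valid.event_id;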

Does all of this seem valid?  It works, so I'm OK with it, but I'm far
from a SQLite expert, and I want to know if I'm going to be backing
myself into a corner or otherwise torturing things that should be done
differently.  Or, if the answer is: "Don't use SQLite for that",
that's fine too, I'll start looking at other options.

Thanks for any feedback.

Re: Tracking item history using SQLite

Keith Medcalf

On Friday, 30 August, 2019 14:45, Random Coder <[hidden email]> wrote:

>    CREATE TABLE IF NOT EXISTS
>        event(
>            event_id TEXT PRIMARY KEY,
>            first_seen INTEGER,
>            last_seen INTEGER
>        ) WITHOUT ROWID;

So first_seen and last_seen are permitted to be NULL?

>    INSERT INTO event(event_id, first_seen, last_seen) VALUES(?, ?, ?)
>    ON CONFLICT(event_id) DO UPDATE SET
>        first_seen = MIN(excluded.first_seen, event.first_seen),
>        last_seen = MAX(excluded.last_seen, event.last_seen);

MAX(NULL, <anything not null>) -> NULL
MIN(NULL, <anything not null>) -> NULL
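
That is, for the two-argument scalar forms:

    SELECT MIN(NULL, 5);  -- NULL
    SELECT MAX(NULL, 5);  -- NULL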

>   ATTACH ':memory:' AS mem_db;
>   CREATE TABLE mem_db.valid(event_id TEXT PRIMARY KEY);

>And populate that table with the events the user is interested in.
>I'm doing this since I won't get the list of items in one list, it's
>built up over some minutes.  If the system running dies in the middle
>of a request, it's OK to start over.  Then I run:

>    SELECT
>        mem_db.valid.event_id,
>        event.first_seen,
>        event.last_seen
>    FROM
>        mem_db.valid
>    LEFT JOIN event ON
>        event.event_id = mem_db.valid.event_id;

>And gather up the results and pretty them up for the user.

>Does all of this seem valid?  It works, so I'm OK with it, but I'm
>far from a SQLite expert, and I want to know if I'm going to be
>backing myself into a corner or otherwise torturing things that
>should be done differently.  

Seems fine, other than that event.first_seen and event.last_seen are permitted to be NULL, in which case those fields will never be updated.  So while you may claim that you never store NULL in those fields, doing so would silently break the update logic, which is an integrity failure, and the purpose of a DBMS is to enforce integrity.
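
For instance, something along these lines (a sketch, not your actual DDL) would let the database enforce that for you:

    CREATE TABLE IF NOT EXISTS
        event(
            event_id TEXT PRIMARY KEY,
            first_seen INTEGER NOT NULL,
            last_seen INTEGER NOT NULL,
            CHECK (first_seen <= last_seen)
        ) WITHOUT ROWID;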

>Thanks for any feedback.

Just my 2 cents.

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.

Re: Tracking item history using SQLite

Random Coder
On Fri, Aug 30, 2019 at 3:35 PM Keith Medcalf <[hidden email]> wrote:
> Seems fine, other than that event.first_seen and event.last_seen can be NULL, in which case that field will never be updated.  So while you may claim that you never store NULL in those fields, doing so will cause non-workage due to integrity failure,

Good point.  There are no code paths that could put NULL in there now,
but as you say, I should make sure that's the case in the design, in
case the insert logic changes in the future.  Thanks for catching
this.

> and the purpose of a DBMS is to enforce integrity.

And thanks a ton for this comment, I need to get in this mentality, clearly.
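
For the record, what I have in mind is rebuilding the table with the
constraints in place, roughly like this (an untested sketch; ALTER
TABLE can't add NOT NULL to an existing column):

    BEGIN;
    CREATE TABLE event_new(
        event_id TEXT PRIMARY KEY,
        first_seen INTEGER NOT NULL,
        last_seen INTEGER NOT NULL
    ) WITHOUT ROWID;
    INSERT INTO event_new SELECT event_id, first_seen, last_seen FROM event;
    DROP TABLE event;
    ALTER TABLE event_new RENAME TO event;
    COMMIT;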

Re: Tracking item history using SQLite

Rob Willett
In reply to this post by Random Coder
Hi,

We have a very similar system that captures traffic incident information
such as accidents, roadworks, traffic jams and sends personalised
information to drivers based on their routes.

We realised after a few years that our original design for the database
was inefficient, as we had a single table that was about 60GB, circa 500M
rows, that consisted of traffic incidents and updates. The inefficiency
was our design and not SQLite.

This inefficiency showed when we used to run a daily report that
summarised the traffic information for historical trends, e.g. we wanted
to know the average time for a road accident to be cleared on a stretch
of road on a specific day or time of day. This report meant we read in
circa 500M rows and 60GB of data each night, and it took circa 90 mins
to run as we did a lot of calculations.

We were not able to change the structure of the database too much (as it
was too far down the line). So we spent a long time looking at the data
with cold, wet towels on our foreheads in darkened rooms :) After we
analysed the data we realised that most of our 'updates' were not
actually updates at all; all we were actually interested in was the
start and end times of incidents, and all the 'stuff' in the middle was
just noise.  The problem we had was that disruptions could last a long
time and there was no actual end-of-incident marker, just the absence of
another disruption with that unique identifier and a later timestamp.
This sounds similar to your situation. Basically we constructed a simple
state model based on time.

What we now do is run a Perl script each night to remove anything that
lies between the first and last sightings of an incident. We know that
the last sighting may move on the next day, but at that point in time it
is still the last/most recent indicator. Our data is structured around
time, so we always store a time-of-insertion value in each row, as
things may not be ordered by rowid.
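
In plain SQL terms the nightly prune is roughly the following (a sketch
with illustrative table and column names; the real thing is Perl with
extra heuristics):

    -- Keep only the earliest and latest update per incident; delete the middle.
    DELETE FROM incident_update
    WHERE update_time > (SELECT MIN(update_time) FROM incident_update AS u
                         WHERE u.incident_id = incident_update.incident_id)
      AND update_time < (SELECT MAX(update_time) FROM incident_update AS u
                         WHERE u.incident_id = incident_update.incident_id);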

Once we had this insight, we took our database down from 60GB to 800MB,
and the report went from 90 mins to 90 secs. We have to give credit to
this mailing list for the patient and courteous way it helped and
assisted with our often stupid questions.

The lessons we learnt from this are:

1. Get your database schema right from the start. We didn't. In
hindsight we should have spent longer looking at this.

2. Put in more information than you think you may need just in case. We
did that. We put time information in at per second granularity.

3. Don't worry too much about disk space unless you are on a constrained
device. In hindsight we could have stored even more data :)

4. Think hard about solutions that are simple. We love simplicity as
this makes it easier to go back and change things. Too many dependencies
would be too difficult to unpick. Keeping it simple also works for us as
we're not experts :)

5. SQLite seems to be able to do anything we want it to. We know the
'limitations' of it not being a full client/server database; just to be
clear, we do not consider this a limitation at all but rather a virtue.
Other people seem worried about the 'lack' of some datatypes, but we do
masses of data and date conversions as needed and it's never been a
speed issue or any issue.

6. Ask this group for help. We realised that the more information we
provided in a clear and concise manner in our help email, the more help
we got; it was a virtuous circle. The sum help of the people in this
group far, far exceeds our own knowledge. As a rule of thumb, if we ask
for help, we expect it to take 1-2 hours as a minimum for us to write
the email. That may be because we're not SQL experts. Also, follow up
every email: if people have taken the time to reply to us, they deserve
a reply themselves. As we are English, we could end up in a
thanks-for-the-reply endless loop :) The group has an exceptionally
high signal-to-noise ratio and is invariably courteous.

I would be astonished if you can't get SQLite to do what you want to do.
We have never managed to touch the sides of the system and suspect we
never will.

More than happy to answer more questions as this group helped us and
it's only fair we offer the help back. I will state that we are not SQL
(or SQLite) experts :)

Rob


Re: Tracking item history using SQLite

CedricCicada
Why are you storing first_seen in every record?  To avoid searching for it
when reports are generated?


Re: Tracking item history using SQLite

Dominique Devienne
In reply to this post by Rob Willett
On Sat, Aug 31, 2019 at 12:24 PM Rob Willett <[hidden email]>
wrote:

> 5. SQLite seems to be able to do anything we want it to. [...]
> Other people seem worried about the 'lack' of some datatypes, we do
> masses of data and date conversations as needed and it's never been a
> speed issue or any issue.


(Since I'm often one of those "other people", I feel compelled to reply
to that one.)

As Keith wrote above in this thread, it's all about "integrity", and
that's why I'd want more datatypes in SQLite.

An integer column (e.g. number of seconds since the Epoch, or Gregorian
days, or something else) or a text column (e.g. an RFC XYZ datetime,
local-TZ or not) says nothing about the meaning of that column, and
certainly does not enforce anything by itself. Rare are the people
actually adding CHECK constraints to enforce those. So having more
specialized datatypes provides more semantic information in the schema
itself, and that's a good thing, a very good thing indeed.
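
To be concrete, this is the sort of CHECK I mean (illustrative names;
few schemas bother to write them):

    CREATE TABLE sighting(
        seen_at_minutes INTEGER NOT NULL
            CHECK (typeof(seen_at_minutes) = 'integer' AND seen_at_minutes >= 0),
        seen_at_text TEXT NOT NULL
            CHECK (datetime(seen_at_text) IS NOT NULL)  -- must parse as a datetime
    );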

Of course you can do anything with SQLite despite that, or with the
"flexible typing" dear to DRH, just like you can code anything in a
duck-typed scripting language as you can in a statically typed language.
But more typing does help in the long run IMHO, and is very
valuable. FWIW... --DD

Re: Tracking item history using SQLite

Rob Willett
In reply to this post by CedricCicada
Hi,

We aren't storing first_seen in every row. Each incident is something
like

Date_Time_of_update_in_epoch_secs1 unique_incident_number <loads of
information about the incident>
Date_Time_of_update_in_epoch_secs2 unique_incident_number <loads of
information about the incident>
Date_Time_of_update_in_epoch_secs3 unique_incident_number <loads of
information about the incident>
Date_Time_of_update_in_epoch_secs4 unique_incident_number <loads of
information about the incident>
Date_Time_of_update_in_epoch_secs5 unique_incident_number <loads of
information about the incident>

These updates are spread over time, but normally they are around
300-600 secs apart. There is no guarantee that each update contains an
entry for a given unique_incident_number, so we could have an update
with an incident missing, and then one 10 mins later with the incident
back in. The data feed is wholly out of our control, and it may be that
a traffic officer has not been able to add things in. We have to manage
the data being incomplete, but we've learnt how to by now.

What we do each night is work out the stuff in the middle of the first
and last sightings of an incident (based on the unique_incident_number)
and throw the middle stuff away. There are other heuristics that
determine what we throw away, so we built a small state machine to work
it all out. The first time we ran the system, it took about three hours
to run and work out what to throw away. That's when we found out that
dropping a table in SQLite is quite time-expensive. However, once you
do the first run, subsequent runs take around 90 secs. This means that
the report we run to work out the averages of incidents dropped from 90
mins to around 90 secs.

Most of our stuff is written in the world's most hated language, Perl,
which we find easy to use and fast. No idea why people don't like it ;)

We can't do this in real time as it's too much work, so we run it at
03:35 each day on a cronjob.

Hope this helps.

Rob
