Bulk load strategy


Bulk load strategy

Joseph L. Casale
I am trying to bulk load about a million records, each with ~20 related records,
into two tables. I am using WAL journal mode, synchronous is off, and temp_store
is memory. The source data is static, and the database will only be used as a means
to generate reporting, so it is not vital. I am deferring index creation until after the load.
The load proceeds quickly until about 150k records, where I start encountering
statements that modify previously inserted rows. The incoming data is structured
this way: it has relational dependencies, so these scattered modifications
affect subsequent inserts.

In a scenario such as this, what is the recommended approach?

Thanks,
jlc

Re: Bulk load strategy

Richard Hipp-3
On 5/17/17, Joseph L. Casale <[hidden email]> wrote:

> [...]
> In a scenario such as this, what is the recommended approach?

I think I would set "PRAGMA journal_mode=OFF;" since you will not be
using ROLLBACK.  Put the entire bulk load inside a single transaction,
and make the cache as big as you can, depending on the amount of RAM
you have on your system.

--
D. Richard Hipp
[hidden email]
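
A minimal sketch of the configuration Richard describes, assuming a Python
loader (the thread never states the host language; the file name and cache
size are illustrative):

import sqlite3

# isolation_level=None puts the connection in autocommit mode, so we
# control the transaction boundaries explicitly.
con = sqlite3.connect("reports.db", isolation_level=None)

con.execute("PRAGMA journal_mode=OFF")     # no journal, so no ROLLBACK
con.execute("PRAGMA synchronous=OFF")      # skip fsyncs; the DB is disposable
con.execute("PRAGMA temp_store=MEMORY")
con.execute("PRAGMA cache_size=-1048576")  # negative means KiB: ~1 GiB of cache

con.execute("BEGIN")
# ... all of the bulk INSERT statements go here ...
con.execute("COMMIT")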

Re: Bulk load strategy

Clemens Ladisch
In reply to this post by Joseph L. Casale
Joseph L. Casale wrote:
> I am deferring index creation until after the load.
> The load proceeds quickly until about 150k records, where I start
> encountering statements that modify previously inserted rows.

Without an index, searching for a previous entry is likely to involve
a scan through the entire table.  It might be a better idea to have the
index available here, even with the additional cost of updating it.


Regards,
Clemens

Re: Bulk load strategy

Joseph L. Casale
Clemens Ladisch wrote:
> Without an index, searching for a previous entry is likely to involve
> a scan through the entire table.  It might be a better idea to have the
> index available here, even with the additional cost of updating it.

While that proved true, both approaches are still too slow. Holding the
data in memory so it can be manipulated before being persisted would far
exceed the workstation's memory capacity of 12 GB, so I need to come up
with a new strategy.

I tried running ANALYZE periodically to refresh the index statistics,
but it seemed not to matter. I also tried committing the transaction
before each ANALYZE at the same interval, without any success.

Anyone have any other suggestions?

Thanks guys,
jlc

Re: Bulk load strategy

Richard Hipp-3
Can you send more details about your data and the updates and indexes
you are using?

On 5/17/17, Joseph L. Casale <[hidden email]> wrote:

> [...]
> While that proved true, both approaches are still too slow.
> [...]
> Anyone have any other suggestions?


--
D. Richard Hipp
[hidden email]

Re: Bulk load strategy

Joseph L. Casale
Richard Hipp wrote:
> Can you send more details about your data and the updates and indexes
> you are using?

Sure, the database will be used to generate a myriad of custom reports based on
Active Directory data of specific types. Some of the reports are not simple,
in that they involve cross-referencing attributes of one object, such as sIDHistory,
with attributes of another, such as objectSid.

CREATE TABLE AdObject (
    Id                INTEGER PRIMARY KEY NOT NULL,
    DistinguishedName TEXT    NOT NULL COLLATE NOCASE,
    SamAccountName    TEXT    COLLATE NOCASE
);
CREATE UNIQUE INDEX AdObject_idx_0 ON AdObject (
    DistinguishedName
);
CREATE INDEX AdObject_idx_1 ON AdObject (
    SamAccountName
);

CREATE TABLE AdAttribute (
    Id       INTEGER PRIMARY KEY NOT NULL,
    Type     TEXT    NOT NULL COLLATE NOCASE,
    Value    TEXT    NOT NULL COLLATE NOCASE,
    AdObjectId INTEGER NOT NULL REFERENCES AdObject ON DELETE CASCADE ON UPDATE CASCADE
);
CREATE INDEX AdAttribute_idx_0 ON AdAttribute (
    Type
);
CREATE INDEX AdAttribute_idx_1 ON AdAttribute (
    Value
);
CREATE INDEX AdAttribute_idx_2 ON AdAttribute (
    AdObjectId
);

The bulk of the inserts look like:
INSERT INTO AdObject
      (DistinguishedName, SamAccountName)
  VALUES
      (@DistinguishedName, @SamAccountName);

INSERT OR IGNORE INTO AdAttribute
      (Type, Value, AdObjectId)
  VALUES
      (@Type, @Value, @AdObjectId);

(I just noticed the IGNORE in the second query; it serves no purpose there.)

Things grind to a halt when I start the following:

INSERT OR IGNORE INTO AdAttribute
      (Type, Value, AdObjectId)
  VALUES
      (@Type, @Value, (SELECT Id FROM AdObject WHERE DistinguishedName = @DistinguishedName));

The IGNORE above is required as the input data may ask to modify attributes for which no record exists.

Thanks for the help,
jlc
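
An editorial aside: EXPLAIN QUERY PLAN shows why this statement grinds.
Without the unique index on DistinguishedName, the sub-select scans the whole
AdObject table once per inserted row. A sketch, assuming Python and the schema
above (the file name and sample value are hypothetical):

import sqlite3

con = sqlite3.connect("reports.db")
for row in con.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT Id FROM AdObject WHERE DistinguishedName = ?",
        ("cn=example,dc=corp,dc=local",)):
    print(row)
# With AdObject_idx_0 present, the detail column reads roughly
# 'SEARCH AdObject USING INDEX AdObject_idx_0 (DistinguishedName=?)'.
# Without it, 'SCAN AdObject': a full-table scan for every INSERT.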

Re: Bulk load strategy

Gerry Snyder-4
In reply to this post by Joseph L. Casale
On Wed, May 17, 2017 at 3:52 AM, Joseph L. Casale <[hidden email]> wrote:

> [...]
> In a scenario such as this, what is the recommended approach?


If the updates pertain just to the 150k rows immediately preceding them,
could you put each 150k chunk into its own table, and then do a join when
accessing the data? Or even a merge at that point? Could be a lot faster.

Gerry Snyder

Re: Bulk load strategy

Joseph L. Casale
Gerry Snyder wrote:
> If the updates pertain just to the 150k rows immediately preceding them,
> could you put each 150k chunk into its own table, and then do a join when
> accessing the data? Or even a merge at that point? Could be a lot faster.

Hi Gerry,
The updates refer to past entries, but I have no idea when or how often
they will appear. The complicating factor is that future records in the
source data may build on past changes, so I cannot defer them.

I can certainly alter the strategy; I am just not clear on exactly what
you are suggesting.

Thanks!
jlc

Re: Bulk load strategy

Simon Slavin-3
In reply to this post by Joseph L. Casale

On 17 May 2017, at 4:06pm, Joseph L. Casale <[hidden email]> wrote:

> [... the AdObject/AdAttribute schema, indexes, and INSERT statements quoted in full above ...]

Fastest way to do bulk inserts would be to delete all the indexes which don’t play any part in identifying duplicates, then do the inserting, then remake the indexes.

So if you never insert duplicates on AdObject(DistinguishedName), DROP that index.  And definitely DROP all the others.  Then do your inserts.  Then reCREATE the indexes.

As for the inserts themselves, batching them up in transactions of 10,000 seems acceptably fast.  But depending on the amount of memory you have free 100,000 may be faster.  Or maybe even 2,000.  You’ll have to try it out.
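
(A sketch of this drop/insert/reCREATE pattern with batched transactions,
again assuming Python; the batch size and the rows iterable are placeholders:)

import sqlite3

con = sqlite3.connect("reports.db", isolation_level=None)
rows = [("cn=a,dc=corp", "a"), ("cn=b,dc=corp", "b")]  # stand-in for the feed

# Keep AdObject_idx_0: it is UNIQUE and identifies duplicates.
for idx in ("AdObject_idx_1", "AdAttribute_idx_0",
            "AdAttribute_idx_1", "AdAttribute_idx_2"):
    con.execute(f"DROP INDEX IF EXISTS {idx}")

BATCH = 10_000  # try 2_000 to 100_000 depending on free memory
con.execute("BEGIN")
for i, (dn, sam) in enumerate(rows, 1):
    con.execute("INSERT INTO AdObject (DistinguishedName, SamAccountName) "
                "VALUES (?, ?)", (dn, sam))
    if i % BATCH == 0:
        con.execute("COMMIT")
        con.execute("BEGIN")
con.execute("COMMIT")

con.execute("CREATE INDEX AdObject_idx_1 ON AdObject (SamAccountName)")
# ... recreate the AdAttribute indexes the same way, then run ANALYZE.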

I’m questioning the point of AdAttribute_idx_0 and AdAttribute_idx_1.  It’s rare to usefully index values without types, for instance.  Do you actually have a SELECT which uses that one ?  Wouldn’t it be more efficient to do

        CREATE INDEX AdAttribute_idx_tv ON AdAttribute (
           Type, Value
        );

?  That’s assuming that even that one gets used at all, since it seems far more likely that you’d use (AdObjectId,Type).

> (I just noticed the IGNORE in the second query; it serves no purpose there.)

That won’t slow down SQLite much.  Don’t worry about it.

> Things grind to a halt when I start the following:
>
> INSERT OR IGNORE INTO AdAttribute
>      (Type, Value, AdObjectId)
>  VALUES
>      (@Type, @Value, (SELECT Id FROM AdObject WHERE DistinguishedName = @DistinguishedName));

The sub-select is killing you.  Since it’s identical for all the INSERT commands I suggest that you do that first, and keep the results in memory as a lookup table, or a hashed table, or a dictionary, or whatever your preferred language does.  You can look up those values in RAM far more quickly than SQLite can do the required file handling.

Simon.
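
A sketch of the in-memory lookup table Simon suggests above, assuming Python;
a dict plays the role of the hash table. Since DistinguishedName is COLLATE
NOCASE, the keys are case-folded here, which is an approximation (SQLite's
NOCASE folds ASCII only):

import sqlite3

con = sqlite3.connect("reports.db")

# Build the DistinguishedName -> Id map once, in RAM.
dn_to_id = {dn.casefold(): oid for dn, oid in
            con.execute("SELECT DistinguishedName, Id FROM AdObject")}

def insert_attribute(cur, type_, value, dn):
    obj_id = dn_to_id.get(dn.casefold())  # RAM lookup replaces the sub-select
    if obj_id is None:
        return  # mirrors OR IGNORE: no AdObject row, so NOT NULL would reject it
    cur.execute("INSERT INTO AdAttribute (Type, Value, AdObjectId) "
                "VALUES (?, ?, ?)", (type_, value, obj_id))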

Re: Bulk load strategy

David Raymond
The unique index on DistinguishedName though is what gets used for that sub query of the insert, so most definitely keep that one index for the whole load. (The others can be left out until the end though)

Otherwise, as was mentioned, journal_mode = off, synchronous = off, a large cache_size, and running it all in 1 transaction are gonna be the things that help the most.

You shouldn't need to do intermediate ANALYZE runs; the planner's defaults should pick the correct indexes here.


Simon Slavin wrote:

> So if you never insert duplicates on AdObject(DistinguishedName), DROP that index.  And definitely DROP all the others.  Then do your inserts.  Then reCREATE the indexes.


Re: Bulk load strategy

Joseph L. Casale
In reply to this post by Simon Slavin-3
Simon Slavin wrote:
> I’m questioning the point of AdAttribute_idx_0 and AdAttribute_idx_1.  It’s
> rare to usefully index values without types, for instance.  Do you actually
> have a SELECT which uses that one ?  Wouldn’t it be more efficient to do
>
> CREATE INDEX AdAttribute_idx_tv ON AdAttribute (
>   Type, Value
> );
>
> ?  That’s assuming that even that one gets used at all, since it seems far more
> likely that you’d use (AdObjectId,Type).

There are some instances where I need to search all values regardless of type.
However, I now see that approach could make changes I don't intend.

The use case was a moddn; however, you may be right that I should constrain it
to types of "member" and "memberOf". Otherwise I could modify a free-form text
field over which I have no authority.

So I have one query which, if I encounter it, I expect will be painful:

UPDATE AdAttribute
     SET Value = @NewValue
   WHERE Type = @Type
         AND Value = @Value;

I may pass member or memberOf as @Type; without the indexes this will be abysmal.
I don't expect to see this often, and I don't have data that requires it in my large data set.

However, good catch.

> The sub-select is killing you.  Since it’s identical for all the INSERT commands I
> suggest that you do that first, and keep the results in memory as a lookup
> table, or a hashed table, or a dictionary, or whatever your preferred language
> does.  You can look up those values in RAM far more quickly than SQLite can
> do the required file handling.

It seems I provided some bad numbers: I passed -w instead of -l to `wc` when
gathering figures, so I actually have ~160k records. The application processed
at a roughly constant speed and finished quickly.

Brilliant, Simon, and thank you everyone for the guidance.

jlc

Re: Bulk load strategy

Simon Slavin-3

On 17 May 2017, at 7:07pm, Joseph L. Casale <[hidden email]> wrote:

> So I have one query which, if I encounter it, I expect will be painful:
>
> UPDATE AdAttribute
>     SET Value = @NewValue
>   WHERE Type = @Type
>         AND Value = @Value;
>
> I may pass member or memberOf as @Type; without the indexes this will be abysmal.
> I don't expect to see this often, and I don't have data that requires it in my large data set.

In this case your two separate indexes, one on Type and one on Value, will not provide much help.  The ideal index for this case is the combined one on (Type, Value).  Creating this instead of the two you listed will dramatically speed up the UPDATE command, and reduce the INSERT time and the size of the database.

In terms of the other things you mentioned, I see nothing obvious you’ve missed and I think you have enough information to proceed.

Simon.


Re: Bulk load strategy

Olaf Schmidt-2
In reply to this post by David Raymond
On 17.05.2017 at 19:08, David Raymond wrote:

> The unique index on DistinguishedName though is what gets used for that sub query of the insert, so most definitely keep that one index for the whole load. (The others can be left out until the end though)
>

I once had a similar scenario and solved it with good speed by following
Simon's suggestion to drop all indexes first; I then "manually ensured
uniqueness" with a DB-independent, ordinary hash list used for fast
exists-lookups.

After the import went through (at about 300,000 records/sec), I freed the
hash list and recreated the indexes on the DB.

Not sure, though, how many unique DistinguishedNames Joseph has to manage
in his scenario (there is a risk of running out of memory on the hash
container).

Olaf
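
(A compact sketch of Olaf's hash-list idea, again in Python; a set serves as
the hash container for the exists-lookups, and the feed is a placeholder:)

import sqlite3

con = sqlite3.connect("reports.db", isolation_level=None)
rows = [("cn=a,dc=corp", "a"), ("CN=A,DC=CORP", "dup")]  # stand-in feed

seen = set()  # uniqueness enforced in RAM instead of by a UNIQUE index
con.execute("BEGIN")
for dn, sam in rows:
    key = dn.casefold()  # approximate the column's COLLATE NOCASE
    if key in seen:
        continue  # duplicate DistinguishedName: skip it
    seen.add(key)
    con.execute("INSERT INTO AdObject (DistinguishedName, SamAccountName) "
                "VALUES (?, ?)", (dn, sam))
con.execute("COMMIT")

# Only now rebuild the UNIQUE index that was dropped for the load.
con.execute("CREATE UNIQUE INDEX AdObject_idx_0 "
            "ON AdObject (DistinguishedName)")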


Re: Bulk load strategy

David Raymond
The key point I was thinking of for keeping that index was that it was perfect for speeding up the foreign key check / subquery for this part. I wasn't thinking at all in terms of unique enforcement.

INSERT OR IGNORE INTO AdAttribute
      (Type, Value, AdObjectId)
  VALUES
      (@Type, @Value, (SELECT Id FROM AdObject WHERE DistinguishedName = @DistinguishedName));

But yeah, keeping track of that in your enveloping program is an option.


Re: Bulk load strategy

Simon Slavin-3
In reply to this post by Simon Slavin-3

On 17 May 2017, at 5:05pm, Simon Slavin <[hidden email]> wrote:

> Fastest way to do bulk inserts would be to delete all the indexes which don’t play any part in identifying duplicates, then do the inserting, then remake the indexes.

I forgot: once you’ve remade the indexes run ANALYZE.  That’s the only time you need to do it.  At that time you have typical data in the tables and SQLite will be able to gather all the data it needs to figure out good strategies.

Simon.

Re: Bulk load strategy

Joseph L. Casale
Simon Slavin wrote:
> I forgot: once you’ve remade the indexes run ANALYZE.  That’s the only time
> you need to do it.  At that time you have typical data in the tables and SQLite
> will be able to gather all the data it needs to figure out good strategies.

Thanks for all the assistance, guys; it's working within reasonable limits on
the given hardware.

The unique constraint on a distinguished name is more about data integrity.
I am associating data against that value and it doesn't make sense to have more
than one. So if an "add" comes along unexpectedly (instead of an "update"), the
best way to know something is awry is for everything to turn pear-shaped.

Kinda makes it hard for bugs to go unnoticed. :)

Much appreciated everyone,
jlc