Performance Issue on Large Table


Performance Issue on Large Table

Chip Beaulieu
I have a table with 4.5 million records with full text indexing. Reads are very fast, but deleting / inserting  / updating takes on average about 50 seconds per record. I often do batches of 30,000 deletes / inserts at a time. The last batch took 10 hours to complete.

Here are the details:

Table Schema:

CREATE TABLE `Verses` (
        `ID` integer PRIMARY KEY AUTOINCREMENT,
        `BibleID` integer NOT NULL,
        `Book` integer NOT NULL,
        `Chapter` integer NOT NULL,
        `VerseNumber` integer NOT NULL,
        `Scripture` text,
        `Comments` text,
        FOREIGN KEY(`BibleID`) REFERENCES `Bibles`(`BibleID`) ON DELETE CASCADE,
        FOREIGN KEY(`Book`) REFERENCES `Books`(`ID`)
)

Indexes:

CREATE INDEX `INX_Verses_BID` ON `Verses` (`BibleID`)
CREATE INDEX `INX_Verses_BID_B_C` ON `Verses` (`BibleID`,`Book`,`Chapter`)
CREATE INDEX `INX_Verses_BI_B_C_V` ON `Verses` (`BibleID`,`Book`,`Chapter`,`VerseNumber`)
CREATE INDEX `INX_Verses_B_C_V` ON `Verses` (`Book`,`Chapter`,`VerseNumber`)

Triggers on table:

CREATE TRIGGER after_verses_delete AFTER DELETE ON Verses
BEGIN
    -- the scriptures table is the FTS5 table
    DELETE FROM scriptures WHERE ID = old.ID;
END

CREATE TRIGGER after_verses_insert AFTER INSERT ON Verses
BEGIN
    -- the scriptures table is the FTS5 table
    INSERT INTO scriptures (ID, Scripture) VALUES (new.ID, new.Scripture);
END

CREATE TRIGGER after_verses_update AFTER UPDATE OF Scripture ON Verses
BEGIN
    -- the scriptures table is the FTS5 table
    UPDATE scriptures SET Scripture = new.Scripture WHERE ID = old.ID;
END

I run the set of transactions from the command line on a Mac using the .read command. A common scenario deletes 30,000 records, then inserts a new set of 30,000 records. Here are the parameters I set up for the transaction in the .read file. To be honest, I've not seen much improvement with these, and since the batch takes so long to complete, I was concerned that tweaking other pragma statements might introduce a risk of corruption on a crash or power failure.

select time('now','-5 hours');
pragma temp_store=2;  --memory
pragma journal_mode=TRUNCATE; --default is DELETE
pragma locking_mode=EXCLUSIVE; --default is NORMAL
BEGIN TRANSACTION;
DELETE FROM Verses WHERE BibleID=38;
INSERT INTO Verses (BibleID, Book, Chapter, VerseNumber, Scripture) VALUES(38,1,1,1,'<p>In the beginning God created the heavens and the earth.');

… repeat 30,000 times with other records

COMMIT;
select time('now','-5 hours');

I also recently vacuumed the file. The issue really showed up after FTS5 was set up on the table. I suspect it's got something to do with the triggers more than the indexes. I am definitely open to suggestions. I've not been able to find much info on the internet about optimizing updates to tables with FTS.
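For reference, the same delete-then-insert batch can be driven with bound parameters instead of 30,000 literal statements in a .read file. A minimal, self-contained sketch (Python's sqlite3 module, an in-memory stand-in for the Verses schema with FTS omitted, so the non-FTS cost can be timed in isolation):

```python
import sqlite3

# In-memory stand-in for the Verses schema from the post (FTS omitted).
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE Verses (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        BibleID INTEGER NOT NULL,
        Book INTEGER NOT NULL,
        Chapter INTEGER NOT NULL,
        VerseNumber INTEGER NOT NULL,
        Scripture TEXT
    )
""")

rows = [
    (38, 1, 1, 1, '<p>In the beginning God created the heavens and the earth.'),
    # ... the remaining ~30,000 rows would go here
]

with con:  # one transaction around the whole batch, as in the .read file
    con.execute("DELETE FROM Verses WHERE BibleID = ?", (38,))
    con.executemany(
        "INSERT INTO Verses (BibleID, Book, Chapter, VerseNumber, Scripture) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )

count = con.execute(
    "SELECT COUNT(*) FROM Verses WHERE BibleID = 38").fetchone()[0]
```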

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: Performance Issue on Large Table

Luuk

On 21-2-2020 02:24, Chip Beaulieu wrote:

> I have a table with 4.5 million records with full text indexing. Reads are very fast, but deleting / inserting  / updating takes on average about 50 seconds per record. I often do batches of 30,000 deletes / inserts at a time. The last batch took 10 hours to complete.
>
> Here are the details:
>
> Table Schema:
>
> CREATE TABLE `Verses` (
> `ID` integer PRIMARY KEY AUTOINCREMENT,
> `BibleID` integer NOT NULL,
> `Book` integer NOT NULL,
> `Chapter` integer NOT NULL,
> `VerseNumber` integer NOT NULL,
> `Scripture` text,
> `Comments` text,

I would recommend not storing `Comments` in the same table as
`Scripture`, or are you trying to re-write the Bible? 😉

That would avoid deleting unchanged `Scripture` data, only to insert
it again later (when only `Comments` actually changed?)


--
Luuk


Re: Performance Issue on Large Table

Jens Alfke-2
In reply to this post by Chip Beaulieu

> On Feb 20, 2020, at 5:24 PM, Chip Beaulieu <[hidden email]> wrote:
>
> CREATE INDEX `INX_Verses_BID` ON `Verses` (`BibleID`)
> CREATE INDEX `INX_Verses_BID_B_C` ON `Verses` (`BibleID`,`Book`,`Chapter`)
> CREATE INDEX `INX_Verses_BI_B_C_V` ON `Verses` (`BibleID`,`Book`,`Chapter`,`VerseNumber`)

I believe the first two of those are unnecessary, since their column-lists are prefixes of the third. So removing them would buy you some performance.

(Disclaimer: not a guru.)

But you’ve said most of the problem is with FTS. When you update the database, does all of the indexed text really change? Or do many of the 30,000 new records contain the same text as their deleted counterparts? If the latter, you could optimize by not touching those rows.

It’s also possible that dropping the FTS table before the update, and re-creating it afterwards, would be faster than incrementally changing it.
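A sketch of that drop-and-rebuild approach, under the assumption (implied by the triggers in the original post) that the FTS5 table carries an UNINDEXED ID column mirroring Verses.ID; in-memory DB for brevity:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Verses (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        BibleID INTEGER NOT NULL, Book INTEGER NOT NULL,
        Chapter INTEGER NOT NULL, VerseNumber INTEGER NOT NULL,
        Scripture TEXT
    );
    CREATE VIRTUAL TABLE scriptures USING fts5(ID UNINDEXED, Scripture);
""")

def bulk_replace(con, bible_id, rows):
    """Drop the FTS side, do the heavy work, then rebuild it in one scan."""
    with con:
        # In the real schema, drop the three triggers first; otherwise they
        # would fire against the now-missing scriptures table.
        con.execute("DROP TABLE scriptures")
        con.execute("DELETE FROM Verses WHERE BibleID = ?", (bible_id,))
        con.executemany(
            "INSERT INTO Verses (BibleID, Book, Chapter, VerseNumber, Scripture)"
            " VALUES (?, ?, ?, ?, ?)", rows)
        con.execute(
            "CREATE VIRTUAL TABLE scriptures USING fts5(ID UNINDEXED, Scripture)")
        con.execute(
            "INSERT INTO scriptures (ID, Scripture) "
            "SELECT ID, Scripture FROM Verses")

bulk_replace(con, 38,
             [(38, 1, 1, 1, "In the beginning God created the heavens and the earth.")])
hits = con.execute(
    "SELECT COUNT(*) FROM scriptures WHERE scriptures MATCH 'beginning'"
).fetchone()[0]
```

Whether this beats the incremental triggers would need measuring against the real 4.5-million-row table.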

—Jens

Re: Performance Issue on Large Table

Olaf Schmidt-2
In reply to this post by Chip Beaulieu
Am 21.02.2020 um 02:24 schrieb Chip Beaulieu:
> I have a table with 4.5 million records with full text indexing.
>  > Reads are very fast, but deleting / inserting / updating
> takes on average about 50 seconds per record.
> I often do batches of 30,000 deletes / inserts at a time.
> The last batch took 10 hours to complete.

These 30,000 deletes+inserts indicate that these were the
verse-records of a complete Bible-content, right?

And your 4.5 million records total indicate that you have about
140 different Bible-versions in your DB?

> I suspect it’s got something to do with the triggers more than the indexes.
Could be - but my guess is that your primary DB-changes are caused
by "deleting or inserting verses of whole Bible-contents".

So why not handle (hold) each bible in a separate table?

An amount of 140 tables in such a "BibleVersions.db" is not
uncommon and can be managed by SQLite with good performance.

Inserting "your next Bible-Version" into it would be fast,
because you're creating separate Tables for that job.
Same thing for "deleting Bible-versions" (just 2 "Drop Table"-calls).

You might argue that the performance will suffer - especially
when you're trying to find "differences between two versions"...

But that is only a matter of a Join (via VerseID) on
the two Tables you want to "get a Verse-Diff for".

I've played that through with two different "Bibles":
- Bible12.txt and Bible13.txt (each one about 5MB unzipped)
- downloadable here: http://www.gutenberg.org/files/30/old/

Each of the two text-files contains 31,102 "verse-records", which
can be parsed and inserted into a separate FTS-table in about:
- 0.5sec each (incl. text-parsing)

So, the total import-time for both (Bible12 and Bible13) was ~ 1sec.

That's far below your timings for "inserting a single verse-record".

FWIW, here is the SQLite-DB, my imports have produced (about 7MB)
http://vbRichClient.com/Downloads/BibleVersions.zip

It contains 4 Tables now (after importing the 2 text-files):
- Books_Bible12 (66 records) + FTS_Bible12 (31,102 records)
- Books_Bible13 (66 records) + FTS_Bible13 (31,102 records)

The schema-defs for those "Books_xxx" and "FTS_xxx" tables are:
"Create Table <Books_xxx> (BookID Integer Primary Key, Title Text)"
"Create Virtual Table <FTS_xxx> Using FTS5(Scripture, tokenize=porter)"

No extra-indexes are needed... (IMO)

So, whilst the Books-Table was defined "quite normally" -
the FTS-table only contains a single Field ("Scripture")...
which raises the question, how to efficiently join 2 FTS-tables
(e.g. to find "differences in the Scripture-fields").

Well, in the recent Bible-TextFiles (from project Gutenberg),
there's already a "verse- or scripture-id" which is unique:
01:001:001 (two digits for BookNr, 3 for ChapterNr and 3 for VerseNr)

This Text-based (unique) ID can be easily transformed
to an Int-Value of the form: 01001001 - and that's what
I did whilst inserting into the FTS-Tables (filling their RowID-fields)
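That transform is a one-liner in any language; a sketch in Python (`verse_rowid` is a hypothetical helper name, and the leading zeros naturally drop once the value is stored as an integer rowid):

```python
# Sketch of the text-ID -> integer-rowid transform described above.
# "01:001:001" (BookNr:ChapterNr:VerseNr) becomes the integer 1001001.
def verse_rowid(text_id: str) -> int:
    return int(text_id.replace(":", ""))

rid = verse_rowid("01:001:001")  # 1001001
```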

So, a later Join, to find scripture-differences between:
- FTS_Bible12 and FTS_Bible13 contents
becomes possible - e.g. this way:

Select T1.RowID, T1.Scripture, T2.Scripture
 From FTS_Bible12 T1 Join FTS_Bible13 T2 On T1.RowID=T2.RowID
Where T1.Scripture <> T2.Scripture

It will (on the given example-DB above) find 8 records,
where scripture-content differs (after about 50msec or so).


For those interested, below is the import-code I've used, which
will run as VBScript but also within VBA or VB5/6...
(it reads like "pseudo-code", so porting to other languages is easy):

'**depends on the COM-wrapper for SQLite (available at vbRichClient.com)
Function ImportInto(oMemDB, sTblNameBooks, sTblNameFTS, sBibleContent)
   On Error Resume Next

   With oMemDB
     .BeginTrans
     .Exec "Create Table " & sTblNameBooks & " (BookID Integer Primary Key, Title Text)"
     .Exec "Create Virtual Table " & sTblNameFTS & " Using FTS5(Scripture, tokenize=porter)"

     Dim sCmdB: sCmdB = "Insert Into " & sTblNameBooks & " (BookID, Title) Values(?,?)"
     Dim sCmdV: sCmdV = "Insert Into " & sTblNameFTS & " (RowID, Scripture) Values(?,?)"

     Dim L, VID, S
     For Each L In Split(sBibleContent, vbCrLf) 'enumerate the Lines (in L)

       Select Case InStr(L, " ") 'check the Pos of the first Space-Char in L

         Case 1  'we are probably still in a verse-block...
           If Len(VID) Then S = S & Mid(L, 11) 'Ok, concat that Verse-Part

         Case 11 'a potential Verse-Idx-Line
           If InStr(L, ":") = 3 Then VID = Left(L, 10): S = Mid(L, 12)

         Case 0  'probably an empty Line (as in the end of a Verse-Block)
           If Len(VID) Then
              .ExecCmd CStr(sCmdV), CLng(Replace(VID, ":", "")), S 'RowID + Scripture
              VID = "" 'reset the VID for the next verse
           End If

         Case 5 'it's a potential Book-Line
            If Left(L, 4) = "Book" And Len(L) < 40 Then
               .ExecCmd CStr(sCmdB), CLng(Mid(L, 6, 2)), Trim(Mid(L, 10)) 'BookID + Title
           End If
       End Select
     Next

     .CommitTrans
     If Err Then .RollbackTrans: ImportInto = Err.Description 'return the ErrStr
   End With
End Function

HTH

Olaf



Re: Performance Issue on Large Table

Richard Damon
On 2/23/20 8:31 AM, Olaf Schmidt wrote:

> Am 21.02.2020 um 02:24 schrieb Chip Beaulieu:
>> I have a table with 4.5 million records with full text indexing.
>> Reads are very fast, but deleting / inserting / updating
>> takes on average about 50 seconds per record. I often do batches of
>> 30,000 deletes / inserts at a time. The last batch took 10 hours to
>> complete.
>
> These 30,000 deletes+inserts indicates that these were the
> verse-records of a complete Bible-Content, right?
>
> And your 4.5Mio records total indicate, that you have about
> 140 different Bible-versions in your DB?
>
>> I suspect it’s got something to do with the triggers more than the
>> indexes.
> Could be - but my guess is, that your primary DB-changes are caused
> by "deleting or inserting verses of whole bible-contents".
>
> So why not handle (hold) each bible in a separate table?
>
> An amount of 140 tables in such a "BibleVersions.db" is not
> uncommon and can be managed by SQLite in a good performance.

I'm not sure that form of division would be good. One basic rule of
database normalization is that you don't break up data based on the
value of one of the fields, because a query can't select which table
to look data up in from a field value it retrieves.

--
Richard Damon


Re: Performance Issue on Large Table

Dominique Pellé
In reply to this post by Chip Beaulieu
Chip Beaulieu <[hidden email]> wrote:


> I also recently vacuumed the file. The issue really showed up after the FTS5 was
> setup on the table. I suspect it’s got something to do with the triggers more than
> the indexes. I am definitely open to suggestions. I’ve  not been able to find much
> info on the internet to optimize updates to tables with FTS.

If the performance issue is about updating the FTS index, then maybe
tweaking the 'automerge' feature can help. See:
https://www.sqlite.org/fts5.html#the_automerge_configuration_option

If you don't need NEAR or phrase queries, then perhaps setting
detail=column can help speed up re-indexing (at least
it reduces the index size). See:
https://www.sqlite.org/fts5.html#the_detail_option
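A sketch of the documented syntax for both knobs (in-memory table just to show the commands; for a bulk load, a common pattern per the FTS5 docs is disabling automerge up front and issuing one 'optimize' at the end):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# detail=column: smaller index, but phrase and NEAR queries are unsupported.
con.execute("CREATE VIRTUAL TABLE scriptures USING fts5(Scripture, detail=column)")
# Disable automerge during the bulk load (0 turns automatic merging off)...
con.execute("INSERT INTO scriptures(scriptures, rank) VALUES('automerge', 0)")
con.execute("INSERT INTO scriptures(Scripture) VALUES ('In the beginning')")
# ...then merge all segments once at the end with the 'optimize' command.
con.execute("INSERT INTO scriptures(scriptures) VALUES('optimize')")
n = con.execute(
    "SELECT COUNT(*) FROM scriptures WHERE scriptures MATCH 'beginning'"
).fetchone()[0]
```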

Regards
Dominique

Re: Performance Issue on Large Table

R Smith-2
In reply to this post by Richard Damon

On 2020/02/23 21:23, Richard Damon wrote:

> On 2/23/20 8:31 AM, Olaf Schmidt wrote:
>>
>> An amount of 140 tables in such a "BibleVersions.db" is not
>> uncommon and can be managed by SQLite in a good performance.
>
> I'm not sure that form of division would be good. One basic rule of
> database normalization is that you don't break-up data based on the
> value of one of the fields because you can't make the table to lookup
> data from be taken from a field you get in a query.
>

It is a far less bad design than having an AFTER DELETE trigger
doing trivial processing when you expect to delete 30K records at a time
due to a "well-designed" 1NF schema.

Firstly, good design guides are only guides.
Secondly, having said that, either embrace the full idealism or none of
it; doing some things the good way and others not hampers the
"goodness" of the overall system.
Lastly, we usually use ON DELETE CASCADE foreign-key relations to do
what you are doing with the trigger, but of course you are using FTS5
and I do not know if it can be done the FK way, or indeed how much
better the FK way will be considering the same indexes remain.


Note - I don't have a good solution here. If it were me, I would
immediately kill those triggers and see if it can be done smarter (or is
needed at all), or even if it can be done in code - OR - get rid of the
1NF schema and go with the suggested Bible-per-Table method and keep the
triggers. (This will help nothing if the deletes would still require
deleting 1 record (and its trigger) each through 30K tables; it's only
useful if it brings down the delete count.)

If you cannot live without the triggers, nor can bring down the
delete-count, maybe do it in code. One quick way I can think of is
using the ON DELETE to add those IDs to a "to_be_deleted" table and
simply delete everything from the second table with "WHERE ID IN
(SELECT ID FROM to_be_deleted)" before committing the transaction,
perhaps combined with dropping and re-making the indexes - but of course
this needs testing and might not be suitable for reasons I am unaware of
in your use case.
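That idea might look like the following sketch (a plain table stands in for the FTS5 table so the snippet is self-contained; the "to_be_deleted" name and trigger shape follow the suggestion above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Verses (ID INTEGER PRIMARY KEY, BibleID INTEGER, Scripture TEXT);
    -- plain stand-in for the FTS5 'scriptures' table
    CREATE TABLE scriptures (ID INTEGER, Scripture TEXT);
    CREATE TABLE to_be_deleted (ID INTEGER);

    -- the trigger now only records the ID instead of touching FTS per row
    CREATE TRIGGER after_verses_delete AFTER DELETE ON Verses
    BEGIN
        INSERT INTO to_be_deleted (ID) VALUES (old.ID);
    END;

    INSERT INTO Verses VALUES (1, 38, 'verse one'), (2, 38, 'verse two');
    INSERT INTO scriptures SELECT ID, Scripture FROM Verses;
""")

with con:
    con.execute("DELETE FROM Verses WHERE BibleID = 38")
    # one batched delete against the FTS side instead of one per deleted row
    con.execute(
        "DELETE FROM scriptures WHERE ID IN (SELECT ID FROM to_be_deleted)")
    con.execute("DELETE FROM to_be_deleted")

left = con.execute("SELECT COUNT(*) FROM scriptures").fetchone()[0]
```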


Either way, you are probably in for some lengthy processing.

Best of luck!
Ryan


Re: Performance Issue on Large Table

Simon Slavin-3
In reply to this post by Richard Damon
On 23 Feb 2020, at 7:23pm, Richard Damon <[hidden email]> wrote:

> I'm not sure that form of division would be good. One basic rule of database normalization is that you don't break-up data based on the value of one of the fields because you can't make the table to lookup data from be taken from a field you get in a query.

People who study the bible often want to look up different versions of the same verse at once, so they can compare the translations and wording.  This suggests that splitting up the text by different versions would be a bad idea.

I would rather figure out the underlying problem, but I made a decision some time ago not to understand the FTS parts of SQLite (because I never needed it for my own work).  So I'm sorry I can't help on the performance issues.  The timings just seem too long for a SQLite database which isn't using FTS.

Re: Performance Issue on Large Table

Richard Damon
In reply to this post by R Smith-2
On 2/23/20 3:06 PM, R.Smith wrote:

>
> On 2020/02/23 21:23, Richard Damon wrote:
>> On 2/23/20 8:31 AM, Olaf Schmidt wrote:
>>>
>>> An amount of 140 tables in such a "BibleVersions.db" is not
>>> uncommon and can be managed by SQLite in a good performance.
>>
>> I'm not sure that form of division would be good. One basic rule of
>> database normalization is that you don't break-up data based on the
>> value of one of the fields because you can't make the table to lookup
>> data from be taken from a field you get in a query.
>>
>
> It is far less of a worse design than having an AFTER DELETE trigger
> doing trivial processing when you expect to delete 30K records at a
> time due to a "well-designed" 1NF schema.
>
> Firstly, good design guides are only guides.
> Secondly, having said that, either embrace the full idealism or none
> of it, but doing some things the good way and others not, hampers the
> "goodness" of the overall system.
> Lastly, we usually use ON DELETE CASCADE Foreign Key relations to do
> what you are doing with the trigger, but of course you are using FTS5
> and I do not know if it can be done the FK way, or indeed how much the
> FK way will be better considering the same Indexes remain.
>
>
> Note - I don't have a good solution here, if it was me I would
> immediately kill those triggers, see if it can be done smarter (or is
> needed at all), or even if it can be done in code - OR - get rid of
> the 1NF schema and go with the suggested Bible-per-Table method and
> keep the triggers (This will help nothing if the deletes are going to
> require deleting 1 record (and its trigger) each through 30K tables.
> It's only useful if it brings down the delete count).
>
> If you cannot live without the triggers, nor can bring down the
> delete-count, maybe do it in code, or one quick way I can think of is
> using the ON DELETE to add those IDs to a "to_be_deleted" table and
> simply delete everything from the second table with "WHERE ID IN
> (SELECT ID FROM to_be_deleted)" before committing the transaction,
> perhaps combined with dropping and re-making the Indexes - but of
> course this needs testing and might not be suitable for reasons I am
> unaware of in your use case.
>
>
> Either way, you are probably in for some lengthy processing.
>
> Best of luck!
> Ryan

My thought is that a modification (making multiple tables, one per
version) that drops significant functionality (the version is now
hard-coded into the query as a table name, and can't be determined
within the query) is something not to do lightly. It also isn't that
likely to improve things a lot unless it allows the massive delete
cycle to be replaced with a drop table.

One big question is why the apparent wholesale deletion of a version
happens at all and, if it is really needed, whether a different schema
might support that operation better.

--
Richard Damon


Re: Performance Issue on Large Table

Olaf Schmidt-2
In reply to this post by Richard Damon
Am 23.02.2020 um 20:23 schrieb Richard Damon:

>> An amount of 140 tables in such a "BibleVersions.db" is not
>> uncommon and can be managed by SQLite in a good performance.
>
> I'm not sure that form of division would be good. One basic rule of
> database normalization is that you don't break-up data based on the
> value of one of the fields ...

Sure, but FTS-"Tables" are in a "special category" (IMO)...

At least, I'd separate them "by language", because it does
not really make sense to me to stuff e.g. the 31,102 verses
of a Japanese Bible-version into the same FTS-index
where one (or more) English versions already reside.

Between Japanese and English that's obvious already at the
"Unicode(Point)-level" - but even among "Latin-versions"
(e.g. English and German ones) there might be different
stemmer-algos to consider, to give optimal search-results.

Olaf


Re: Performance Issue on Large Table

cherie
I guess this slowness is because of the triggers (most probably), amplified
by the extra indexes.

Let's discuss indexes first:
if most queries use BibleID as a SARG along with Book & Chapter, then
INX_Verses_BID and INX_Verses_BID_B_C are not required. These are
unnecessarily adding slowness to write activities.
Keep minimal indexes.

*Indexes:*
CREATE INDEX `INX_Verses_BID` ON `Verses` (`BibleID`)
CREATE INDEX `INX_Verses_BID_B_C` ON `Verses` (`BibleID`,`Book`,`Chapter`)
CREATE INDEX `INX_Verses_BI_B_C_V` ON `Verses` (`BibleID`,`Book`,`Chapter`,`VerseNumber`)
CREATE INDEX `INX_Verses_B_C_V` ON `Verses` (`Book`,`Chapter`,`VerseNumber`)
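The two prefix-redundant indexes can simply be dropped; a self-contained check of what remains (stand-in schema, Python's sqlite3):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Verses (ID INTEGER PRIMARY KEY, BibleID INT, Book INT,
                         Chapter INT, VerseNumber INT);
    CREATE INDEX INX_Verses_BID ON Verses (BibleID);
    CREATE INDEX INX_Verses_BID_B_C ON Verses (BibleID, Book, Chapter);
    CREATE INDEX INX_Verses_BI_B_C_V ON Verses (BibleID, Book, Chapter, VerseNumber);
    CREATE INDEX INX_Verses_B_C_V ON Verses (Book, Chapter, VerseNumber);

    -- both are column-prefixes of INX_Verses_BI_B_C_V, so every write
    -- pays for them without adding lookup capability
    DROP INDEX INX_Verses_BID;
    DROP INDEX INX_Verses_BID_B_C;
""")
names = {r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='index'")}
```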


Now let's discuss triggers; these, I think, are the prime bottleneck for
your write activity. In fact, for such bulk activity Sybase-ASE
<http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc32300.1570/html/sqlug/X33484.htm>
has a separate command to "Disable Trigger". However, this is not
present in SQLite.

Triggers on table:

CREATE TRIGGER after_verses_delete AFTER DELETE ON Verses
BEGIN
    -- the scripture table is the FTS5 table
    DELETE FROM scriptures WHERE ID = old.ID;
END

CREATE TRIGGER after_verses_insert AFTER INSERT ON Verses
BEGIN
    -- the scripture table is the FTS5 table
    INSERT INTO scriptures (ID, Scripture) VALUES (new.ID, new.Scripture);
END

CREATE TRIGGER after_verses_update AFTER UPDATE OF Scripture ON Verses
BEGIN
    -- the scripture table is the FTS5 table
    UPDATE scriptures SET Scripture = new.Scripture WHERE ID = old.ID;
END



Thanks,
Vikas


On Mon, 24 Feb 2020 at 23:38, Olaf Schmidt <[hidden email]> wrote:

> Am 23.02.2020 um 20:23 schrieb Richard Damon:
>
> >> An amount of 140 tables in such a "BibleVersions.db" is not
> >> uncommon and can be managed by SQLite in a good performance.
> >
> > I'm not sure that form of division would be good. One basic rule of
> > database normalization is that you don't break-up data based on the
> > value of one of the fields ...
>
> Sure, but FTS-"Tables" are in a "special category" (IMO)...
>
> At least, I'd separate them "by language", because it does
> not really make sense to me, to stuff e.g. the 31102 verses
> of a japanese Bible-version into the same FTS-index,
> where already one (or more) english versions reside.
>
> Between japanese and english that's obvious already at the
> "Unicode(Point)-level" - but even among "Latin-versions"
> (e.g. english and german ones) there might be different
> stemmer-algos to consider, to give optimal search-results.
>
> Olaf
>