Unexplained table bloat


Unexplained table bloat

Ryan Mack
Hi list,

I'm trying to understand unexplained table bloat I found in what should be
a very small table in an old database file. If you dump/restore the
database, the problem goes away. If you duplicate the table, the problem
propagates. Schema:

CREATE TABLE copied(
  id_local INT,
  LUTFullString,
  LUTHash
);

Test Data:
85960605|,|0DE19F8AA100D2962FF22C60B3FA8940
85960465|,|A44615408E8B3E48A684C60CA2967031
85960451|,|33C8804354984071A740788AD308B279

In my test database file, each of these 3 rows is allocating about 40 4k
overflow pages for a total database size of about 500k. The full database
has about 4MB of actual data which takes up over 500MB on disk. If you want
to see/reproduce the problem you'll need my test database file which I've
uploaded here: https://mackman.net/200k-per-row.sqlite.zip (500kb
download). I don't know why it doesn't compress better; those extra
overflow pages must be getting filled with random garbage.

My uninformed guess is that there was a bug in the version of sqlite used at
database creation time that computed an incorrect overflow threshold and
stored each byte of the row in its own page. Since the problem goes away
with a dump/restore, I'm considering releasing a script to do that and
mitigate the problem for affected users. Before doing that I would like to
understand the problem better.
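
In case it helps, the dump/restore round-trip I'm considering is roughly the
following (a sketch; the output file names are just placeholders), using the
sqlite3 shell:

% sqlite3 200k-per-row.sqlite .dump > dump.sql
% sqlite3 rebuilt.sqlite < dump.sql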

Thanks for reading, Ryan

PS: Here's some output from my debug session showing that the 123 bytes of
data occupy 582k of space on disk in a freshly created table.

% sqlite3 200k-per-row.sqlite
SQLite version 3.28.0 2019-04-15 14:49:49
Enter ".help" for usage hints.
sqlite> .tables
copied

sqlite> .schema copied
CREATE TABLE copied(
  id_local INT,
  LUTFullString,
  LUTHash
);

sqlite> select * from copied;
85960605|,|0DE19F8AA100D2962FF22C60B3FA8940
85960465|,|A44615408E8B3E48A684C60CA2967031
85960451|,|33C8804354984071A740788AD308B279

sqlite> select sum(length(id_local) + length(lutfullstring) + length(luthash)) from copied;
123

sqlite> create table copied2 as select * from copied;
sqlite> SELECT name, SUM("pgsize") as sz FROM dbstat group by name;
copied|581632
copied2|581632
sqlite_master|4096

Re: Unexplained table bloat

David Raymond
Well something's weird anyway. When I open it with the command line tool it queries it just fine. I tried to open it in Python to check all the characters in the strings and got this:

Traceback (most recent call last):
  File "...\Testing4.py", line 8, in <module>
    cur.execute("select * from copied;")
sqlite3.OperationalError: Could not decode to UTF-8 column 'LUTFullString' with text ','


Which I suppose is just as likely to be my own problem, though.



Re: Unexplained table bloat

Simon Slavin-3

On 10 Jan 2020, at 2:06pm, David Raymond <[hidden email]> wrote:

> Well something's weird anyway. When I open it with the command line tool it queries it just fine

Did you run an integrity_check on the database? It looks from your posts as if it's corrupt.
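
For reference, that's a single pragma; on a healthy database it should just
report ok:

sqlite> PRAGMA integrity_check;
ok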

Re: Unexplained table bloat

Ryan Mack
In reply to this post by Ryan Mack
Thank you, I was unaware of the integrity_check pragma. It returns OK
for the database in question.

Re: Unexplained table bloat

Richard Hipp-3
In reply to this post by Ryan Mack
On 1/10/20, Ryan Mack <[hidden email]> wrote:
>
> I'm trying to understand unexplained table bloat

The sqlite3_analyzer command-line utility program (available  in the
"Precompiled binaries" bundles on the https://sqlite.org/download.html
page) is designed to help understand these kinds of problems.  Please
run that utility on the database and perhaps post the output here.
--
D. Richard Hipp
[hidden email]

Re: Unexplained table bloat

Ryan Mack
Accidentally sent my first reply directly; now responding to the list. I'm
now wondering if there's a lot of binary data hidden in each row. I'm
trying to figure out how to determine that if length() doesn't show
anything.


Prior response:

An excellent idea, thank you :-) .  The output is included below.

I am continuing to do my own debugging in parallel. I am now leaning
towards a new hypothesis that there may be a bug handling
strange/malformed unicode that is resulting in a large amount of
garbage data being stored into the record. I'm trying to figure out
the exact conditions of copying/loading data into the table that
causes the issue to propagate or disappear.


 % ./sqlite3_analyzer ~/Downloads/200k-per-row.sqlite
/** Disk-Space Utilization Report For /Users/nadia/Downloads/200k-per-row.sqlite

Page size in bytes................................ 4096
Pages in the whole file (measured)................ 143
Pages in the whole file (calculated).............. 143
Pages that store data............................. 143        100.0%
Pages on the freelist (per header)................ 0            0.0%
Pages on the freelist (calculated)................ 0            0.0%
Pages of auto-vacuum overhead..................... 0            0.0%
Number of tables in the database.................. 2
Number of indices................................. 0
Number of defined indices......................... 0
Number of implied indices......................... 0
Size of the file in bytes......................... 585728
Bytes of user payload stored...................... 571322      97.5%

*** Page counts for all tables with their indices *****************************

COPIED............................................ 142         99.30%
SQLITE_MASTER..................................... 1            0.70%

*** Page counts for all tables and indices separately *************************

COPIED............................................ 142         99.30%
SQLITE_MASTER..................................... 1            0.70%

*** All tables ****************************************************************

Percentage of total database...................... 100.0%
Number of entries................................. 4
Bytes of storage consumed......................... 585728
Bytes of payload.................................. 571412      97.6%
Bytes of metadata................................. 1284         0.22%
Average payload per entry......................... 142853.00
Average unused bytes per entry.................... 3393.00
Average metadata per entry........................ 321.00
Average fanout.................................... 3.00
Maximum payload per entry......................... 194280
Entries that use overflow......................... 3           75.0%
Index pages used.................................. 1
Primary pages used................................ 4
Overflow pages used............................... 138
Total pages used.................................. 143
Unused bytes on index pages....................... 4070        99.37%
Unused bytes on primary pages..................... 9502        58.0%
Unused bytes on overflow pages.................... 0            0.0%
Unused bytes on all pages......................... 13572        2.3%

*** Table COPIED **************************************************************

Percentage of total database......................  99.30%
Number of entries................................. 3
Bytes of storage consumed......................... 581632
Bytes of payload.................................. 571322      98.2%
Bytes of metadata................................. 1172         0.20%
B-tree depth...................................... 2
Average payload per entry......................... 190440.67
Average unused bytes per entry.................... 3226.00
Average metadata per entry........................ 390.67
Average fanout.................................... 3.00
Non-sequential pages.............................. 0            0.0%
Maximum payload per entry......................... 194280
Entries that use overflow......................... 3          100.0%
Index pages used.................................. 1
Primary pages used................................ 3
Overflow pages used............................... 138
Total pages used.................................. 142
Unused bytes on index pages....................... 4070        99.37%
Unused bytes on primary pages..................... 5608        45.6%
Unused bytes on overflow pages.................... 0            0.0%
Unused bytes on all pages......................... 9678         1.7%

*** Table SQLITE_MASTER *******************************************************

Percentage of total database......................   0.70%
Number of entries................................. 1
Bytes of storage consumed......................... 4096
Bytes of payload.................................. 90           2.2%
Bytes of metadata................................. 112          2.7%
B-tree depth...................................... 1
Average payload per entry......................... 90.00
Average unused bytes per entry.................... 3894.00
Average metadata per entry........................ 112.00
Maximum payload per entry......................... 90
Entries that use overflow......................... 0            0.0%
Primary pages used................................ 1
Overflow pages used............................... 0
Total pages used.................................. 1
Unused bytes on primary pages..................... 3894        95.1%
Unused bytes on overflow pages.................... 0
Unused bytes on all pages......................... 3894        95.1%

On Fri, Jan 10, 2020 at 9:45 AM Richard Hipp <[hidden email]> wrote:

>
> On 1/10/20, Ryan Mack <[hidden email]> wrote:
> >
> > I'm trying to understand unexplained table bloat
>
> The sqlite3_analyzer command-line utility program (available  in the
> "Precompiled binaries" bundles on the https://sqlite.org/download.html
> page) is designed to help understand these kinds of problems.  Please
> run that utility on the database and perhaps post the output here.
> --
> D. Richard Hipp
> [hidden email]

Re: Unexplained table bloat

Ryan Mack
OK, I think I've got a better sense now. Hex encoding the column shows
that there's actually a huge amount of data stored in there. For some
reason length() isn't revealing it even if the column type is blob.
Dumping and restoring the table is truncating the data.
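
A query along these lines makes the hidden bytes visible (a sketch; hex()
works on the underlying bytes, while length() stops at the first NUL
character):

sqlite> select rowid, length(LUTFullString), length(hex(LUTFullString))/2 from copied;
1|1|194238
2|1|183050
3|1|193908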


Re: Unexplained table bloat

Richard Hipp-3
On 1/10/20, Ryan Mack <[hidden email]> wrote:
> OK, I think I've got a better sense now. Hex encoding the column shows
> that there's actually a huge amount of data stored in there. For some
> reason length() isn't revealing it even if the column type is blob.
> Dumping and restoring the table is truncating the data.

length() on a BLOB should show the number of bytes in the BLOB.

length() on a string should show the number of *characters* (not
bytes) in the string up through but not including the first
zero-character.  It is possible to have additional content after the
first zero-character in a string, which length() will not tell you
about.
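
For example (a sketch, assuming a UTF-8 database): the cast below stores six
bytes into a text value, but length() only sees the single character before
the embedded zero-character:

sqlite> SELECT length(CAST(X'2C0041424344' AS TEXT));
1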

--
D. Richard Hipp
[hidden email]

Re: Unexplained table bloat

Simon Slavin-3
In reply to this post by Ryan Mack
On 10 Jan 2020, at 3:11pm, Ryan Mack <[hidden email]> wrote:

> OK, I think I've got a better sense now. Hex encoding the column shows that there's actually a huge amount of data stored in there. For some reason length() isn't revealing it even if the column type is blob. Dumping and restoring the table is truncating the data.

Is there a chance that some part of the software thinks that 0x00 is a terminator for BLOBs, and some other part ignores the 0x00 and abides strictly by the length? I've seen that before.

Re: Unexplained table bloat

Dominique Devienne
In reply to this post by Richard Hipp-3
On Fri, Jan 10, 2020 at 4:30 PM Richard Hipp <[hidden email]> wrote:
> length() on a BLOB should show the number of bytes in the BLOB.
>
> length() on a string should show the number of *characters* (not
> bytes) in the string up through but not including the first
> zero-character.  It is possible to have additional content after the
> first zero-character in a string, which length() will not tell you about.

Hi Richard,

There's no way at all, to know the length of a text column with embedded NULLs?

--DD

Re: Unexplained table bloat

Richard Hipp-3
On 1/10/20, Dominique Devienne <[hidden email]> wrote:
>
> There's no way at all, to know the length of a text column with embedded
> NULLs?
>

You can find the true length of a string in bytes from C-code using
the sqlite3_column_bytes() interface.  But I cannot, off-hand, think
of a way to do that from SQL.
--
D. Richard Hipp
[hidden email]

Re: Unexplained table bloat

Keith Medcalf
In reply to this post by Dominique Devienne

On Friday, 10 January, 2020 10:50, Dominique Devienne <[hidden email]>:

>On Fri, Jan 10, 2020 at 4:30 PM Richard Hipp <[hidden email]> wrote:

>> length() on a BLOB should show the number of bytes in the BLOB.

>> length() on a string should show the number of *characters* (not
>> bytes) in the string up through but not including the first
>> zero-character.  It is possible to have additional content after the
>> first zero-character in a string, which length() will not tell you
>> about.

> There's no way at all, to know the length of a text column with embedded
> NULLs?

C-Strings cannot have embedded nulls.  The very definition of a C-String is "a sequence of non-zero characters followed by a zero character".  So while you can store and retrieve invalid C-Strings in the database (as in use BIND and COLUMN_TEXT), "things" (meaning software) which expect a C-String to be a C-String will be confused by such improper usage of a C-String, and "things" which expect "text" fields to contain properly encoded C-Strings are likely to have brain-seizures.

You can, of course, cast the column as a blob (which IS allowed to have embedded nulls and DOES NOT have to have valid text encoding), and get the length of that bag-o-bytes (in bytes):

sqlite> select * from copied;
85960605|,|0DE19F8AA100D2962FF22C60B3FA8940
85960465|,|A44615408E8B3E48A684C60CA2967031
85960451|,|33C8804354984071A740788AD308B279
sqlite> select typeof(id_local),typeof(lutfullstring),typeof(luthash) from copied;
integer|text|text
integer|text|text
integer|text|text
sqlite> select length(cast(lutfullstring as blob)) from copied;
194238
183050
193908

The first select is indeed selecting the ENTIRE string that has been asked for.  It is then using printf "%s" to print it because it is SUPPOSED TO BE a null terminated string.  It is doing exactly what it has been told to do.  Similarly, the function LENGTH() on what is supposed to be a NULL terminated string returns the number of characters up to but not including the NULL terminator.

The real problem here is that a BLOB has been stored as if it were TEXT.  This is a failure of the application to properly sanitize its input.
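
One simple way to flag the affected rows (a sketch, reusing the cast-to-blob
trick above) is to compare the two lengths:

sqlite> select rowid from copied where length(lutfullstring) <> length(cast(lutfullstring as blob));
1
2
3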

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.




Re: Unexplained table bloat

Tim Streater-3
In reply to this post by Richard Hipp-3
On 10 Jan 2020, at 18:03, Richard Hipp <[hidden email]> wrote:

> On 1/10/20, Dominique Devienne <[hidden email]> wrote:
>>
>> There's no way at all, to know the length of a text column with embedded
>> NULLs?
>>
>
> You can find the true length of a string in bytes from C-code using
> the sqlite3_column_bytes() interface.  But I cannot, off-hand, think
> of a way to do that from SQL.

But if I store UTF-8 in a TEXT column, surely I'm allowed to include NULLs in that? They are after all valid UTF-8 characters.



--
Cheers  --  Tim

Re: Unexplained table bloat

Keith Medcalf

On Friday, 10 January, 2020 11:44, Tim Streater <[hidden email]> wrote:

>On 10 Jan 2020, at 18:03, Richard Hipp <[hidden email]> wrote:

>> On 1/10/20, Dominique Devienne <[hidden email]> wrote:
>>> There's no way at all, to know the length of a text column with
>>> embedded NULLs?

>> You can find the true length of a string in bytes from C-code using
>> the sqlite3_column_bytes() interface.  But I cannot, off-hand, think
>> of a way to do that from SQL.

>But if I store UTF-8 in a TEXT column, surely I'm allowed to include
>NULLs in that? They are after all valid UTF-8 characters.

No, they are not.  The "NUL character" in Modified UTF-8 is the two-byte sequence 0xC0 0x80.  This is specifically so that 0x00 can be used as a string terminator.  Validly encoded UTF-8 encoded text stored in a C String (0x00 terminated sequence of bytes) must not contain an embedded 0x00 byte since that byte terminates the sequence.

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.





Re: Unexplained table bloat

Richard Damon
In reply to this post by Tim Streater-3
On 1/10/20 1:43 PM, Tim Streater wrote:

> On 10 Jan 2020, at 18:03, Richard Hipp <[hidden email]> wrote:
>
>> On 1/10/20, Dominique Devienne <[hidden email]> wrote:
>>> There's no way at all, to know the length of a text column with embedded
>>> NULLs?
>>>
>> You can find the true length of a string in bytes from C-code using
>> the sqlite3_column_bytes() interface.  But I cannot, off-hand, think
>> of a way to do that from SQL.
> But if I store UTF-8 in a TEXT column, surely I'm allowed to include NULLs in that? They are after all valid UTF-8 characters.
>
>
>
As has been said, C Strings (which is what TEXT is assumed to hold) are
not allowed to include null characters, but are assumed to terminate at
the first 0 byte. There is a variant of UTF-8, called modified UTF-8 or
MUTF-8, which allows a null character to be encoded as C0 80. That
sequence does decode to 0 by the base UTF-8 rules but is disallowed by
the minimum-length encoding rule, so it can be used to embed nulls in
strings only if the system doesn't enforce the minimum-length encoding
rule (at least for this character).

I have no idea if that would work with SQLite though.

--
Richard Damon


Re: Unexplained table bloat

Tim Streater-3
In reply to this post by Keith Medcalf
On 10 Jan 2020, at 18:55, Keith Medcalf <[hidden email]> wrote:

> On Friday, 10 January, 2020 11:44, Tim Streater <[hidden email]> wrote:
>
>>On 10 Jan 2020, at 18:03, Richard Hipp <[hidden email]> wrote:
>
>>> On 1/10/20, Dominique Devienne <[hidden email]> wrote:
>>>> There's no way at all, to know the length of a text column with
>>>> embedded NULLs?
>
>>> You can find the true length of a string in bytes from C-code using
>>> the sqlite3_column_bytes() interface. But I cannot, off-hand, think
>>> of a way to do that from SQL.
>
>>But if I store UTF-8 in a TEXT column, surely I'm allowed to include
>>NULLs in that? They are after all valid UTF-8 characters.
>
> No, they are not. The "NUL character" in Modified UTF-8 is the two-byte
> sequence 0xC0 0x80. This is specifically so that 0x00 can be used as a string
> terminator. Validly encoded UTF-8 encoded text stored in a C String (0x00
> terminated sequence of bytes) must not contain an embedded 0x00 byte since
> that byte terminates the sequence.

Nice, but Wikipedia has that as a "Derivative" and "incompatible with the UTF-8 specification and may be rejected by conforming UTF-8 applications." It appears (though I may have missed it) not to be mentioned on this handy site either:

https://www.utf8-chartable.de/unicode-utf8-table.pl

I shall have to check what my preferred language's wrapper does.


--
Cheers  --  Tim

Re: Unexplained table bloat

Richard Damon
On 1/10/20 2:24 PM, Tim Streater wrote:

> On 10 Jan 2020, at 18:55, Keith Medcalf <[hidden email]> wrote:
>
>> On Friday, 10 January, 2020 11:44, Tim Streater <[hidden email]> wrote:
>>
>>> On 10 Jan 2020, at 18:03, Richard Hipp <[hidden email]> wrote:
>>>> On 1/10/20, Dominique Devienne <[hidden email]> wrote:
>>>>> There's no way at all, to know the length of a text column with
>>>>> embedded NULLs?
>>>> You can find the true length of a string in bytes from C-code using
>>>> the sqlite3_column_bytes() interface. But I cannot, off-hand, think
>>>> of a way to do that from SQL.
>>> But if I store UTF-8 in a TEXT column, surely I'm allowed to include
>>> NULLs in that? They are after all valid UTF-8 characters.
>> No, they are not. The "NUL character" in Modified UTF-8 is the two-byte
>> sequence 0xC0 0x80. This is specifically so that 0x00 can be used as a string
>> terminator. Validly encoded UTF-8 encoded text stored in a C String (0x00
>> terminated sequence of bytes) must not contain an embedded 0x00 byte since
>> that byte terminates the sequence.
> Nice, but Wikipedia has that as a "Derivative" and "incompatible with the UTF-8 specification and may be rejected by conforming UTF-8 applications." It appears (though I may have missed it) not to be mentioned on this handy site either:
>
> https://www.utf8-chartable.de/unicode-utf8-table.pl
>
> I shall have to check what my preferred language's wrapper does.
>
>
It is incompatible, in the sense that it uses an encoding that the UTF-8
specification says is invalid, and thus an application that fully performs
all the tests on valid data forms would reject it. In many ways it is a
compatible extension: if you skip the test that specifically makes the
form invalid and process it by the general rules of UTF-8, you get the
expected result.

C Strings do not allow 0 bytes in them. This would normally mean that
they do not allow the NUL character to be in a string. This extension
allows a character which would be interpreted as the NUL character to be
represented without needing a 0 byte.

It should be pointed out that most libraries won't be checking all the
strings that pass through them to see if they violate the rule, as that
is just adding a lot of overhead for very little benefit. It is really
expected that applications will do this sort of test at the borders,
when possibly untrusted strings come in, and know that if good strings
come in, the following processing will keep the strings valid.

--
Richard Damon


Re: Unexplained table bloat

Kevin-272
In reply to this post by Ryan Mack

Hi,

I checked the download mentioned in the original email. Not sure if the
table changed since the previous posts.


It seems LUTFullString has 3 BLOB rows, but LENGTH treats them as
strings.

I'm in Melbourne, Oz, so I added the UTC datetime.

regs, Kev

kevin@KCYDell:~$ cd /mnt/KCY/KCYDocs/
kevin@KCYDell:/mnt/KCY/KCYDocs$ sqlite3 /mnt/KCY/KCYDocs/200k-per-row.sqlite
SQLite version 3.30.1 2019-10-10 20:19:45
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE copied(
  id_local INT,
  LUTFullString,
  LUTHash
);
sqlite>
sqlite> select '1',rowid,LENGTH(HEX(LUTFullString))/2 from copied
   ...> UNION ALL
   ...> select '2',rowid,INSTR(HEX(LUTFullString),'00') from copied
   ...> UNION ALL
   ...> select '3',rowid,substr(HEX(LUTFullString),0,10) from copied
   ...> UNION ALL
   ...> select '4',rowid,INSTR(SUBSTR(HEX(LUTFullString),4),'00') from copied
   ...> UNION ALL
   ...> select '5',rowid,LENGTH(LUTFullString) from copied;
1|1|194238
1|2|183050
1|3|193908
2|1|3
2|2|3
2|3|3
3|1|2C0003007
3|2|2C0003007
3|3|2C0003007
4|1|1
4|2|1
4|3|1
5|1|1
5|2|1
5|3|1
sqlite> .quit
kevin@KCYDell:/mnt/KCY/KCYDocs$ date -u
Sat 11 Jan 23:39:43 UTC 2020
kevin@KCYDell:/mnt/KCY/KCYDocs$

Re: Unexplained table bloat

Tom Browder
Am I missing something? I thought every column has to have a type?

-Tom

Re: Unexplained table bloat

Keith Medcalf

On Sunday, 12 January, 2020 09:03, Tom Browder <[hidden email]> wrote:

>Am I missing something? I thought every column has to have a type?

Close, but no banana.  Every value has a type.  A column may contain multiple values (as in one per row).  Therefore each of those values has a type, which may be different from the type of the value in the same column on another row.  Columns have an affinity, which is a preference for the type of value stored in that column's rows, applied when converting the provided value to the specified affinity is possible, lossless, and reversible.

https://sqlite.org/datatype3.html
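
A quick illustration (a sketch using a throwaway table whose column has no
declared type, so values are stored with whatever type they arrive as):

sqlite> create table demo(v);
sqlite> insert into demo values (1), (1.5), ('one'), (X'01'), (NULL);
sqlite> select typeof(v) from demo;
integer
real
text
blob
null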

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.


