dbhash collision

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

dbhash collision

Nathan Wagner
I am working up code to calculate a hash over parts of the data in an sqlite
database, and as a start looked at the dbhash.c code found at

https://www.sqlite.org/src/artifact?ci=trunk&filename=tool/dbhash.c

I don't think the code as is works correctly, it is easy to construct
a hash collision between two different databases.

The code to hash a text value, merely prefixes the text string with a
'3', so if you have adjacent columns with '3's, they can hash the same.

A demonstration script follows:

#!/bin/sh

schema="create table hashc(a text, b text);"

sqlite3 hca.db "$schema"
sqlite3 hca.db "insert into hashc values ('3333', '3333')"

sqlite3 hcb.db "$schema"
sqlite3 hcb.db "insert into hashc values ('33333', '333')"

./dbhash hca.db hcb.db

rm hca.db hcb.db

On my system, I get:

granicus% sh hashc    
2d07f37961253ec993424d0997caf5fb5aede448 hca.db
2d07f37961253ec993424d0997caf5fb5aede448 hcb.db
granicus%

This obviously isn't what is wanted, even if it is unlikely to arise in
practice.  I believe a simple fix would be to, in addition to the type
identifer of '3' for a text column, to hash the string length as some
integer type.  I would be happy to provide example code if that would be
helpful, but it's really just a strlen() and then hashing a "3", the
result of the strlen as a four or eight byte integer, which can be done
similarly to how an integer column would be hashed, and then the string
content.

For the example above, the rows would hash as
("3", 4, "3333", "3", 4, "3333") and ("3", 5, "33333", "3", 3, "333")
whereas now, they hash as
("3", "3333", "3" "3333") and ("3", "33333", "3", "333")
which are the same.

Hashing the string length as well as the data makes the rows different,
and I think it would be sufficient to make different rows unique.

--
nw
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: dbhash collision

Richard Hipp-3
On 9/25/18, Nathan Wagner <[hidden email]> wrote:
> I am working up code to calculate a hash over parts of the data in an sqlite
> database, and as a start looked at the dbhash.c code found at

Consider instead using one of these:

    https://www.sqlite.org/src/file/ext/misc/sha1.c
    https://www.sqlite.org/src/file/ext/misc/shathree.c

The ".sha3sum" command in the command-line shell[1] uses shathree.c.
That hasher does include a length on both text and blob fields.  See
the description at

    https://www.sqlite.org/src/artifact/9e960ba5048?ln=552-575

[1] https://www.sqlite.org/cli.html

--
D. Richard Hipp
[hidden email]
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: dbhash collision

nomad
On Tue Sep 25, 2018 at 09:48:27AM -0400, Richard Hipp wrote:
> On 9/25/18, Nathan Wagner <[hidden email]> wrote:
> > I am working up code to calculate a hash over parts of the data in an sqlite
> > database, and as a start looked at the dbhash.c code found at
>
> Consider instead using one of these:
>
>     https://www.sqlite.org/src/file/ext/misc/sha1.c
>     https://www.sqlite.org/src/file/ext/misc/shathree.c

The first comment line in the sha1three.c file mentions SHA1 instead of
SHA3 (and has a grammar issue):

    This SQLite extension implements a functions that compute SHA1
    hashes.

Perhaps "implements functions" that "compute SHA1 hashes"?  It was
obviously copied from the sha1.c file which has the same grammar issue.

--
Mark Lawrence
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users