Announcing an extension for fast, constant-memory cardinality approximation
as I side-project I put together an implementation of Hyperminhash as an
extension for SQLite3. It provides fast, constant-memory cardinality
approximation, including the union- and intersection-set.
For example, querying an in-memory table of two million (INT, INT)-rows,
having no indexes, the query `SELECT COUNT(*) FROM (SELECT DISTINCT foo, bar
FROM foobar)` completes in close to five seconds. The equivalent query
`SELECT hyperminhash(foo, bar) FROM foobar` completes in less than 400ms and
deviates from the true result by less than 0.5%.
The Hyperminhash-functions shine when indexes can't be used and there are
lots of unique elements, requiring sqlite to build large temporary tables to
hold unique elements in traditional `DISTINCT`-queries. There is also
significant memory savings, as the data structure used by Hyperminhash is
just 32kb and never grows while counting unique elements.