SQLITE_MAX_COLUMN should be a runtime knob for Machine Learning

Dan Kaminsky
Sqlite3 has something of a normative declaration in its source code:

*
** This is the maximum number of
**
**    * Columns in a table
**    * Columns in an index
**    * Columns in a view
**    * Terms in the SET clause of an UPDATE statement
**    * Terms in the result set of a SELECT statement
**    * Terms in the GROUP BY or ORDER BY clauses of a SELECT statement.
**    * Terms in the VALUES clause of an INSERT statement
**
** The hard upper limit here is 32676.  Most database people will
** tell you that in a well-normalized database, you usually should
** not have more than a dozen or so columns in any table.  And if
** that is the case, there is no point in having more than a few
** dozen values in any of the other situations described above.
*/
#ifndef SQLITE_MAX_COLUMN
# define SQLITE_MAX_COLUMN 2000
#endif
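
It's worth noting that SQLite does already have a runtime limits
interface -- sqlite3_limit() with the SQLITE_LIMIT_COLUMN category -- but it
can only lower a limit below its compile-time ceiling, never raise it, which
is exactly why a compile-time change is required today.  A minimal sketch of
that behavior, using the getlimit()/setlimit() wrappers that Python 3.11+
layers over sqlite3_limit():

import sqlite3

con = sqlite3.connect(":memory:")

# Effective per-connection column limit; defaults to the compile-time
# SQLITE_MAX_COLUMN (2000 in stock builds).
print(con.getlimit(sqlite3.SQLITE_LIMIT_COLUMN))

# Lowering the limit at runtime is allowed...
con.setlimit(sqlite3.SQLITE_LIMIT_COLUMN, 100)

# ...but an attempt to raise it past the compile-time ceiling is
# silently truncated back to that ceiling.
con.setlimit(sqlite3.SQLITE_LIMIT_COLUMN, 4096)
print(con.getlimit(sqlite3.SQLITE_LIMIT_COLUMN))  # 2000 again, not 4096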

All software has constraints fundamental to its problem set and particular
implementation, and I would not mail you simply for deciding an
architecture on behalf of your users.  You do that every day.

However, conditions have changed since (I expect) this design was
specified.  One of the more useful and usable packages for Natural Language
Processing, Magnitude[1], leverages SQLite to efficiently handle the
real-valued but entirely abstract collections of numbers -- vector spaces --
that modern machine learning depends on.

Traditionally, word2vec and other language vector approaches have fit their
models into tens or hundreds of dimensions.  Newer work captures the
context of each word -- "Huntington" preceded by "Mr.", vs. "Huntington"
preceded by "Street" or followed by "Disease".  These phrase or sentence
embeddings (as they're called) can take quite a bit more space to represent
the necessary context.  They may take 3072 columns (in the case of
Magnitude's ELMo records[2]) or even 4096 columns (in the case of
InferSent's sentence embeddings[3]).

That exceeds SQLite's default limit of 2000 columns.

These spaces, though abstract, have become the most powerful way we know to
not merely represent, but actually discover relationships.  This is simply
a new data domain that is the *input* to what eventually becomes the
familiarly normalizable relational domain.  It's different, but it's not
"wrong".  If it's behavior you can support -- as your 32K hard limit
implies -- it would certainly be helpful for these user scenarios.

I spent quite a bit of time hacking large column support into a working
Python pipeline, and I'd prefer never to run that in production.
Converting this compile time variable into a runtime knob would be
appreciated.

--Dan

[1] https://github.com/plasticityai/magnitude
[2] https://github.com/plasticityai/magnitude/blob/master/ELMo.md
[3] https://github.com/plasticityai/magnitude/blob/master/ELMo.md

Re: SQLITE_MAX_COLUMN should be a runtime knob for Machine Learning

Simon Slavin-3
On 15 Jun 2019, at 2:42pm, Dan Kaminsky <[hidden email]> wrote:

[about the 32676 hard limit on the number of columns in a table]

> I spent quite a bit of time hacking large column support into a working
> Python pipeline, and I'd prefer never to run that in production.
> Converting this compile time variable into a runtime knob would be
> appreciated.

Something you should know about SQLite is that if it needs to find the 2001st column of a row it has to read the entire row from storage and walk through all 2000 columns before the one it wants.  So both storing and recalling data in wide tables are very inefficient.

To compensate for this problem, which occurs in many SQL engines, you can turn your wide table into a thin table (key/value pairs) by adding the column name to the key.  SQLite is extremely good at handling tall thin tables.

If you think about what you're really doing with your data, you'll find that although it's classically drawn out as a huge 2D grid, the data is closer to an entity-attribute-value model, and more suited to a tall table with a long key.

There's no reason why your library's users should have to know how SQLite is being used to store the data.
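
For illustration, a minimal sketch of that reshaping in Python (the schema
and names here are hypothetical, not anything Magnitude actually uses):

import sqlite3

con = sqlite3.connect(":memory:")

# Wide layout (one REAL column per dimension) hits SQLITE_MAX_COLUMN.
# Tall layout stores one row per (word, dimension) pair instead:
con.execute("""
    CREATE TABLE embedding (
        word TEXT    NOT NULL,
        dim  INTEGER NOT NULL,
        val  REAL    NOT NULL,
        PRIMARY KEY (word, dim)
    ) WITHOUT ROWID
""")

# A stand-in for a 3072-dimension vector.
vec = [0.12, -0.53, 0.98]
con.executemany(
    "INSERT INTO embedding VALUES (?, ?, ?)",
    [("huntington", i, v) for i, v in enumerate(vec)],
)

# Recover the whole vector for one word, in dimension order.
rows = con.execute(
    "SELECT val FROM embedding WHERE word = ? ORDER BY dim",
    ("huntington",),
).fetchall()
print([v for (v,) in rows])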

Re: SQLITE_MAX_COLUMN should be a runtime knob for Machine Learning

Dan Kaminsky
In reply to this post by Dan Kaminsky
As it happens, there is in fact a formal analytical technique that yields
appropriate numbers of "well normalized" dimensions for word embeddings:

https://github.com/ziyin-dl/word-embedding-dimensionality-selection

That's not to say that machine learning requirements should drive SQLite;
the ask is merely to expose at runtime a capacity the hard limit already
allows.

Also, the correct link for [3] above is here:

https://github.com/facebookresearch/InferSent

Re: SQLITE_MAX_COLUMN should be a runtime knob for Machine Learning

Amirouche Boubekki
In reply to this post by Simon Slavin-3
On Sat, 15 Jun 2019 at 20:29, Simon Slavin <[hidden email]> wrote:

> [...] If you think about what you're really doing with your data,
> you'll find that although it's classically drawn out as a huge 2D grid,
> the data is closer to an entity-attribute-value model, and more suited
> to a tall table with a long key.

Isn't this a use case for the LSM extension?

Thanks,


Amirouche

Re: SQLITE_MAX_COLUMN should be a runtime knob for Machine Learning

Jens Alfke-2
In reply to this post by Dan Kaminsky


> On Jun 15, 2019, at 6:42 AM, Dan Kaminsky <[hidden email]> wrote:
>
> One of the more useful and usable packages for Natural Language
> Processing, Magnitude[1], leverages SQLite to efficiently handle the real
> valued but entirely abstract collections of numbers -- vector spaces --
> that modern machine learning depends on.

I'm somewhat familiar with this, having recently written some code that stores ML results in SQLite databases.

As far as I know, there is no benefit to storing each element of such a vector as a separate column in SQLite. Instead, the entire vector should be stored as a single blob — for example, as a concatenation of 3072 IEEE floats in some fixed byte-order.

I say this because I don't know of any reason why a SQL query would need to access a specific vector coordinate, e.g. "SELECT * FROM vectors WHERE item_1722 > 0.5". The interesting operations on these vectors apply to all the coordinates in aggregate, like dot products or various distance metrics, and writing these out in SQL would be extremely verbose and extremely slow (because the query engine is an interpreter). Instead you'd implement a native function that uses CPU vector instructions to do the math quickly, and register that function with SQLite, passing it the vector as a blob.
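
To make the pattern concrete, here is a minimal sketch in Python rather than
native code; the table layout and function name are invented for the example,
and a real deployment would register a C function via sqlite3_create_function()
with SIMD inside.  Python's sqlite3 module exposes the same registration hook
as Connection.create_function():

import sqlite3
import struct

def pack(vec):
    # Serialize as little-endian IEEE 754 single-precision floats.
    return struct.pack(f"<{len(vec)}f", *vec)

def dot(a, b):
    # Unpack two equal-length packed-float blobs and take their dot product.
    x = struct.unpack(f"<{len(a) // 4}f", a)
    y = struct.unpack(f"<{len(b) // 4}f", b)
    return sum(p * q for p, q in zip(x, y))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE vectors (word TEXT PRIMARY KEY, vec BLOB NOT NULL)")
con.executemany(
    "INSERT INTO vectors VALUES (?, ?)",
    [("street", pack([0.1, 0.9, 0.0])), ("disease", pack([0.8, 0.1, 0.2]))],
)

# Register the callable as a 2-argument SQL scalar function.
# (deterministic=True needs Python 3.8+ / SQLite 3.8.3+.)
con.create_function("dot", 2, dot, deterministic=True)

# Rank stored vectors against a query vector entirely inside SQL.
q = pack([0.7, 0.2, 0.1])
for word, score in con.execute(
    "SELECT word, dot(vec, ?) AS score FROM vectors ORDER BY score DESC", (q,)
):
    print(word, round(score, 3))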

—Jens

Re: SQLITE_MAX_COLUMN should be a runtime knob for Machine Learning

Simon Slavin-3
In reply to this post by Amirouche Boubekki
On 16 Jun 2019, at 7:35pm, Amirouche Boubekki <[hidden email]> wrote:

> Isn't this a use case for the LSM extension?

LSM does seem a good fit for this, but I can find documentation for LSM
only in SQLite4, not SQLite3.

I did find this:

<https://news.ycombinator.com/item?id=15649107>

" I've had the chance to hear Richard Hipp talk about SQLite yesterday! He mentioned that the LSM tree storage engine is available as an extension to sqlite3. More specifically, he mentioned that he didn't really get the performance improvements he had hoped for, for insertion-heavy use cases. "

The rest of that page includes many details.  All hail _Hacker News_.

Re: SQLITE_MAX_COLUMN should be a runtime knob for Machine Learning

Simon Slavin-3
In reply to this post by Jens Alfke-2
On 16 Jun 2019, at 7:54pm, Jens Alfke <[hidden email]> wrote:

> As far as I know, there is no benefit to storing each element of such a vector as a separate column in SQLite. Instead, the entire vector should be stored as a single blob — for example, as a concatenation of 3072 IEEE floats in some fixed byte-order.

I think that settles the discussion.  If Jens is right and there's no point in extracting the contents of a single column, just use a single BLOB or TEXT column and store everything in it.

Re: SQLITE_MAX_COLUMN should be a runtime knob for Machine Learning

Dominique Devienne
In reply to this post by Simon Slavin-3
On Sun, Jun 16, 2019 at 9:02 PM Simon Slavin <[hidden email]> wrote:

> On 16 Jun 2019, at 7:35pm, Amirouche Boubekki <[hidden email]> wrote:
>
> > Isn't this a use case for the LSM extension?
>
> LSM does seem a good fit for this, but I can find documentation for LSM
> only in SQLite4, not SQLite3.
>

SQLite4's LSM backend is now an extension in SQLite3 called LSM1
https://www.sqlite.org/cgi/src/dir?ci=trunk&name=ext/lsm1&type=tree

It has been discussed on this list before.  Few people seem to be using
it, given the low volume of questions. --DD

Re: SQLITE_MAX_COLUMN should be a runtime knob for Machine Learning

Simon Slavin-3
On 17 Jun 2019, at 9:14am, Dominique Devienne <[hidden email]> wrote:

> SQLite4's LSM backend is now an extension in SQLite3 called LSM1
> https://www.sqlite.org/cgi/src/dir?ci=trunk&name=ext/lsm1&type=tree
>
> It has been discussed on this list before.  Few people seem to be using
> it, given the low volume of questions.

This may be because it's not documented.  As far as I can see, the only way to discover it is to read the source code.  Thanks for the pointer.

Simon.

Re: SQLITE_MAX_COLUMN should be a runtime knob for Machine Learning

Amirouche Boubekki
On Mon, 17 Jun 2019 at 14:36, Simon Slavin <[hidden email]> wrote:

> On 17 Jun 2019, at 9:14am, Dominique Devienne <[hidden email]> wrote:
>
> > SQLite4's LSM backend is now an extension in SQLite3 called LSM1
> > https://www.sqlite.org/cgi/src/dir?ci=trunk&name=ext/lsm1&type=tree
> >
> > It has been discussed on this list before.  Few people seem to be
> > using it, given the low volume of questions.
>
> This may be because it's not documented.


Also, it is difficult to compile.


> As far as I can see, the only way to discover it is to read the source
> code.  Thanks for the pointer.
>

The SQLite4 documentation for LSM is still accurate.