Deterministic random sampling via SELECT

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

Deterministic random sampling via SELECT

Merijn Verstraaten
I'm trying sample a (deterministically) random subset of a SELECT query, the most common solution on the internet to get random samples seems to be "SELECT * FROM (...) ORDER BY RANDOM() LIMIT n;" (this already has some question marks, since it relies on seeding RANDOM and knowing the RANDOM function is always evaluated in the same order every query run), but looking at the query plans this materialises the entire result set in memory for the sort (no surprise, I can't think of anyway that could work otherwise) which is rather undesirable if the sample size becomes large (i.e. several million rows).

Now, I already know different ways to implement a predicate function that can deterministically keep elements from a stream, however that relies on having a deterministic order for the stream. Which brings us to SQLite. I can easily write something like:

SELECT *
FROM (...)
WHERE my_predicate_fun()
ORDER BY column1, column2,...

And this *seems* to evaluate the where clause for each row in the order determined by ORDER BY, but this doesn't seem at all guaranteed by the SQL spec. So, is this behaviour documented/guaranteed somewhere? If not, is there some way to guarantee my where clause is evaluated for each row in a deterministic order?

In the simple case like above I could always just evaluate the query without the ORDER BY, step through the entire query, and evaluate the predicate in the application, but if I want to use this random selection as a subquery, then that doesn't work.

And while I'm asking questions: What if I want to do the above, but selecting groups of rows? So, sort of like:

SELECT *
FROM (...)
GROUP BY groupColumn
HAVING my_predicate_fun();

But where I want to return all rows in the group, rather than an aggregate.

Thanks in advance,
Merijn
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Deterministic random sampling via SELECT

David Raymond
"So, is this behaviour documented/guaranteed somewhere?"
Short version is: Nope. The engine is free to do whatever it wants as long as it gives the correct result in the end.

Consider a simple
select * from foo where predicate order by non_indexed_field;
Since there is no nice ordering of the data already it's going to have to sort it. In which case it's probably going to check the predicate against records on the way _in_ to the sorter rather than _after_ sorting. Think "I might only have to sort 5 things instead of 5 million, so let's filter out as much as we can as soon as possible." And since the same data could be on the disk with its pages in any order you could conceive a situation where the same data could be processed differently depending on the specific file that it's in. The same query could process the data in a different order before and after a vacuum for example.

Or maybe it does sort first then check. But that's an internal detail which could change every version. And all that is all considered fine, as the end result of the query will still be correct and in the order specified.

The closest thing you can do is limit a query to using a specific index during a query, but even then you're basically relying on implementation details, and not a guarantee.

Others will correct me if I'm wrong on that.



-----Original Message-----
From: sqlite-users <[hidden email]> On Behalf Of Merijn Verstraaten
Sent: Thursday, November 7, 2019 6:55 AM
To: [hidden email]
Subject: [sqlite] Deterministic random sampling via SELECT

I'm trying sample a (deterministically) random subset of a SELECT query, the most common solution on the internet to get random samples seems to be "SELECT * FROM (...) ORDER BY RANDOM() LIMIT n;" (this already has some question marks, since it relies on seeding RANDOM and knowing the RANDOM function is always evaluated in the same order every query run), but looking at the query plans this materialises the entire result set in memory for the sort (no surprise, I can't think of anyway that could work otherwise) which is rather undesirable if the sample size becomes large (i.e. several million rows).

Now, I already know different ways to implement a predicate function that can deterministically keep elements from a stream, however that relies on having a deterministic order for the stream. Which brings us to SQLite. I can easily write something like:

SELECT *
FROM (...)
WHERE my_predicate_fun()
ORDER BY column1, column2,...

And this *seems* to evaluate the where clause for each row in the order determined by ORDER BY, but this doesn't seem at all guaranteed by the SQL spec. So, is this behaviour documented/guaranteed somewhere? If not, is there some way to guarantee my where clause is evaluated for each row in a deterministic order?

In the simple case like above I could always just evaluate the query without the ORDER BY, step through the entire query, and evaluate the predicate in the application, but if I want to use this random selection as a subquery, then that doesn't work.

And while I'm asking questions: What if I want to do the above, but selecting groups of rows? So, sort of like:

SELECT *
FROM (...)
GROUP BY groupColumn
HAVING my_predicate_fun();

But where I want to return all rows in the group, rather than an aggregate.

Thanks in advance,
Merijn
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Deterministic random sampling via SELECT

Keith Medcalf
In reply to this post by Merijn Verstraaten
On Thursday, 7 November, 2019 04:55, Merijn Verstraaten <[hidden email]> wrote:

>I'm trying sample a (deterministically) random subset of a SELECT query,

You cannot have something that is both RANDOM and DETERMINISTIC at the same time.

>the most common solution on the internet to get random samples seems to
>be "SELECT * FROM (...) ORDER BY RANDOM() LIMIT n;" (this already has
>some question marks, since it relies on seeding RANDOM and knowing the
>RANDOM function is always evaluated in the same order every query run),

This must be from StackOverflow because it is a terrible idea.

>but looking at the query plans this materialises the entire result set in
>memory for the sort (no surprise, I can't think of anyway that could work
>otherwise) which is rather undesirable if the sample size becomes large
>(i.e. several million rows).

Exactly.

>Now, I already know different ways to implement a predicate function that
>can deterministically keep elements from a stream, however that relies on
>having a deterministic order for the stream. Which brings us to SQLite. I
>can easily write something like:

>SELECT *
>FROM (...)
>WHERE my_predicate_fun()
>ORDER BY column1, column2,...

>And this *seems* to evaluate the where clause for each row in the order
>determined by ORDER BY, but this doesn't seem at all guaranteed by the
>SQL spec.

That is correct.  ORDER BY is executed as the LAST STEP in returning the query results.  However, based on availability of indexes, the optimizer my decide to utilize those indexes in order to obtain results in already sorted or partially sorted order.  The various WHERE clause terms will be checked at the appropriate place to minimize the number of candidates at each step.  Since my_predicate_fun() is not dependent on any item it will be executed at the last step on each candidate (that has passed all the other where clause constraints) before that candidate is passed to the sorter.

>So, is this behaviour documented/guaranteed somewhere? If not,
>is there some way to guarantee my where clause is evaluated for each row
>in a deterministic order?

Yes, and no.  For a given (constant) query of a given set of (constant) data with a given set of (constant) indexes on that data, the (same version) optimizer will always arrive at the same method of answering your query.  If you change the query or the data or the indexes available or the version of the optimizer, then the most efficient plan for obtaining the results for which you asked will be used, which will thereafter be constant.  That is, if the input is constant, and the software is constant, then the output and the how the software arrives at that output, will be constant.

See https://www.sqlite.org/queryplanner-ng.html#qpstab

You might want to read the whole page.

>In the simple case like above I could always just evaluate the query
>without the ORDER BY, step through the entire query, and evaluate the
>predicate in the application, but if I want to use this random selection
>as a subquery, then that doesn't work.

You have changed the query.  Each query is optimized individually.

>And while I'm asking questions: What if I want to do the above, but
>selecting groups of rows? So, sort of like:

>SELECT *
>FROM (...)
>GROUP BY groupColumn
>HAVING my_predicate_fun();

>But where I want to return all rows in the group, rather than an
>aggregate.

Then you must use the group result to select the data.  Use of GROUP BY/HAVING implies returning GROUPs, not rows.  You can of course use the selected groups to find the matching members of the group and return those.  Example:

select <stuff>
  from <places>
  join (select a, b
          from <places>
         where <conditions>
      group by a, b
        having my_predicate_function()) as g
    on <place>.a = g.a AND <place>.b = g.b
 where <conditions>
;

If you wish to generate a "static random sample" of the data, then why do you not just do it once and record the result, thereafter using that keyset to re-retrieve the same data whenever you need it?  Eg:

create table KeySet as
select a.RowID as aRowID, b.RowID as bRowID, c.RowID as cRowID
  from a, b, c
 where <conditions>;

and thereafter

select <stuff>
  from KeySet
  join a on a.rowid == KeySet.aRowID
  join b on b.rowid == KeySet.bRowID
  join c on c.rowid == KeySet.cRowID

--
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Deterministic random sampling via SELECT

Simon Slavin-3
In reply to this post by David Raymond
On 7 Nov 2019, at 1:56pm, David Raymond <[hidden email]> wrote:

> Others will correct me if I'm wrong on that.

No correction, but I wanted to add something.

According to the theory of how SQL (not just SQLite, SQL) works, tables have no order.  You can, in theory, query a table of 100 rows with

SELECT a,b FROM MyTable LIMIT 5

ten times and get ten different answers, including different rows and/or the same rows in different orders.  Given some of the text in the post that started this thread, I just wanted to make sure this was understood.

In practise I have never seen a SQL engine which does this.  Each SQL implementation seems to return the same result every time you repeat the same query.  Though different SQL implementations can return different results.

You'd have thought that at least one server/client system would return five rows which happened to be in the cache, but I've never seen this.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Deterministic random sampling via SELECT

David Raymond
Along those lines SQLite includes the reverse_unordered_selects pragma
https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects
which will flip the order it sends rows in queries that don't explicitly specify an ordering. It's there to assist you in finding spots in your code where you might be relying on implicit ordering when you really shouldn't be.

Also available as a compile time option: SQLITE_REVERSE_UNORDERED_SELECTS


-----Original Message-----
From: sqlite-users <[hidden email]> On Behalf Of Simon Slavin
Sent: Thursday, November 7, 2019 12:16 PM
To: SQLite mailing list <[hidden email]>
Subject: Re: [sqlite] Deterministic random sampling via SELECT

On 7 Nov 2019, at 1:56pm, David Raymond <[hidden email]> wrote:

> Others will correct me if I'm wrong on that.

No correction, but I wanted to add something.

According to the theory of how SQL (not just SQLite, SQL) works, tables have no order.  You can, in theory, query a table of 100 rows with

SELECT a,b FROM MyTable LIMIT 5

ten times and get ten different answers, including different rows and/or the same rows in different orders.  Given some of the text in the post that started this thread, I just wanted to make sure this was understood.

In practise I have never seen a SQL engine which does this.  Each SQL implementation seems to return the same result every time you repeat the same query.  Though different SQL implementations can return different results.

You'd have thought that at least one server/client system would return five rows which happened to be in the cache, but I've never seen this.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Deterministic random sampling via SELECT

Merijn Verstraaten

> On 7 Nov 2019, at 19:16, David Raymond <[hidden email]> wrote:
>
> Along those lines SQLite includes the reverse_unordered_selects pragma
> https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects
> which will flip the order it sends rows in queries that don't explicitly specify an ordering. It's there to assist you in finding spots in your code where you might be relying on implicit ordering when you really shouldn't be.

Like the rest of this threads, this is just pointing out why the things in my initial email don't work, but I already knew that. Which is why I asked for help to see if there is a way to do what I want that *does* work. I don't care particularly about the details of "can I control the order the condition is evaluated", it's just that all reasonable ways to sample large streams that I know would require a deterministic order.

If someone has a different/better idea on how to return just a random sample from a query in a repeatable way, I'm all ears.

So far the only suggestion was "use some non-deterministic random sampling method and store the result", but since my samples are large and I have lots of them, this would balloon my storage by >100x and I don't have the available storage to make that work.

- Merijn

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

signature.asc (891 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Deterministic random sampling via SELECT

Donald Griggs
>
> Regarding: "So far the only suggestion was "use some non-deterministic
> random sampling method and store the result", but since my samples are
> large and I have lots of them, this would balloon my storage by >100x and I
> don't have the available storage to make that work."


But Keith Medcalf was suggesting your table KeySet store only the *keys*,
not the data itself, right?
Are you saying storing the ROWIDs would still be prohibitively expensive?

>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Deterministic random sampling via SELECT

Chris Peachment-4
In reply to this post by Merijn Verstraaten
In the very old days before computers were common, a random number
table appeared at the back of many statistical texts. This was used
to select a series of random numbers which would then be used as
look-up indices into some other data set.

You could do the same:

  1. generate a list of pseudo-random numbers, using a pre-defined
     seed value, over the range 1 .. count(*) of records in table,

  2. use that list as record id values to select the desired subset
     of the data in the table.

This would be done in two separate operations, possibly with a
storage of the generated numbers in a separate table which could
be used in the query of the main data.

Since it is a pseudo-random number series, you can repeat it as
often as needed using the same seed value.

Chris


On Thu, 7 Nov 2019, at 15:15, Merijn Verstraaten wrote:

>
> > On 7 Nov 2019, at 19:16, David Raymond <[hidden email]> wrote:
> >
> > Along those lines SQLite includes the reverse_unordered_selects pragma
> > https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects
> > which will flip the order it sends rows in queries that don't explicitly specify an ordering. It's there to assist you in finding spots in your code where you might be relying on implicit ordering when you really shouldn't be.
>
> Like the rest of this threads, this is just pointing out why the things
> in my initial email don't work, but I already knew that. Which is why I
> asked for help to see if there is a way to do what I want that *does*
> work. I don't care particularly about the details of "can I control the
> order the condition is evaluated", it's just that all reasonable ways
> to sample large streams that I know would require a deterministic order.
>
> If someone has a different/better idea on how to return just a random
> sample from a query in a repeatable way, I'm all ears.
>
> So far the only suggestion was "use some non-deterministic random
> sampling method and store the result", but since my samples are large
> and I have lots of them, this would balloon my storage by >100x and I
> don't have the available storage to make that work.
>
> - Merijn
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
> Attachments:
> * signature.asc
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Deterministic random sampling via SELECT

David Raymond
In reply to this post by Merijn Verstraaten
We went off on a tangent, apologies.

If you have contiguous integer primary keys, you could randomly sample that range of integers, then pull the records with those keys.

Or in your external language of choice, sample the integers from 1 to the record count deterministically, select ordered by the primary key, and take the ones with the sampled offsets. (Stepping through 1 query and NOT doing a bunch of ...order by pk limit 1 offset n... queries)

Making something quick in Python I might do something like:


import random
import sqlite3

conn = sqlite3.connect(dbFile, isolation_level = None)
cur = conn.cursor()
cur.execute("select count(*) from foo;")
numRecords = cur.fetchone()[0]
sampleSize = 10
random.seed(5) #Your deterministic seed here
SampleOffsets = random.sample(range(1, numRecords + 1), sampleSize)
SampleOffsets.sort()
cur.execute("select * from foo order by primary_key;")
currentOffset = 0
for selectedOffset in SampleOffsets:
    for _ in range(selectedOffset - currentOffset - 1):
        cur.fetchone()
    nextSampleRecord = cur.fetchone()
    currentOffset = selectedOffset
    doSomethingWithSample(nextSampleRecord)



-----Original Message-----
From: sqlite-users <[hidden email]> On Behalf Of Merijn Verstraaten
Sent: Thursday, November 7, 2019 2:16 PM
To: SQLite mailing list <[hidden email]>
Subject: Re: [sqlite] Deterministic random sampling via SELECT


> On 7 Nov 2019, at 19:16, David Raymond <[hidden email]> wrote:
>
> Along those lines SQLite includes the reverse_unordered_selects pragma
> https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects
> which will flip the order it sends rows in queries that don't explicitly specify an ordering. It's there to assist you in finding spots in your code where you might be relying on implicit ordering when you really shouldn't be.

Like the rest of this threads, this is just pointing out why the things in my initial email don't work, but I already knew that. Which is why I asked for help to see if there is a way to do what I want that *does* work. I don't care particularly about the details of "can I control the order the condition is evaluated", it's just that all reasonable ways to sample large streams that I know would require a deterministic order.

If someone has a different/better idea on how to return just a random sample from a query in a repeatable way, I'm all ears.

So far the only suggestion was "use some non-deterministic random sampling method and store the result", but since my samples are large and I have lots of them, this would balloon my storage by >100x and I don't have the available storage to make that work.

- Merijn
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Deterministic random sampling via SELECT

Richard Damon
In reply to this post by Merijn Verstraaten
> On Nov 7, 2019, at 2:15 PM, Merijn Verstraaten <[hidden email]> wrote:
>
> 
>> On 7 Nov 2019, at 19:16, David Raymond <[hidden email]> wrote:
>>
>> Along those lines SQLite includes the reverse_unordered_selects pragma
>> https://www.sqlite.org/pragma.html#pragma_reverse_unordered_selects
>> which will flip the order it sends rows in queries that don't explicitly specify an ordering. It's there to assist you in finding spots in your code where you might be relying on implicit ordering when you really shouldn't be.
>
> Like the rest of this threads, this is just pointing out why the things in my initial email don't work, but I already knew that. Which is why I asked for help to see if there is a way to do what I want that *does* work. I don't care particularly about the details of "can I control the order the condition is evaluated", it's just that all reasonable ways to sample large streams that I know would require a deterministic order.
>
> If someone has a different/better idea on how to return just a random sample from a query in a repeatable way, I'm all ears.
>
> So far the only suggestion was "use some non-deterministic random sampling method and store the result", but since my samples are large and I have lots of them, this would balloon my storage by >100x and I don't have the available storage to make that work.
>
> - Merijn
>

One thought would be to generate a ‘hash’ from part of the record, maybe the record ID, and select records based on that value. The simplest would be something like id%100 == 0 would get you 1% of the records. That admittedly isn’t that random.

Put the ID through a linear congruential generator, something like

mod(a * Id + b, c) % 100 == 0

And you will pretty well scramble the selection
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Deterministic random sampling via SELECT

Doug Currie-2
On Thu, Nov 7, 2019 at 4:23 PM Richard Damon <[hidden email]>
wrote:

>
> One thought would be to generate a ‘hash’ from part of the record, maybe
> the record ID, and select records based on that value. The simplest would
> be something like id%100 == 0 would get you 1% of the records. That
> admittedly isn’t that random.
>
> Put the ID through a linear congruential generator, something like
>
> mod(a * Id + b, c) % 100 == 0
>
> And you will pretty well scramble the selection
>
>
Yes, and if a, b, and c come from a randomization table, they can be
modified to obtain a different pseudo-random set.

e
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Deterministic random sampling via SELECT

Richard Damon
On 11/7/19 5:13 PM, Doug Currie wrote:

> On Thu, Nov 7, 2019 at 4:23 PM Richard Damon <[hidden email]>
> wrote:
>
>> One thought would be to generate a ‘hash’ from part of the record, maybe
>> the record ID, and select records based on that value. The simplest would
>> be something like id%100 == 0 would get you 1% of the records. That
>> admittedly isn’t that random.
>>
>> Put the ID through a linear congruential generator, something like
>>
>> mod(a * Id + b, c) % 100 == 0
>>
>> And you will pretty well scramble the selection
>>
>>
> Yes, and if a, b, and c come from a randomization table, they can be
> modified to obtain a different pseudo-random set.
>
> e
a, b, and c should be chosen to give a reasonable random number
generator (there are tables of good values), not be arbitrary values.
The 100 and the 0 can be changed (or use some other test on the random
number) to get different selections.

--
Richard Damon

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Deterministic random sampling via SELECT

Merijn Verstraaten
In reply to this post by Chris Peachment-4
On 7 Nov 2019, at 20:47, Chris Peachment <[hidden email]> wrote:
>  1. generate a list of pseudo-random numbers, using a pre-defined
>     seed value, over the range 1 .. count(*) of records in table,
>
>  2. use that list as record id values to select the desired subset
>     of the data in the table.
>
> This would be done in two separate operations, possibly with a
> storage of the generated numbers in a separate table which could
> be used in the query of the main data.

Yeah, this and some of the other ideas were some things I considered as fallback ideas if things weren't possible within the query, although it does complicate things a bit.

I actually just had another idea: How is the behaviour of window functions defined? i.e. if there is an ORDER BY clause are rows added/removed from the window in the order of the order by clause, or is the behaviour of row_number/rank/dense_rank special cased and only those functions guarantee the order?

Because if window functions do follow there own order by, you could easily recover a deterministic evaluation order via the window specification.

If not, I guess I'll have to give up and do it the "hard way".

- Merijn
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users