Efficient ways to maintaining order rank / row_number() in a rather large set ?

14 messages
Efficient ways to maintaining order rank / row_number() in a rather large set ?

Eric Grange-3
Hi,

I have a problem where I have a large set of (key, value) which I want to
sort by value, and then store in a table having (rank, key, value) fields,
so that for a given key I can quickly find the rank, or for a given rank
range, I can quickly list the keys & values.

Since there is no ROW_NUMBER() function, but there is an autoincrement
feature, and the ranks are numbered 1, 2, 3, etc., the strategy I have
been using is to create a RANKED table like

   CREATE TABLE RANKED (
      RANK INTEGER PRIMARY KEY AUTOINCREMENT,
      KEY INTEGER,
      VALUE FLOAT
   )

(+ an index for the key)

and then I fill that table with something like

   INSERT INTO RANKED (KEY, VALUE)
      SELECT key, value
      FROM ...something rather complex and big...
      ORDER BY value DESC

This works well enough, but as the number of values to be ranked increases,
it feels wasteful to delete everything and then re-insert everything just
to adjust the RANK column. I am also running into memory issues, as the
ORDER BY requires a temporary b-tree which runs into the gigabyte range in
some instances.

I have ways to maintain the KEY and VALUE fields individually and
incrementally, but the approaches I have tried to maintain the RANK with
UPDATE queries ran much slower than deleting and recreating everything,
though this could just be bad implementations on my part.

Are there any other strategies I could use that could update just the RANK
field and mitigate the temporary B-tree size?

Eric
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

Re: Efficient ways to maintaining order rank / row_number() in a rather large set ?

Simon Slavin-3
On 9 Jan 2018, at 9:50am, Eric Grange <[hidden email]> wrote:

> then I fill that table with something like
>
>   INSERT INTO RANKED
>      SELECT key, value
>      FROM ...something rather complex and big...
>      ORDER BY value desc
>
> This works well enough, but as the amount of values to be ranked increases,
> this feels wasteful to delete everything and then re-insert
> everything just to adjust the RANK column, also I am running into memory
> issues as the ORDER BY requires a temporary b-tree
> which runs into the gigabyte range in some instances.

The ORDER BY clause serves no useful purpose here.  Leave it out.  Do your sorting when you query the table.

Simon.

Re: Efficient ways to maintaining order rank / row_number() in a rather large set ?

Eric Grange
You mean using limit / offset instead ?

Even with an index on the VALUE column, queries like

    select * from ranked
    order by value
    limit 10 offset xxx

become very slow when xxx is large, while

    select * from ranked
    where rank between xxx and xxx+9
    order by rank

is fast regardless of the value of xxx.

Similarly, finding the rank of a given key becomes sluggish for keys that
are not near the top, without a RANK column to look it up by.

So the ORDER BY is used to control the insertion order, so that the RANK
autoincrement primary key ends up in natural rank order.




Re: Efficient ways to maintaining order rank / row_number() in a rather large set ?

Andy Ling-2
In reply to this post by Eric Grange-3
Isn't this really a repeat of this thread...

http://sqlite.1065341.n5.nabble.com/how-into-insert-row-into-middle-of-table-with-integer-primary-key-td98629.html

The result of which was, don't try and use the table row order to sort your data. Add a column that defines your sort order
and do the sorting on output, not input.

I rather liked Jens's solution of using a string to define the sort order (top of the second page of the thread).
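Jens's actual code is on the linked page; as a sketch of the general idea (sometimes called "fractional indexing"), to insert between two rows you generate a string that sorts strictly between their keys, so no other row ever needs updating. The `key_between` function and the a-z digit set below are my own illustrative choices, not code from that thread:

```python
DIGITS = "abcdefghijklmnopqrstuvwxyz"

def key_between(lo, hi):
    """Return a lowercase string k with lo < k < hi; None means unbounded.
    Generated keys never end in 'a', so there is always room to insert
    between two previously generated keys."""
    lo = lo or ""
    prefix, i = "", 0
    while hi is not None and i < len(lo) and i < len(hi) and lo[i] == hi[i]:
        prefix += lo[i]
        i += 1
    lo_d = DIGITS.index(lo[i]) if i < len(lo) else -1
    hi_d = DIGITS.index(hi[i]) if hi is not None and i < len(hi) else len(DIGITS)
    if hi_d - lo_d > 1:                       # a digit fits strictly between
        mid = (lo_d + hi_d + 1) // 2
        if mid > 0:
            return prefix + DIGITS[mid]
        return prefix + "a" + key_between("", None)    # only 'a' fits: go deeper
    if i < len(lo):                           # adjacent digits: extend lo's side
        return prefix + lo[i] + key_between(lo[i + 1:], None)
    return prefix + "a" + key_between("", hi[i + 1:])  # lo exhausted, hi[i] == 'a'

# e.g.: a first key, one after it, then one in between the two
first = key_between(None, None)     # 'n'
after = key_between(first, None)
middle = key_between(first, after)
```

A new row's sort key is computed from its two neighbours' keys and stored in an indexed TEXT column; ordering queries then just ORDER BY that column.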

Andy Ling



Re: Efficient ways to maintaining order rank / row_number() in a rather large set ?

Simon Slavin-3
In reply to this post by Eric Grange


On 9 Jan 2018, at 10:26am, Eric Grange <[hidden email]> wrote:

> You mean using limit / offset instead ?
>
> Even with an index on the VALUE column, queries like
>
>    select * from ranked
>    order by value
>    limit 10 offset xxx
>
> become very slow when xxx is great

Yeah, to do OFFSET SQLite has to do the whole SELECT; it just throws away the first few entries without reporting them to you.

Do you actually have a need to find the 4512th rank ?  I was expecting you to just need to list the rows in rank order, meaning you would never need OFFSET.

Okay, if you do need to do things like find the 4512th rank, and you want to be able to insert the values without taking the great amount of memory needed for the ORDER BY:

1) declare an index on the VALUE column (fastest to do the INSERT first then create the index, but might not matter for your data)
2) do the INSERT without ORDER BY
3) In your programming language, number the rows by doing

        SELECT KEY FROM RANKED ORDER BY VALUE

and for each row returned doing an UPDATE for the RANK value.

It’ll be slower, but it won’t take up the huge chunk of memory needed to keep that index in memory.
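For what it's worth, that loop might look like this in Python's sqlite3 (table names follow the thread; the negation step is my own addition, needed because RANK is the primary key, so assigning final ranks directly could collide with the autoincrement values already present):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE RANKED (
        RANK INTEGER PRIMARY KEY AUTOINCREMENT,
        KEY INTEGER,
        VALUE FLOAT
    );
    CREATE INDEX RANKED_KEY ON RANKED (KEY);
    -- insert WITHOUT the big ORDER BY; ranks are in arbitrary order for now
    INSERT INTO RANKED (KEY, VALUE) VALUES (10, 3.5), (20, 9.1), (30, 7.2);
""")

with con:  # one transaction, so the many small UPDATEs stay cheap
    # park every rank in the negative range first, to avoid transient
    # UNIQUE collisions while reassigning the primary key
    con.execute("UPDATE RANKED SET RANK = -RANK")
    rows = con.execute("SELECT KEY FROM RANKED ORDER BY VALUE DESC").fetchall()
    for rank, (key,) in enumerate(rows, start=1):
        con.execute("UPDATE RANKED SET RANK = ? WHERE KEY = ?", (rank, key))

ranks = con.execute("SELECT RANK, KEY FROM RANKED ORDER BY RANK").fetchall()
# highest value (key 20) ends up at rank 1
```

The sort still happens once in the SELECT, but only the plain (VALUE) index is walked, not a gigabyte-sized temporary b-tree over the whole result.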

Simon.

Re: Efficient ways to maintaining order rank / row_number() in a rather large set ?

Dominique Devienne
In reply to this post by Eric Grange
On Tue, Jan 9, 2018 at 11:26 AM, Eric Grange <[hidden email]> wrote:

> So the order by is used to control the insertion order, so that the RANK
> autoinc primary key ends up with natural rank order


But then, if your range queries are based on a rank derived from value, why
not index value directly? You'd still get fast range queries based on
values, no? That probably forces your app to maintain a mapping from "rank"
to values, but you don't need to know all ranks, just the few necessary to
know where to "page" from, no? (assuming I'm guessing correctly what you
use rank for).

SQLite then maintains that index automatically, no? I'm probably missing
something though. Sounds too simple... --DD

Re: Efficient ways to maintaining order rank / row_number() in a rather large set ?

Eric Grange
> Do you actually have a need to find the 4512nd rank ?

Yes, this is used to display and check on "neighbors", i.e. keys with
similar ranks.
The other very common query is to show the rank for a given key.

In both cases, since things are constantly in flux, the absolute rank and
neighbors do not really matter (outside the top ranks), but changes in rank
are what is being looked at. I.e. having rank 155k or 154k is not really
meaningful in itself, but on the other hand gaining 1000 "ranks" by going
from 155k to 154k is what users are after.


> 3) In your programming language, number the rows by doing
>
>        SELECT KEY FROM RANKED ORDER BY VALUE
>
> and for each row returned doing an UPDATE for the RANK value.
> It’ll be slower, but it won’t take up the huge chunk of memory needed to
> keep that index in memory.

Yes, that's what I tried, but it is indeed much slower than deleting the
whole table and inserting everything all at once, because as soon as one
key moves from the top tier to the bottom (or vice versa), you have to
touch basically all records, and doing so with many UPDATEs wrecks the
performance completely.

> But then, if your range queries are based on a rank derived from value,
> why not index value directly? You'd still get fast range queries based on
> values, no?

You get fast value range queries, but rank range queries become slower and
slower the farther away you get from the top rank.

Also, while most end-user queries are for the top 100 / top 1000, those
results are cached and only infrequently hit the db (so even if finding
the top 1000 was slow and inefficient, it would not really matter).
In practice, the queries that really hit the db are for "random" keys
far from the top 1000.

Eric



Re: Efficient ways to maintaining order rank / row_number() in a rather large set ?

Dominique Devienne
On Tue, Jan 9, 2018 at 12:35 PM, Eric Grange <[hidden email]> wrote:

> > But then, if your range queries are based on a rank derived from value,
> > why not index value directly? You'd still get fast range queries based
> > on values, no?
>
> You get fast value range queries, but rank range queries become slower
> and slower the farther away you get from the top rank.

As I wrote, that can be (greatly IMHO) mitigated by having a partial
mapping from values to ranks across the value range, which allows you to
restrict your query to a value-range that's larger than the exact
rank-range you want, but still narrow enough to be fast.

And another thought is that you may be able to derive that partial mapping
from the stats ANALYZE [1] extracts and stores in [2] or [3], if you don't
want to do it in your app (stats on the value column of your original
table, not the ranked "derived" one).

Also, I can't find the name right now, and thus no link sorry, but SQLite
has a way to "mount" an index as a table, if I recall correctly (mostly
for troubleshooting). Not sure that's super useful though, since you'd
probably need access to low-level b-tree info to estimate ranks from it;
the stat(1|3|4) table info is a better avenue I think. --DD

[1] https://www.sqlite.org/lang_analyze.html
[2] https://www.sqlite.org/fileformat2.html#stat3tab
[3] https://www.sqlite.org/fileformat2.html#stat4tab

Re: Efficient ways to maintaining order rank / row_number() in a rather large set ?

Dinu
In reply to this post by Eric Grange
The only way to do this efficiently would be to have counting (range) index
b-trees. Since you don't, you're stuck with an O(n) implementation, either
on reading or on writing. So your solution is about as good as it gets,
save maybe some implementation particularities.

However, you may consider a shift in perspective: with this kind of data,
statisticians use percentiles.
That is, instead of querying for ranks 21,000-22,000, you query for the
"top 1%", "6-8%", etc., based on either value or rank; this way, you can
maintain a percentile rank table as granular as you like (e.g. 1%
granularity results in a table with 100 lines). Each line would hold the
count, the minimum value and the maximum value.
Such a table is much faster to update, and when you need to retrieve the
actual records, you query by range (value BETWEEN min AND max) joined with
the percentile table.
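A rough sketch of such a percentile table with Python's sqlite3 (all names and the synthetic data are made up for illustration; real data with ties on value would need a tie-breaker column):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (key INTEGER PRIMARY KEY, value FLOAT)")
# synthetic data: 1000 rows with distinct values
con.executemany("INSERT INTO data VALUES (?, ?)",
                [(k, ((k * 37) % 1000) / 10.0) for k in range(1, 1001)])
con.execute("CREATE INDEX data_value ON data (value)")

# one summary row per 1% bucket: count, min value, max value
con.execute("CREATE TABLE pct (bucket INTEGER PRIMARY KEY, "
            "cnt INTEGER, vmin FLOAT, vmax FLOAT)")
values = [v for (v,) in con.execute("SELECT value FROM data ORDER BY value DESC")]
n = len(values)
for b in range(100):
    chunk = values[b * n // 100:(b + 1) * n // 100]   # descending order
    con.execute("INSERT INTO pct VALUES (?, ?, ?, ?)",
                (b, len(chunk), chunk[-1], chunk[0]))

# ranks 41-50 live in bucket 4; fetch them through the value index
vmin, vmax = con.execute("SELECT vmin, vmax FROM pct WHERE bucket = 4").fetchone()
rows = con.execute("SELECT key, value FROM data WHERE value BETWEEN ? AND ? "
                   "ORDER BY value DESC", (vmin, vmax)).fetchall()
```

The rank query touches one summary row plus a narrow slice of the value index, and only the 100-row summary table needs rewriting when values change.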




Re: Efficient ways to maintaining order rank / row_number() in a rather large set ?

Dinu
In reply to this post by Eric Grange-3
Analogous to the percentile solution (it's actually the same thing), you
can use a checkpointing table. This has roughly O(sqrt(n)) complexity for
both read and write.

I.e., say you expect to have 1M records and define order based on value,
then id.

You then make a checkpoint table (first_rank, value, rec_id) holding every
1000th record in sorted order. So row 1 of the checkpoint table corresponds
to the 1000th sorted record.

When you insert/delete a row, you only need to update checkpoints that come
after said row.

When you are searching for row 4521, you do something like:

SELECT
    table.*
FROM
    table
JOIN
    checkpoint
WHERE
    (
        (table.value = checkpoint.value AND table.id >= checkpoint.rec_id) OR
        table.value > checkpoint.value
    ) AND
    checkpoint.first_rank = 4500
ORDER BY
    table.value ASC, table.id ASC
LIMIT 21, 1
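A runnable sketch of that checkpoint scheme (the 1000-row spacing and column names follow the post; the synthetic data and exact queries are my own illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, value FLOAT)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [(i, float((i * 7919) % 100000)) for i in range(1, 10001)])
con.execute("CREATE INDEX t_value ON t (value, id)")

# checkpoint every 1000th record of the (value, id) sort order
con.execute("CREATE TABLE checkpoint "
            "(first_rank INTEGER PRIMARY KEY, value FLOAT, rec_id INTEGER)")
srt = con.execute("SELECT value, id FROM t ORDER BY value, id").fetchall()
con.executemany("INSERT INTO checkpoint VALUES (?, ?, ?)",
                [(r + 1, srt[r][0], srt[r][1]) for r in range(0, len(srt), 1000)])

# rank lookup: start at the nearest checkpoint at or below the target rank,
# then scan at most 999 rows forward in (value, id) order
target = 4521
fr, v, rid = con.execute(
    "SELECT first_rank, value, rec_id FROM checkpoint "
    "WHERE first_rank <= ? ORDER BY first_rank DESC LIMIT 1", (target,)).fetchone()
row = con.execute(
    "SELECT id, value FROM t "
    "WHERE value > ? OR (value = ? AND id >= ?) "
    "ORDER BY value, id LIMIT 1 OFFSET ?", (v, v, rid, target - fr)).fetchone()
```

The OFFSET here is bounded by the checkpoint spacing, so the scan never exceeds 999 rows regardless of how far the target rank is from the top.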




Re: Efficient ways to maintaining order rank / row_number() in a rather large set ?

Wade, William
In reply to this post by Eric Grange-3
If you were going to do this entirely in memory (perhaps in C, or some similar language), you would likely use some tree structure where each node keeps track of the number of descendants (direct and indirect) of that node. That allows the operations you describe to occur in O(log(N)) time. Single-record insert/delete/update has the same time complexity.

It is likely that for your tree, every node would have the same structure (struct in C), or else every internal node would have one structure, and every leaf node would have another structure.

Now given a bunch of objects with the same structure, you can easily store them in a relational database, rather than in memory, and perform similar operations on them. A collection of struct instances turns into a table, and a C pointer turns into a row-id (or similar). This isn't entirely free, of course. In C we think of a pointer dereference as occurring in constant time, while in a database, a key lookup is typically log(N) time, but still, your log(N) in-memory solution becomes a log-squared(N) database solution, and that is usually fast enough.

Of course you lose some of the database "convenience". You're essentially implementing trees which are close to those that are already "free" in sqlite. Likewise, some simple SQL queries turn into something more complex (since you need to maintain your tree). At least you still get the ACID benefits.

If I google "counting tree in sqlite" I see some hits that, perhaps, already do this kind of thing.
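One concrete shape such a counting structure can take is a Fenwick (binary indexed) tree over value buckets; this in-memory sketch is my own illustration of the O(log(N)) insert/delete/rank operations described above, not something built into SQLite:

```python
class Fenwick:
    """Counts per value-bucket; prefix sums give ranks in O(log n)."""
    def __init__(self, size):
        self.n = size
        self.tree = [0] * (size + 1)

    def add(self, bucket, delta=1):      # bucket in 1..n; delta=-1 deletes
        while bucket <= self.n:
            self.tree[bucket] += delta
            bucket += bucket & -bucket

    def count_le(self, bucket):          # how many items in buckets 1..bucket
        s = 0
        while bucket > 0:
            s += self.tree[bucket]
            bucket -= bucket & -bucket
        return s

ft = Fenwick(1000)
for b in (5, 17, 17, 900):               # insert four items by bucket
    ft.add(b)
total = ft.count_le(1000)
# descending rank of an item in bucket 17: items strictly above it, plus one
rank = total - ft.count_le(17) + 1
```

Persisting the `tree` array as rows of a table gives the log-squared(N) database variant described above: each rank query or single-item update touches O(log N) rows.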

Regards,
Bill


Re: Efficient ways to maintaining order rank / row_number() in a rather large set ?

Simon Slavin-3
In reply to this post by Eric Grange
On 9 Jan 2018, at 11:35am, Eric Grange <[hidden email]> wrote:

> In both cases, since things are constantly in flux, the absolute rank and
> neighbor do not really matter
> (outside the top ranks), but changes in rank are being looked at.
> i.e. having rank 155k or 154k is not really meaningful in itself, but on
> the other hand
> gaining 1000 "ranks" by going from 155k to 154k is what users are after.

Okay, I see what you mean.  You have a special, unusual, case where the ranks change frequently and you have a genuine need to do things like "give me all the ranks from 154k to 155k".

That’s a very difficult thing to do quickly.  The fastest way to do it would be different depending on which you do more: change ranks or query ranks.

The canonical SQL way would be that every time a figure changes you would change the ranks of all the rows between the two positions.  This would take only two UPDATE commands each time.  The data would be up-to-date all the time and therefore queries would be fast.  But if you have a million rows this could involve a lot of number-shuffling and I can see that it might not work out in the real world.
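A sketch of that shuffle for a single rank change (Python's sqlite3, simplified table from the thread). Note that because rank is the primary key, the shift has to detour through negative ranks to dodge transient UNIQUE collisions, so it comes out as four statements rather than two:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ranked (rank INTEGER PRIMARY KEY, key INTEGER)")
con.executemany("INSERT INTO ranked VALUES (?, ?)",
                [(r, 100 + r) for r in range(1, 8)])

old_rank, new_rank = 6, 2        # key 106 improves from rank 6 to rank 2
with con:
    # park the moving row outside the rank range
    con.execute("UPDATE ranked SET rank = 0 WHERE rank = ?", (old_rank,))
    # shift ranks new_rank..old_rank-1 down by one, via negatives so the
    # update never collides with a row it hasn't reassigned yet
    con.execute("UPDATE ranked SET rank = -(rank + 1) "
                "WHERE rank BETWEEN ? AND ?", (new_rank, old_rank - 1))
    con.execute("UPDATE ranked SET rank = -rank WHERE rank < 0")
    con.execute("UPDATE ranked SET rank = ? WHERE rank = 0", (new_rank,))

order = [k for (k,) in con.execute("SELECT key FROM ranked ORDER BY rank")]
```

Each value change still rewrites every row between the two positions, which is the number-shuffling cost noted above.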

>> But then, if your range queries are based on a rank derived from value, why
>> not index value directly? You'd still get fast range queries based on values, no?
>
> You get fast value range queries, but rank range queries become slower and
> slower the farther away you get from the top rank.

This should not be true.  You should not be using OFFSET.  Your queries should be something like

        SELECT * FROM ranked WHERE rank BETWEEN 154000 AND 155000 ORDER BY rank

which should be lightning fast because you have an index on "rank".

Simon.

Re: Efficient ways to maintaining order rank / row_number() in a rather large set ?

Eric Grange
> But if you have million rows this could involve a lot of number-shuffling
and I can see that it might not work out in the real world.

Yes, and also while the value field can change several times per second
(thousands of times in some cases), it is acceptable to have the ranking
updated at a lower frequency and display somewhat "stale" rankings & values.

> This should not be true.  You should not be using OFFSET.  Your queries
> should be something like

That was with only the "value" field being indexed (so without a rank
field); is there a better way than OFFSET in that case ?



Re: Efficient ways to maintaining order rank / row_number() in a rather large set ?

Simon Slavin-3
On 12 Jan 2018, at 10:31am, Eric Grange <[hidden email]> wrote:
>
>> This should not be true.  You should not be using OFFSET.  Your queries
>> should be something like
>
> That was with only the "value" field being indexed (so without a rank
> field), is there a better way than OFFSET in that case ?

Can’t think of one.  OFFSET is the easiest way to get what you want if you’re not storing the values you’re searching/sorting on.  But as you’ve seen, SQLite implements OFFSET by computing the whole result from the first row and simply not reporting the first entries.  So it’s slow.

Simon.