Limiting the result set size at each recursive step in a CTE

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Limiting the result set size at each recursive step in a CTE

Thomas Zimmermann
Hi!

Sometimes it is desirable to limit the size of the queue¹ in a recursive
CTE when the
domain allows for it. That puts an upper bound on processing time, even
when the given table is huge.
This can be achieved by adding a LIMIT clause at every step (setup and
recursive).
But there is my problem: The setup step allows subqueries, but the
recursive step doesn't.
Is there a general solution available?

As a concrete example, consider the following table for a commenting system:

CREATE TABLE comment (
     comment_id INTEGER PRIMARY KEY,
     parent_comment_id INTEGER REFERENCES comment (comment_id),
     created_at INTEGER NOT NULL -- timestamp, bigger means newer
);
CREATE INDEX comment_hierarchy ON comment (parent_comment_id, created_at
DESC);

The comments should be arranged by hierarchy and creation date, newest
first.
Only up to 100 comments should be shown. Example:

Comment 2 (2019-05-03)
- Comment 3 (2019-05-04)
- Comment 4 (2019-05-05)
-- Comment 5 (2019-05-05)
Comment 1 (2019-05-02)

The following query should accomplish this, but the optimization isn't
allowed (see comments inline):

WITH RECURSIVE
     sorted_comment(comment_id, parent_comment_id, created_at, depth) AS (
         -- Start with all comments at the root level.
         -- Optimization: Only consider the latest 100 comments, since
that's the maximum amount we could pick from this level.
         -- This guarantees good performance even when there are
millions of comments at the root level.
         SELECT *
         FROM (
             SELECT comment_id, parent_comment_id, created_at, 0 AS depth
             FROM comment
             WHERE parent_comment_id IS NULL
             ORDER BY created_at DESC
             LIMIT 100
         )

         UNION ALL

         -- Find all direct descendants of the current comment from the
queue.
         -- Same optimization: There might be many comments at this
level, but we could oly ever consider up to the latest 100.
         -- (Actually, we only need to consider (100 -
COUNT(sorted_comment)), but let's ignore that for now.)
         -- NOTE: This doesn't actually work, Error: recursive reference
in a subquery: sorted_comment
         SELECT *
         FROM (
             SELECT c.comment_id, c.parent_comment_id, c.created_at,
sc.depth + 1 AS depth
             FROM comment AS c
             JOIN sorted_comment AS sc
             ON c.parent_comment_id = sc.comment_id
             ORDER BY created_at DESC
             LIMIT 100
         )
         -- Do a depth-first search, picking the latest comment from the
queue.
         ORDER BY depth DESC, created_at DESC
         LIMIT 100
     )
SELECT comment_id, parent_comment_id, created_at, depth
FROM sorted_comment;

I would be very interested in a general solution that still allows for
the adjacency list design,
but I'm open to denormalization. :)

Kind regards,
Thomas

¹ As defined in the documentation here: https://sqlite.org/lang_with.html
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Limiting the result set size at each recursive step in a CTE

R Smith-2
On 2019/05/07 7:57 PM, Thomas Zimmermann wrote:
> Hi!
>
> Sometimes it is desirable to limit the size of the queue¹ in a
> recursive CTE//...

> CREATE TABLE comment (
>     comment_id INTEGER PRIMARY KEY,
>     parent_comment_id INTEGER REFERENCES comment (comment_id),
>     created_at INTEGER NOT NULL -- timestamp, bigger means newer
> );
> CREATE INDEX comment_hierarchy ON comment (parent_comment_id,
> created_at DESC);
>
> WITH RECURSIVE //...

> I would be very interested in a general solution that still allows for
> the adjacency list design,
> but I'm open to denormalization. :)


It's tricky, but there is a solution now that Window-functions have
joined the fray.

Essentially what we need is to first extract from the table a list of
all the sub comments by parent comment, but while the query order is not
helpful, we can partition by the parent-comment-ID's and number the rows
(which we CAN apply a sort-order to thanks to the Window function
methodology), so the first CTE does just that.

The next step is to list the origin rows (which have parent_id = NULL)
and then amend to them every sub-comment belonging to them, but only
where the created row-order ID is less than 100 (or whatever value you
pick).

Lastly, we only use full sorting order in the very final query (outside
the CTE's) according to main comment dates and sub-comment id's, which
would make things fastest. This solution will work universally for all
similar types of tree-queries.

While this compiles/runs, I have not been able to test this with data
because you gave no data and I was too lazy to make up data, but I'm
pretty sure it would work. If it doesn't, please send some data and
expected outcome, then we can fix it.

WITH RECURSIVE
     comments_by_parent(subid, comment_id, parent_comment_id,
created_at) AS (
         SELECT ROW_NUMBER() OVER (PARTITION BY c.parent_comment_id
ORDER BY c.created_at DESC) AS subid,
                c.comment_id, c.parent_comment_id, c.created_at
           FROM comment AS c
          WHERE c.parent_comment_id IS NOT NULL
),  sorted_comment(comment_id, created_at, sub_comment_id,
sub_created_at, depth, idx) AS (
         SELECT comment_id, created_at, NULL, NULL, 0, 0
         FROM (SELECT comment_id, created_at
                 FROM comment
                WHERE parent_comment_id IS NULL
                ORDER BY created_at DESC
                LIMIT 100
              )
         UNION ALL
         SELECT sc.comment_id, sc.created_at, cp.comment_id,
cp.created_at, sc.depth + 1, cp.subid
           FROM sorted_comment AS sc
           JOIN comments_by_parent AS cp ON cp.parent_comment_id =
sc.comment_id
          WHERE cp.subid < 100
)
SELECT comment_id, created_at, sub_comment_id, sub_created_at, depth, idx
   FROM sorted_comment
  ORDER BY created_at, comment_id, idx
;


Good luck!
Ryan




_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users