Why doesn't SQLite optimise this subselect?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Why doesn't SQLite optimise this subselect?

Jonathan Moules-3
Hi All,
   This is more of an academic question as I've come up with a better
query, but I was wondering why SQLite doesn't optimise this query.

Lets say I have two tables, simplified here. One contains webpage
contents and a unique hash of those contents (the primary key), the
other contains a history of lookups. The hash is used as a foreign key
between the two tables.

Table: webpage_contents
content_hash -- Primary key
post_processing_info
page_content

This table has about 10,000 rows

crawling_lookups
content_hash -- foreign key, has an index
is_json -- a flag indicating if it's JSON
is_html -- a flag indicating if it's HTML

This table has 500,000 rows

(Note: I appreciate this is bad design now and the flags should be in
webpage content; it evolved organically).

Now, I want to query the content table post_processing_info, and join in
the flags from the lookups at the same time. After a bit of fiddling
around to stop duplication happening, I got this:

SELECT
     w.post_processing_info,
     l.is_json
FROM
     webpage_contents w
JOIN
     (SELECT * from lookup_tables.lookups GROUP BY content_xxhash) l
     USING (content_hash)
WHERE
     content_hash = 'abc'


This takes about 2 seconds to run, which is quite slow given
content_hash is indexed in both tables and it's running from an SSD.

I appreciate there are better ways to write this (and I'm using one now,
I was a wee bit meandering getting there though), but it I was left
wondering - why did SQLite not optimise the subquery?

The Explain Query Plan for that is:

1    0    0    SCAN TABLE crawling_lookups USING INDEX content_hash_idx
0    0    0    SEARCH TABLE webpage_contents AS c USING INDEX
sqlite_autoindex_page_contents_1 (content_hash=?)
0    1    1    SCAN SUBQUERY 1 AS l

It seems to be doing the group on the subselect using the index (fast -
about 0.1 seconds), then at step 3 doing a scan of the subquery results
for the content_hash. This would be the slow part as that doesn't have
an index.
I figured SQL would spot that the where clause in the outermost scope
would also apply to the subselect and put it in there automatically so
the group-by would only be for the things the WHERE clause affected.

Is this an optimisation opportunity? Or is my SQL so bad it was
inevitable (more likely)?

Cheers,
Jonathan

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why doesn't SQLite optimise this subselect?

Dinu
Could you describe your intent in more detail?

SELECT * FROM ... GROUP BY is unstable at best, so I cannot really grasp
your intention. Which lookup record's

If I guess right, you might want something in the lines of:

SELECT
     w.post_processing_info,
     l.is_json
FROM
     webpage_contents w
JOIN
     (SELECT DISTINCT is_json from lookup_tables.lookups) l
     USING (content_hash)
WHERE
     content_hash = 'abc'

(this might return 0-3 rows if lookup_tables contains 3 values (0,1,and
NULL) for is_json)

which in turn I think would optimize best if written as:

SELECT DISTINCT
     w.content_hash,
     w.post_processing_info,
     l.is_json
FROM
     webpage_contents w
JOIN
     lookup_tables.lookups l
     USING (content_hash)
WHERE
     content_hash = 'abc'




--
Sent from: http://sqlite.1065341.n5.nabble.com/
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Why doesn't SQLite optimise this subselect?

Dinu
If you want just 1 value (any) from lookup, you can use:

SELECT
     post_processing_info,
     (
        SELECT
            is_json
        FROM
            tables_lookup
        WHERE
            tables_lookup.content_hash=webpage_contents.content_hash
        LIMIT 1
     ) AS is_json
FROM
     webpage_contents
WHERE
     content_hash = 'abc'

If you want only one value,
or

SELECT
    post_processing_info,
    x.is_json
FROM
     webpage_contents
JOIN
     (
        SELECT
            content_hash,
            MIN(is_json)
        FROM
            tables_lookup
        GROUP BY
            content_hash
     ) AS x ON
         x.content_hash=webpage_contents.content_hash
FROM
     webpage_contents

if you want the full join... I'm using MIN() here as an arbitrary function
to get just one value from the group.



--
Sent from: http://sqlite.1065341.n5.nabble.com/
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users