Knowing when not to use a R-Tree index, and clustering tables along spatial indexes

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Knowing when not to use a R-Tree index, and clustering tables along spatial indexes

Andrea Aime
Hi,
I'm writing some software that can read data off GeoPackage (SQLite + rtree
+ standardized
set of metadata tables) as an alternative format for spatial databases,
like PostgreSql with the
PostGIS extension.

Now, when I use PostGIS the query plan optimizer checks the bbox provided
in the query
and verifies if using the spatial index is a good idea, or not. At
conferences I've been told
that the query has to be rather selective (e.g., retrieve less than 10% of
the data) in order
for the index to actually be used.

With SQLite R-Tree I'm using either a join with the index virtual table, or
a subquery
retrieving the ids from the rtree. Regardless, the query is basically
ordering SQLite
to use the index.
So I was wondering, is there any opportunity to run a blazing fast
pre-query against
the index that will tell me whether joining/subquerying into the rtree is
going to be a win, or not?

Also, while I'm here, in PostGIS there is an option to cluster a table
along the spatial
index, in order to reduce IO when the spatial index is the main access
driver (which is often
the case in geographic information systems). I looked at tables with no
rowids, but
it does not seem like a way to do it (spatial index not being suitable for
primary key).
Anything else that could be done here?

Cheers
Andrea
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Knowing when not to use a R-Tree index, and clustering tables along spatial indexes

Clemens Ladisch
Andrea Aime wrote:
> So I was wondering, is there any opportunity to run a blazing fast pre-query against
> the index that will tell me whether joining/subquerying into the rtree is going to be a win, or not?

Each node in an R-tree index stores the coordinates of the leaf objects/child nodes;
see <https://stackoverflow.com/a/25242162/11654> for how to get this information.
To get the index extents, combine the extents of all entries in the root node.

> Also, while I'm here, in PostGIS there is an option to cluster a table along the spatial
> index, in order to reduce IO when the spatial index is the main access driver (which is often
> the case in geographic information systems). I looked at tables with no rowids, but
> it does not seem like a way to do it (spatial index not being suitable for primary key).
> Anything else that could be done here?

For a static table, you could order the rows along a space-filling curve.
For a dynamic table, you would have to replace the R-tree with something else
(<https://en.wikipedia.org/wiki/Hilbert_R-tree>), which would no longer be portable.


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Knowing when not to use a R-Tree index, and clustering tables along spatial indexes

Wolfgang Enzinger
In reply to this post by Andrea Aime
Am Fri, 29 Dec 2017 19:59:12 +0100 schrieb Andrea Aime:

> With SQLite R-Tree I'm using either a join with the index virtual table, or
> a subquery
> retrieving the ids from the rtree. Regardless, the query is basically
> ordering SQLite
> to use the index.
> So I was wondering, is there any opportunity to run a blazing fast
> pre-query against
> the index that will tell me whether joining/subquerying into the rtree is
> going to be a win, or not?

I had good results in a similar situation with this strategy:

First, query the overall extent of your data, like this:
SELECT MIN(mingx),MAX(maxgx),MIN(mingy),MAX(maxgy) FROM flst_shape_index;

Second, for every spatial query, calculate the size of the area in
question.

Then, with these two rectangles, you can calculate the LIKELIHOOD that a
particular record in your data is located within the requested area, i.e.
meets the spatial criteria.

Let SQLite know about that likelihood in a JOIN query, using the LIKELIHOOD
function (http://www.sqlite.org/lang_corefunc.html#likelihood). Also, if
possible, give LIKELIHOOD information to the query planner for any other
criteria used. SQLite will consider them.

HTH Wolfgang

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Knowing when not to use a R-Tree index, and clustering tables along spatial indexes

Clemens Ladisch
Wolfgang Enzinger wrote:
> First, query the overall extent of your data, like this:
> SELECT MIN(mingx),MAX(maxgx),MIN(mingy),MAX(maxgy) FROM flst_shape_index;

This results in a full table scan.  Instead of caching these values manually,
it would be a better idea to read them from the index:

  SELECT rtreenode(2, data) FROM flst_shape_index_node WHERE nodeno = 1;

(rtreenode() is undocumented; maybe you should use your own decoder.)

> Let SQLite know about that likelihood in a JOIN query

This does not appear to change anything with a virtual table:

  CREATE TABLE t(id INETGER PRIMARY KEY, x, [...]);
  CREATE VIRTUAL TABLE i USING rtree(id, minx, maxx);

  SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.000001);
  --EQP-- 0,0,1,SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B0
  --EQP-- 0,1,0,SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)
  SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.999999);
  --EQP-- 0,0,1,SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B0
  --EQP-- 0,1,0,SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)



Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Knowing when not to use a R-Tree index, and clustering tables along spatial indexes

Wolfgang Enzinger
Am Mon, 1 Jan 2018 10:45:50 +0100 schrieb Clemens Ladisch:

> Wolfgang Enzinger wrote:
>> First, query the overall extent of your data, like this:
>> SELECT MIN(mingx),MAX(maxgx),MIN(mingy),MAX(maxgy) FROM flst_shape_index;
>
> This results in a full table scan.  Instead of caching these values manually,
> it would be a better idea to read them from the index:
>
>   SELECT rtreenode(2, data) FROM flst_shape_index_node WHERE nodeno = 1;
>
> (rtreenode() is undocumented; maybe you should use your own decoder.)

Thanks, didn't know that, I'll look into it. You're right, my query results
in a full table scan, however it's pretty fast anyway - less than a second
with 160,000 rows and cold cache.

>> Let SQLite know about that likelihood in a JOIN query
>
> This does not appear to change anything with a virtual table:
>
>   CREATE TABLE t(id INETGER PRIMARY KEY, x, [...]);
>   CREATE VIRTUAL TABLE i USING rtree(id, minx, maxx);
>
>   SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.000001);
>   --EQP-- 0,0,1,SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B0
>   --EQP-- 0,1,0,SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)
>   SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.999999);
>   --EQP-- 0,0,1,SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B0
>   --EQP-- 0,1,0,SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)

This is not surprising because only criteria concerning the "i" table are
in effect here. So it is clear that even a likelihood of 0.999999 is more
selective than a likelihood of 1.000 (= no filter criteria in this table).
However, if your query has criteria both in the "i" and the "t" table, it
can make a difference.

Of course, anybody correct me if I'm mistaken.

Happy new year!
Wolfgang

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Knowing when not to use a R-Tree index, and clustering tables along spatial indexes

Clemens Ladisch
Wolfgang Enzinger wrote:

> Am Mon, 1 Jan 2018 10:45:50 +0100 schrieb Clemens Ladisch:
>> Wolfgang Enzinger wrote:
>>> Let SQLite know about that likelihood in a JOIN query
>>
>> This does not appear to change anything with a virtual table:
>>
>>   SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.000001);
>>   SELECT t.* FROM t JOIN i USING (id) WHERE likelihood(i.minx BETWEEN 10 AND 20, 0.999999);
>>   ...
>
> This is not surprising because only criteria concerning the "i" table are
> in effect here. So it is clear that even a likelihood of 0.999999 is more
> selective than a likelihood of 1.000 (= no filter criteria in this table).
> However, if your query has criteria both in the "i" and the "t" table, it
> can make a difference.

It is indeed possible to change the query so that SQLite uses rowid
lookups for the R-tree filter (INDEX 1).  However, any likelihood on the
R-tree search expression still did not make any difference.  Do you have
an example?


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Knowing when not to use a R-Tree index, and clustering tables along spatial indexes

Wolfgang Enzinger
Am Mon, 1 Jan 2018 16:20:21 +0100 schrieb Clemens Ladisch:

> It is indeed possible to change the query so that SQLite uses rowid
> lookups for the R-tree filter (INDEX 1).  However, any likelihood on the
> R-tree search expression still did not make any difference.  Do you have
> an example?

Try:

---

CREATE TABLE t(id INTEGER PRIMARY KEY,x INTEGER,y INTEGER,z INTEGER);
CREATE VIRTUAL TABLE i USING rtree(id,minx,maxx,miny,maxy);
CREATE INDEX t_x ON t(x);

---

EXPLAIN QUERY PLAN
SELECT * FROM t INNER JOIN i USING(id)
 WHERE LIKELIHOOD(i.minX>=-81.08 AND i.maxX<=-80.58
   AND i.minY>=35.00  AND i.maxY<=35.44, 0.999)      -- 0.999
   AND LIKELIHOOD(t.x=3, 0.001);                     -- 0.001

-> SEARCH TABLE t USING INDEX t_x (x=?)
-> SCAN TABLE i VIRTUAL TABLE INDEX 1:

---

SELECT * FROM t INNER JOIN i USING(id)
 WHERE LIKELIHOOD(i.minX>=-81.08 AND i.maxX<=-80.58
   AND i.minY>=35.00  AND i.maxY<=35.44, 0.001)      -- 0.001
   AND LIKELIHOOD(t.x=3, 0.999);                     -- 0.999

-> SCAN TABLE i VIRTUAL TABLE INDEX 2:D0B1D2B3
-> SEARCH TABLE t USING INTEGER PRIMARY KEY (rowid=?)

---

Tested with SQLite 3.13.0 here, but IIRC newer versions behave the same.

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Knowing when not to use a R-Tree index, and clustering tables along spatial indexes

Andrea Aime
In reply to this post by Clemens Ladisch
On Mon, Jan 1, 2018 at 10:45 AM, Clemens Ladisch <[hidden email]> wrote:

> Wolfgang Enzinger wrote:
> > First, query the overall extent of your data, like this:
> > SELECT MIN(mingx),MAX(maxgx),MIN(mingy),MAX(maxgy) FROM
> flst_shape_index;
>
> This results in a full table scan.  Instead of caching these values
> manually,
> it would be a better idea to read them from the index:
>
>   SELECT rtreenode(2, data) FROM flst_shape_index_node WHERE nodeno = 1;
>
> (rtreenode() is undocumented; maybe you should use your own decoder.)
>

Intriguing! I'm looking at the output on my local data set (all Texas
roads, 3+ million records, large,
but not all that much in GIS systems):

SELECT rtreenode(2, data) FROM rtree_tiger_shp_geom_node WHERE nodeno = 1;
{26338 -100.598 -96.2836 25.8411 30.6962} {26337 -96.9349 -93.5195 28.598
31.1915} {43295 -96.9621 -93.6133 30.6486 33.9442} {62086 -106.644 -98.0079
28.887 32.1568} {80094 -97.6371 -96.4565 28.5382 33.9419} {97065 -98.5755
-97.2869 28.686 34.1405} {108645 -104.096 -98.3331 31.6511 36.5007}

So, if I'm guessing right, each set of numbers is a count of features and
then the rectangle intersecting them?
I've prepared a picture with the data:

https://drive.google.com/file/d/15WpyfWNKdVeBoLDDRCoCW2xx87T6qf_Z/view?usp=sharing

This query is indeed quite a bit faster and more interesting than scanning
the entire index virtual table, wondering,
is there also a way to get the "next level" of the rtree?
This one returns only a few records, I would not mind having a bit more in
memory for more accurate pre-checks (dabase
access can be optimized out fully if the requested rectangle falls outside
of the rtree boxes no?)

What worries me is that these functions are undocumented, and thus, I
imaging, unsupported and at risk
of being removed at any time. Do you have a feeling of how likely is this
to happen?



>
> > Let SQLite know about that likelihood in a JOIN query
>

The query is being generated after translation from another generic filter
language and at the moment it's
expressed as a sub-query (cause it can be negated, related with other
conditions by and/or, and so on).
I've tried to use it with little luck, the subquery is always run according
to explain plan:

sqlite> explain query plan SELECT "fid","geom" as "geom" FROM "tiger_shp"
WHERE  likelihood("fid" IN (SELECT id FROM "rtree_tiger_shp_geom" r WHERE
r.maxx >= -98.41324970985764 AND r.minx <= -98.33594825643836 AND r.maxy >=
30.532285631231357 AND r.miny <= 30.595068029284644), *0.01*);
0|0|0|SEARCH TABLE tiger_shp USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE LIST SUBQUERY 1
1|0|0|SCAN TABLE rtree_tiger_shp_geom AS r VIRTUAL TABLE INDEX 2:D1B0D3B2
sqlite> explain query plan SELECT "fid","geom" as "geom" FROM "tiger_shp"
WHERE  likelihood("fid" IN (SELECT id FROM "rtree_tiger_shp_geom" r WHERE
r.maxx >= -98.41324970985764 AND r.minx <= -98.33594825643836 AND r.maxy >=
30.532285631231357 AND r.miny <= 30.595068029284644), *0.99*);
0|0|0|SEARCH TABLE tiger_shp USING INTEGER PRIMARY KEY (rowid=?)
0|0|0|EXECUTE LIST SUBQUERY 1
1|0|0|SCAN TABLE rtree_tiger_shp_geom AS r VIRTUAL TABLE INDEX 2:D1B0D3B2

Oh well, since the code is generating the SQL, it can directly omit the
subquery. Is there any literature on what a good threshold
would be?

Cheers
Andrea
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Knowing when not to use a R-Tree index, and clustering tables along spatial indexes

Clemens Ladisch
In reply to this post by Wolfgang Enzinger
Wolfgang Enzinger wrote:

> Am Mon, 1 Jan 2018 16:20:21 +0100 schrieb Clemens Ladisch:
>> It is indeed possible to change the query so that SQLite uses rowid
>> lookups for the R-tree filter (INDEX 1).  However, any likelihood on the
>> R-tree search expression still did not make any difference.  Do you have
>> an example?
>
> SELECT * FROM t INNER JOIN i USING(id)
>  WHERE LIKELIHOOD(i.minX>=-81.08 ..., 0.999)      -- 0.999
>    AND LIKELIHOOD(t.x=3, 0.001);                  -- 0.001
>
> SELECT * FROM t INNER JOIN i USING(id)
>  WHERE LIKELIHOOD(i.minX>=-81.08 ..., 0.001)      -- 0.001
>    AND LIKELIHOOD(t.x=3, 0.999);                  -- 0.999

Sorry, this is not what I meant.

The original question was about manually removing the R-tree search
depending on the spatial window.  So, do you have an example where
the query plan changes due to a difference in *only* the likelihood
of the R-tree search expression?


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Knowing when not to use a R-Tree index, and clustering tables along spatial indexes

Wolfgang Enzinger
Am Mon, 1 Jan 2018 20:30:10 +0100 schrieb Clemens Ladisch:

> Wolfgang Enzinger wrote:
>> Am Mon, 1 Jan 2018 16:20:21 +0100 schrieb Clemens Ladisch:
>>> It is indeed possible to change the query so that SQLite uses rowid
>>> lookups for the R-tree filter (INDEX 1).  However, any likelihood on the
>>> R-tree search expression still did not make any difference.  Do you have
>>> an example?
>>
>> SELECT * FROM t INNER JOIN i USING(id)
>>  WHERE LIKELIHOOD(i.minX>=-81.08 ..., 0.999)      -- 0.999
>>    AND LIKELIHOOD(t.x=3, 0.001);                  -- 0.001
>>
>> SELECT * FROM t INNER JOIN i USING(id)
>>  WHERE LIKELIHOOD(i.minX>=-81.08 ..., 0.001)      -- 0.001
>>    AND LIKELIHOOD(t.x=3, 0.999);                  -- 0.999
>
> Sorry, this is not what I meant.

Seems like I didn't get the original question correctly, then.

> The original question was about manually removing the R-tree search
> depending on the spatial window.  So, do you have an example where
> the query plan changes due to a difference in *only* the likelihood
> of the R-tree search expression?

No.

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users