Need advice on using nested selects in JOIN statements as a logical 'AND'

classic Classic list List threaded Threaded
22 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Need advice on using nested selects in JOIN statements as a logical 'AND'

Tristan Van Berkom
Hi,
   I don't have many years experience with the SQL language
and I've cooked up some pretty complex stuff which will run
in production environments, I just want to confirm with you
that the assumptions I've made are true (I do have a lot of
unit tests which confirm that my code works as far as I can
see).

Some minimum context is required first:

  o My software is a C library which stores an addressbook
    of vCards

  o Most of the searchable contact fields are stored in
    one table, with a PRIMARY KEY column 'uid'

  o As the vCard spec specifies many values can be
    listed for a given key, for instance "TEL" or "EMAIL",
    these are stored for the purpose of quick searches
    on a separate table (one separate table per mult-valued
    datatype).

    Each of these tables has a 'uid' column which is
    created with:
      'uid TEXT NOT NULL REFERENCES the_main_table (uid)'

    So for the table which stores emails, I have one row
    per email address for each contact (multiple rows can
    exist with the same 'uid');

    To speed up searches, I have indexes created on some
    of these auxiliary tables.

  o I need to honor a predetermined contact search API,
    searches can come in for various contact fields at
    a time, for instance:

      "Give me all the contacts who's family name starts
       with J, and only if they have an email address
       ending with .com"

    Is represented as:
        (and (beginswith "family_name" "J")
             (endswith "email" ".com"))

Now the fun starts, I've recently written a query optimizer
of sorts which reorganizes how I will form my query before
generating the SQL to execute.

So let's consider the query input:

   (and (endswith "tel" "0505") (beginswith "email" "eddie"))

Note that in this case, the three tables in context are:
   o folder_id              The main table with unique 'uid'
   o folder_id_phone_list   The table containing all phone numbers
   o folder_id_email_list   The table containing all phone numbers

Before the query optimization, my query would be:
================================================
SELECT DISTINCT summary.uid, summary.vcard FROM 'folder_id' AS summary
LEFT OUTER JOIN 'folder_id_phone_list' AS phone_list
    ON phone_list.uid = summary.uid
LEFT OUTER JOIN 'folder_id_email_list' AS email_list
    ON email_list.uid = summary.uid
WHERE (phone_list.value IS NOT NULL AND phone_list.value LIKE '%0505')
  AND (email_list.value IS NOT NULL AND email_list.value LIKE 'eddie%')

After optimization, will be generated instead like so:
================================================
SELECT DISTINCT summary.uid, summary.vcard FROM (
 SELECT DISTINCT phone_list.uid FROM 'folder_id_phone_list'
 AS phone_list
 WHERE (phone_list.value IS NOT NULL AND
        phone_list.value LIKE '%0505')
) AS phone_results
LEFT OUTER JOIN (
 SELECT DISTINCT email_list.uid FROM 'folder_id_email_list'
 AS email_list
 WHERE (email_list.value IS NOT NULL AND
       email_list.value LIKE 'eddie%')
) AS email_list_results ON email_results.uid = phone_results.uid
LEFT OUTER JOIN 'folder_id' AS summary
  ON summary.uid = email_results.uid WHERE summary.uid IS NOT NULL

Performing the nested selects achieves two things:

   o Reduce the number of rows considered in the JOIN statements

   o Leverage the index which I've created on 'folder_id_email_list'
     (I am using case insensitive LIKE statements so the indexes
     work in that statement).

The main assumption that I'm making, is the toplevel logical AND
statements (i.e. <email constraint> AND <phone constraint>),
can be transformed into a nested SELECT with the LEFT OUTER JOIN
each time "ON this_select.uid = previous_select.uid", effectively
filtering out any rows which didn't match either of the
constraints.

As far as I can tell, and I've been trying to absorb these
JOIN concepts to the best of my ability... the above two
statements will always report the same results, i.e. they
are equal statements for all practical purposes, only
the second one is much faster than the first one.

And while my test cases tell me so far that I am correct
in my assumptions, I'd really like to have some insight
from a third party authority on the matter.

Can someone please either confirm that my logic is sound
in this assumption ? Or point out to me perhaps some case
where my logic might break down ?

Please excuse the long detailed email, I don't think I could
have explained this accurately without providing enough
context.

Kind Regards,

    -Tristan


_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Igor Tandetnik-2
On 11/27/2013 11:52 PM, Tristan Van Berkom wrote:
> ================================================
> SELECT DISTINCT summary.uid, summary.vcard FROM 'folder_id' AS summary
> LEFT OUTER JOIN 'folder_id_phone_list' AS phone_list
>      ON phone_list.uid = summary.uid
> LEFT OUTER JOIN 'folder_id_email_list' AS email_list
>      ON email_list.uid = summary.uid
> WHERE (phone_list.value IS NOT NULL AND phone_list.value LIKE '%0505')
>    AND (email_list.value IS NOT NULL AND email_list.value LIKE 'eddie%')

Why are you using outer joins when your WHERE clause discards unmatched
records anyway? If you replace LEFT OUTER with INNER, the end result
would be exactly the same.

I have a strong feeling that, once you replace outer joins with regular
joins, this statement would run just as fast as your convoluted one with
nested selects. By using outer joins, you prevent SQLite from reordering
the conditions and using the most efficient search strategy - it is
forced to perform those joins left to right, which results in bad
performance because, apparently, you don't have indexes on
phone_list.uid or email_list.uid.

Try this straightforward query:

SELECT DISTINCT summary.uid, summary.vcard FROM folder_id AS summary
JOIN folder_id_phone_list AS phone_list
     ON phone_list.uid = summary.uid
JOIN folder_id_email_list AS email_list
     ON email_list.uid = summary.uid
WHERE phone_list.value LIKE '%0505'  AND email_list.value LIKE 'eddie%';

>     o Leverage the index which I've created on 'folder_id_email_list'
>       (I am using case insensitive LIKE statements so the indexes
>       work in that statement).

Normally, you need case-sensitive LIKE in order to use the index, unless
the index is created with COLLATE NOCASE. You could use EXPLAIN QUERY
PLAN to confirm that the index is indeed being utilized.
--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Tristan Van Berkom
On Thu, 2013-11-28 at 00:20 -0500, Igor Tandetnik wrote:

> On 11/27/2013 11:52 PM, Tristan Van Berkom wrote:
> > ================================================
> > SELECT DISTINCT summary.uid, summary.vcard FROM 'folder_id' AS summary
> > LEFT OUTER JOIN 'folder_id_phone_list' AS phone_list
> >      ON phone_list.uid = summary.uid
> > LEFT OUTER JOIN 'folder_id_email_list' AS email_list
> >      ON email_list.uid = summary.uid
> > WHERE (phone_list.value IS NOT NULL AND phone_list.value LIKE '%0505')
> >    AND (email_list.value IS NOT NULL AND email_list.value LIKE 'eddie%')
>
> Why are you using outer joins when your WHERE clause discards unmatched
> records anyway? If you replace LEFT OUTER with INNER, the end result
> would be exactly the same.

I was afraid someone might ask this, but I had already put a lot
of detail into this email so I left it out.

When using an INNER join, the engine does something like this:

  o Create a data set that is table_1 * table_2 * table_3 rows
    large

  o Run the constraints on what might be multiple matching rows
    in the resulting huge data set (even if I nest the selects,
    there can be other constraints to sort out on the main table).

This is extremely slow for addressbooks with 200,000 contacts,
however the INNER join allows me to drop the nested select foolery
for smaller addressbooks as the indexes can still be leveraged
in a query similar to the first version above.

This bug comment has a good detailed description of the reason
why we shifted from regular joins to LEFT OUTER joins:
    https://bugzilla.gnome.org/show_bug.cgi?id=699597#c6


> I have a strong feeling that, once you replace outer joins with regular
> joins, this statement would run just as fast as your convoluted one with
> nested selects. By using outer joins, you prevent SQLite from reordering
> the conditions and using the most efficient search strategy - it is
> forced to perform those joins left to right, which results in bad
> performance because, apparently, you don't have indexes on
> phone_list.uid or email_list.uid.

Indeed I avoided creating an index for phone_list.uid and
email_list.uid, and indeed I'm unsure of how that will effect
performance.

If I were to create indexes on the uid column of the auxiliary
tables, would that cause the INNER join to not create such a
huge dataset before checking the constraints ?

> Try this straightforward query:
>
> SELECT DISTINCT summary.uid, summary.vcard FROM folder_id AS summary
> JOIN folder_id_phone_list AS phone_list
>      ON phone_list.uid = summary.uid
> JOIN folder_id_email_list AS email_list
>      ON email_list.uid = summary.uid
> WHERE phone_list.value LIKE '%0505'  AND email_list.value LIKE 'eddie%';

This looks stunningly similar to what we had originally, which indeed
returns results in less than 10ms for addressbooks up to around 800
contacts.

This graph shows performance of INNER joins actually:
http://blogs.gnome.org/tvb/files/2013/01/filter-by-long-phone-number-prefix.png

As you can see, things start to get bad with > 800 contacts (the red
line is what we're looking at here).

>
> >     o Leverage the index which I've created on 'folder_id_email_list'
> >       (I am using case insensitive LIKE statements so the indexes
> >       work in that statement).
>
> Normally, you need case-sensitive LIKE in order to use the index, unless
> the index is created with COLLATE NOCASE. You could use EXPLAIN QUERY
> PLAN to confirm that the index is indeed being utilized.

Oh sorry, I totally mistyped that, LIKE is case insensitive by default
and we override that indeed, using "PRAGMA case_sensitive_like=ON" at
initialization time.

Thank you for taking the time to try and understand this with me :)

Cheers,
    -Tristan


_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Tristan Van Berkom
In reply to this post by Igor Tandetnik-2
On Thu, 2013-11-28 at 00:20 -0500, Igor Tandetnik wrote:

> On 11/27/2013 11:52 PM, Tristan Van Berkom wrote:
> > ================================================
> > SELECT DISTINCT summary.uid, summary.vcard FROM 'folder_id' AS summary
> > LEFT OUTER JOIN 'folder_id_phone_list' AS phone_list
> >      ON phone_list.uid = summary.uid
> > LEFT OUTER JOIN 'folder_id_email_list' AS email_list
> >      ON email_list.uid = summary.uid
> > WHERE (phone_list.value IS NOT NULL AND phone_list.value LIKE '%0505')
> >    AND (email_list.value IS NOT NULL AND email_list.value LIKE 'eddie%')
>
> Why are you using outer joins when your WHERE clause discards unmatched
> records anyway? If you replace LEFT OUTER with INNER, the end result
> would be exactly the same.
>
> I have a strong feeling that, once you replace outer joins with regular
> joins, this statement would run just as fast as your convoluted one with
> nested selects. By using outer joins, you prevent SQLite from reordering
> the conditions and using the most efficient search strategy - it is
> forced to perform those joins left to right, which results in bad
> performance because, apparently, you don't have indexes on
> phone_list.uid or email_list.uid.

Actually... I just replied to this email but forgot to mention, while
your input is really appreciated, it would be really great if you could
answer the actual question :)

I.e. is the statement logically the same ?

>
> Try this straightforward query:
>
> SELECT DISTINCT summary.uid, summary.vcard FROM folder_id AS summary
> JOIN folder_id_phone_list AS phone_list
>      ON phone_list.uid = summary.uid
> JOIN folder_id_email_list AS email_list
>      ON email_list.uid = summary.uid
> WHERE phone_list.value LIKE '%0505'  AND email_list.value LIKE 'eddie%';
>
> >     o Leverage the index which I've created on 'folder_id_email_list'
> >       (I am using case insensitive LIKE statements so the indexes
> >       work in that statement).
>
> Normally, you need case-sensitive LIKE in order to use the index, unless
> the index is created with COLLATE NOCASE. You could use EXPLAIN QUERY
> PLAN to confirm that the index is indeed being utilized.


_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Clemens Ladisch
In reply to this post by Tristan Van Berkom
Tristan Van Berkom wrote:
> When using an INNER join, the engine does something like this:
>
>   o Create a data set that is table_1 * table_2 * table_3 rows
>     large
>
>   o Run the constraints on what might be multiple matching rows
>     in the resulting huge data set (even if I nest the selects,
>     there can be other constraints to sort out on the main table).

This is wrong; constraints on the outer table are checked before records
from the inner table are searched.

> This bug comment has a good detailed description of the reason
> why we shifted from regular joins to LEFT OUTER joins:
>     https://bugzilla.gnome.org/show_bug.cgi?id=699597#c6

That query was slow because it did not do any join to begin with,
,not even with "a.id=b.id" in the WHERE clause; instead, lots of
constraints were combined with OR.

> If I were to create indexes on the uid column of the auxiliary
> tables, would that cause the INNER join to not create such a
> huge dataset before checking the constraints ?

I might or might not make a difference; check with EXPLAIN QUERY PLAN.

>> WHERE phone_list.value LIKE '%0505'

In theory, you could enable index usage by using:

   WHERE phone_list.value_reversed LIKE '5050%'

Not sure if this would be worth the effort.

>> Normally, you need case-sensitive LIKE in order to use the index, unless
>> the index is created with COLLATE NOCASE.

Also, the column must have text affinity.

> LIKE is case insensitive by default and we override that indeed, using
> "PRAGMA case_sensitive_like=ON" at initialization time.

To avoid that, you could use "GLOB 'foo*'" instead of "LIKE 'foo%'".


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Tristan Van Berkom
On Thu, 2013-11-28 at 09:43 +0100, Clemens Ladisch wrote:

> Tristan Van Berkom wrote:
> > When using an INNER join, the engine does something like this:
> >
> >   o Create a data set that is table_1 * table_2 * table_3 rows
> >     large
> >
> >   o Run the constraints on what might be multiple matching rows
> >     in the resulting huge data set (even if I nest the selects,
> >     there can be other constraints to sort out on the main table).
>
> This is wrong; constraints on the outer table are checked before records
> from the inner table are searched.
>
> > This bug comment has a good detailed description of the reason
> > why we shifted from regular joins to LEFT OUTER joins:
> >     https://bugzilla.gnome.org/show_bug.cgi?id=699597#c6
>
> That query was slow because it did not do any join to begin with,
> ,not even with "a.id=b.id" in the WHERE clause; instead, lots of
> constraints were combined with OR.
>
> > If I were to create indexes on the uid column of the auxiliary
> > tables, would that cause the INNER join to not create such a
> > huge dataset before checking the constraints ?
>
> I might or might not make a difference; check with EXPLAIN QUERY PLAN.

Yes I will have to do more research, we'll see.

>
> >> WHERE phone_list.value LIKE '%0505'
>
> In theory, you could enable index usage by using:
>
>    WHERE phone_list.value_reversed LIKE '5050%'
>
> Not sure if this would be worth the effort.

It is, and we store data in reverse to speed up suffix
searches in some cases (if configured to do so).

>
> >> Normally, you need case-sensitive LIKE in order to use the index, unless
> >> the index is created with COLLATE NOCASE.
>
> Also, the column must have text affinity.
>
> > LIKE is case insensitive by default and we override that indeed, using
> > "PRAGMA case_sensitive_like=ON" at initialization time.
>
> To avoid that, you could use "GLOB 'foo*'" instead of "LIKE 'foo%'".

However I found the escape sequences to be more easy with LIKE
statements, so better not to avoid that.

Again, while I appreciate the comments about how we think the
best way SQLite should work in the most optimal way, does anyone
have an answer to the question ?

Are the JOIN statements equal to the logical AND statements,
for all practical purposes ?

Best Regards,
    -Tristan




_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Clemens Ladisch
Tristan Van Berkom wrote:
> Are the JOIN statements equal to the logical AND statements,

Yes.

> for all practical purposes ?

If you drop all those superfluous LEFT OUTER and IS NOT NULL parts,
the database will be able to optimize the first query (the one
without subqueries) better.


Regards,
Clemens
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Tristan Van Berkom
On Thu, 2013-11-28 at 12:11 +0100, Clemens Ladisch wrote:
> Tristan Van Berkom wrote:
> > Are the JOIN statements equal to the logical AND statements,
>
> Yes.
>

Thank you.

> > for all practical purposes ?
>
> If you drop all those superfluous LEFT OUTER and IS NOT NULL parts,
> the database will be able to optimize the first query (the one
> without subqueries) better.

Yes, I definitely agree that on a conceptual level, I should not
have to consider the pre-optimization of my own query before
launching it. As a functional language, I should only have to
describe the query and let SQLite do a better optimization.

My skills in writing SQL are (already admittedly) lacking and
so, after months (of myself, and other competent engineers
involved), I ended up with this convoluted nested select thing.

I will definitely try this solution as it will significantly
simplify my query generator, but I'll need to benchmark it
with big data sets before I'm satisfied, of course.

Thanks for your help,

    -Tristan


_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Simon Slavin-3

On 28 Nov 2013, at 11:22am, Tristan Van Berkom <[hidden email]> wrote:

> Yes, I definitely agree that on a conceptual level, I should not
> have to consider the pre-optimization of my own query before
> launching it. As a functional language, I should only have to
> describe the query and let SQLite do a better optimization.

While you're concerned about optimization, you might look at your column definitions.  For instance, you are storing email addresses in a column.  Since email addresses are not case sensitive you might want to define this column as

COLLATE NOCASE

This will mean that any indexing on this column will ignore case, and any comparisons of text with text in this column will ignore case by default.  It can simplify all other SQL statements that refer to that column.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Tristan Van Berkom
On Thu, 2013-11-28 at 12:19 +0000, Simon Slavin wrote:

> On 28 Nov 2013, at 11:22am, Tristan Van Berkom <[hidden email]> wrote:
>
> > Yes, I definitely agree that on a conceptual level, I should not
> > have to consider the pre-optimization of my own query before
> > launching it. As a functional language, I should only have to
> > describe the query and let SQLite do a better optimization.
>
> While you're concerned about optimization, you might look at your column definitions.  For instance, you are storing email addresses in a column.  Since email addresses are not case sensitive you might want to define this column as
>
> COLLATE NOCASE
>
> This will mean that any indexing on this column will ignore case, and any comparisons of text with text in this column will ignore case by default.  It can simplify all other SQL statements that refer to that column.
>

That's not an issue, our matching API is completely case insensitive,
and all data which we insert into columns for searches is normalized
in advance (excepting the vCard UID field)... or normalized and reversed
in the case we configure for quick suffix matching.

Thanks for the tip anyway, though :)

Cheers,
    -Tristan



_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Igor Tandetnik-2
In reply to this post by Tristan Van Berkom
On 11/28/2013 2:40 AM, Tristan Van Berkom wrote:
> I.e. is the statement logically the same ?

Yes, I do believe that the two queries shown in your original post are
logically equivalent.
--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Tristan Van Berkom
In reply to this post by Igor Tandetnik-2
On Thu, 2013-11-28 at 00:20 -0500, Igor Tandetnik wrote:

> On 11/27/2013 11:52 PM, Tristan Van Berkom wrote:
> > ================================================
> > SELECT DISTINCT summary.uid, summary.vcard FROM 'folder_id' AS summary
> > LEFT OUTER JOIN 'folder_id_phone_list' AS phone_list
> >      ON phone_list.uid = summary.uid
> > LEFT OUTER JOIN 'folder_id_email_list' AS email_list
> >      ON email_list.uid = summary.uid
> > WHERE (phone_list.value IS NOT NULL AND phone_list.value LIKE '%0505')
> >    AND (email_list.value IS NOT NULL AND email_list.value LIKE 'eddie%')
>
> Why are you using outer joins when your WHERE clause discards unmatched
> records anyway? If you replace LEFT OUTER with INNER, the end result
> would be exactly the same.
>
> I have a strong feeling that, once you replace outer joins with regular
> joins, this statement would run just as fast as your convoluted one with
> nested selects. By using outer joins, you prevent SQLite from reordering
> the conditions and using the most efficient search strategy - it is
> forced to perform those joins left to right, which results in bad
> performance because, apparently, you don't have indexes on
> phone_list.uid or email_list.uid.
>
> Try this straightforward query:
>
> SELECT DISTINCT summary.uid, summary.vcard FROM folder_id AS summary
> JOIN folder_id_phone_list AS phone_list
>      ON phone_list.uid = summary.uid
> JOIN folder_id_email_list AS email_list
>      ON email_list.uid = summary.uid
> WHERE phone_list.value LIKE '%0505'  AND email_list.value LIKE 'eddie%';

This has been a great help, I've run benchmarks and this does perform
well with large data sets (large for addressbooks at least).

But I have now one more question.

I have a tradeoff that depends on whether I create an index
on email_list.uid or not.

The reason it was important to create an index on the uid
column (related to the primary key in the main summary table),
is because it speeds up inserts.

Maybe my insert statements are wrong, what I do at insert time
is first:
   DELETE FROM 'email_list' WHERE uid = 'the uid to remove'

And then I run the normal INSERT OR (REPLACE/FAIL) INTO
the summary table, and then populate the email_list table
again (it's important to replace all the rows in the email_list
table with the new ones from the new vCard).

When we have huge books, the index on email_list.uid helps
a lot.

However, in a statement formed like the one you proposed
above, it screws with the query optimizer in SQLite I suspect,
i.e. when searching for a prefix on an email address,
SQLite (I suspect) decides to prioritize the UID index instead
of the perfectly good (much better even) index on email_list.value.

For exact matches, it would seem email_list.value is correctly
chosen for index traversal, for prefix matches, SQLite seems to
want to just prioritize the UID.

Here are my results:

Prefix matches on the email_list.value column:
  https://people.gnome.org/~tvb/filter-by-short-email-address-prefix.png

Exact matches on the email_list.value column:
  https://people.gnome.org/~tvb/filter-by-email-address.png

Inserting contacts:
  https://people.gnome.org/~tvb/contact-saving.png

The green line you can ignore, we're interested in "Master" and
"Experimental".

Master adds no index to the email_list.uid column

Experimental adds an index to the email_list.uid column

So, is there a way that I can tell SQLite forcibly to
prioritize the index on email_list.value when making
a prefix match ?

Would an explicit "COLLATE email_list.value" placed
somewhere in the above query help ?

Thanks everyone for your help, I've come a long way with
the advice I've gotten from this list.

Best,
    -Tristan


>
> >     o Leverage the index which I've created on 'folder_id_email_list'
> >       (I am using case insensitive LIKE statements so the indexes
> >       work in that statement).
>
> Normally, you need case-sensitive LIKE in order to use the index, unless
> the index is created with COLLATE NOCASE. You could use EXPLAIN QUERY
> PLAN to confirm that the index is indeed being utilized.


_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Igor Tandetnik-2
On 11/30/2013 12:40 PM, Tristan Van Berkom wrote:
> However, in a statement formed like the one you proposed
> above, it screws with the query optimizer in SQLite I suspect,
> i.e. when searching for a prefix on an email address,
> SQLite (I suspect) decides to prioritize the UID index instead
> of the perfectly good (much better even) index on email_list.value.

You can suppress the use of this index by writing " ON +email_list.uid =
summary.uid ". Note the unary plus. It doesn't change the meaning of the
expression, but makes it ineligible for participating in an index.
--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Donald Griggs
Tristan,

My apologies to you and the list if you mentioned this earlier, but I
assume you've run the analyze command on your database, right?

http://www.sqlite.org/lang_analyze.html

Also possibly relevant:  http://www.sqlite.org/compile.html#enable_stat3

(Of course, Igor's suggestion of the uninary "+" to sent a hint to the
sqlite query planner gives an immediate solution to your particular
question.)
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Simon Slavin-3
In reply to this post by Tristan Van Berkom

On 30 Nov 2013, at 5:40pm, Tristan Van Berkom <[hidden email]> wrote:

> So, is there a way that I can tell SQLite forcibly to
> prioritize the index on email_list.value when making
> a prefix match ?

Don't use LIKE or GLOB for prefix matches.  Although you as a human can tell that

>>> email_list.value LIKE 'eddie%'

is a prefix match, all the computer sees is pattern-matching.  This makes it try all the available combinations.  Instead use

email_list.value BETWEEN 'eddie' AND 'eddie}'

(I chose the close curly bracket because it has a very high code and will sort near the end of the possibiliites.)

Using BETWEEN allows SQLite to use any available index for searching and sorting.  In this case it's equivalent to saying "We don't have to look through all 120 pages of this book, we need only pages 34 to 49.".  In the case of the SELECT you show, modified for my suggestion, a good index would be something like

CREATE INDEX fiel_uv ON folder_id_email_list (uid, value)

This allows a specific match on a uid and then partial matching on value, which is what it wants.  This can replace the existing index you mention.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Tristan Van Berkom
On Sun, 2013-12-01 at 00:40 +0000, Simon Slavin wrote:

> On 30 Nov 2013, at 5:40pm, Tristan Van Berkom <[hidden email]> wrote:
>
> > So, is there a way that I can tell SQLite forcibly to
> > prioritize the index on email_list.value when making
> > a prefix match ?
>
> Don't use LIKE or GLOB for prefix matches.  Although you as a human can tell that
>
> >>> email_list.value LIKE 'eddie%'
>
> is a prefix match, all the computer sees is pattern-matching.  This makes it try all the available combinations.  Instead use
>
> email_list.value BETWEEN 'eddie' AND 'eddie}'
>
> (I chose the close curly bracket because it has a very high code and will sort near the end of the possibiliites.)
>

I'm not comfortable with this.

Do you have a suggestion which does not make an assumption about what
is the highest value of a character ?

Perhaps there is a definite and clear way to find the highest possible
character ?

Using case sensitive LIKE statements *do* allow SQLite to traverse the
index on email_list.value, perhaps not as optimized as the trick you
propose but at least it does not compromise the user facing API.

This sounds like the kind of trick you can get away with when you
are in control of the data which is added to your DB, and the query
terms which will be issued against it (neither of these apply
in my case).

> Using BETWEEN allows SQLite to use any available index for searching and sorting.  In this case it's equivalent to saying "We don't have to look through all 120 pages of this book, we need only pages 34 to 49.".  In the case of the SELECT you show, modified for my suggestion, a good index would be something like
>
> CREATE INDEX fiel_uv ON folder_id_email_list (uid, value)
>
> This allows a specific match on a uid and then partial matching on value, which is what it wants.  This can replace the existing index you mention.
>

Will this work without using the BETWEEN statement you describe above ?
(I think this is orthogonal to the LIKE statement, but I should still
ask to be sure).

Would this replace:

  o The index created on email_list.uid, for the purpose of
    optimizing the DELETES involved in replacing existing entries ?

  o The index on email_list.value for optimized prefix searches
    and exact matches ?

  o Both of the above ?

I suppose with this I could just remove the '+' from:
   " ON +email_list.uid = summary.uid " ?

This fancy index definitely interests me, and looks like it
would do the right thing to optimize phone number or email
matches for addressbooks of over 100,000 contacts.

However, as you can see it's not much of an issue at this point:
https://people.gnome.org/~tvb/eds-benchmarks-30-11-2013/filter-by-short-email-address-prefix.png

Even with 200,000 contacts, the short email prefix match is now
down to a couple milliseconds.

I would like to flatten that red line a bit more, but it would
be purely to satisfy my sense of perfectionism ;-)


Thank you very much for sharing your knowledge with me :)

Best,
    -Tristan

> Simon.
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Simon Slavin-3

On 1 Dec 2013, at 4:51am, Tristan Van Berkom <[hidden email]> wrote:

> Do you have a suggestion which does not make an assumption about what
> is the highest value of a character ?

No.  The trick is common in many computer languages and I've never seen a good formulaic way of doing that.  You generally see that the programmer has picked some high-placed character.

It provides /huge/ increases in speed in big databases so it can be worth doing, but it may not make a big difference in your case.

> Will this work without using the BETWEEN statement you describe above ?

In SQLite

a BETWEEN b AND c

is exactly equivalent to

a >= b AND a <= c

In fact I believe that the substitution is made at the parsing level rather than involving low-level changes to planning.  So you can just consider it two separate restrictions.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Igor Tandetnik-2
In reply to this post by Simon Slavin-3
On 11/30/2013 7:40 PM, Simon Slavin wrote:
> Don't use LIKE or GLOB for prefix matches.  Although you as a human can tell that
>
>>>> email_list.value LIKE 'eddie%'
>
> is a prefix match, all the computer sees is pattern-matching.

False. See http://sqlite.org/optoverview.html#like_opt
--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical 'AND'

Igor Tandetnik-2
In reply to this post by Tristan Van Berkom
On 11/30/2013 11:51 PM, Tristan Van Berkom wrote:
> Perhaps there is a definite and clear way to find the highest possible
> character ?

U+10FFFF, in a Unicode encoding of your choice.
--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Need advice on using nested selects in JOIN statements as a logical \'AND\'

David Woodhouse
In reply to this post by Igor Tandetnik-2

On Wed, 27 Nov 2013 at 21:21:43 -0800, Igor Tandetnik wrote
> Why are you using outer joins when your WHERE clause discards
> unmatched records anyway? If you replace LEFT OUTER with INNER, the
> end result would be exactly the same.

Not for all queries.

Consider the query
 (or (beginswith "full_name" "foo") (beginswith "email" "foo"))

The full_name field is in the main table, since each vcard only has
*one* full_name field. There can be multiple email addresses, so those
are in an auxiliary table.

The SQL query becomes:

 SELECT DISTINCT summary.uid, summary.vcard from 'folder_id' AS summary
 LEFT OUTER JOIN 'folder_id_email_list' AS email_list
     ON email_list.uid = summary.uid
 WHERE (email_list.value like 'foo%')
    OR (summary.full_name like 'foo%');

If that *isn't* a LEFT OUTER JOIN, and there's a record in the main
'folder_id' table which doesn't have *any* email addresses, then it will
be incorrectly omitted from the results.

As it happens, the performance on the LEFT OUTER JOIN sucks somewhat,
since it doesn't use any indices on folder_id(full_name):

0|0|0|SCAN TABLE folder_id AS summary USING INDEX sqlite_autoindex_folder_id_1
0|1|1|SEARCH TABLE folder_id_email_list AS email_list USING INDEX UID_INDEX_email_folder_id (uid=?)
0|0|0|USE TEMP B-TREE FOR DISTINCT

There's an optimisation there which maybe it would be nice if sqlite
would see for itself. It works a *whole* lot faster if we express it
thus:

 SELECT DISTINCT summary.uid, summary.vcard from 'folder_id' as summary
 JOIN 'folder_id_email_list' as email_list
     ON +email_list.uid = summary.uid
  WHERE (email_list.value like 'foo%')
UNION
 SELECT summary.uid, summary.vcard from 'folder_id' as summary
  WHERE (summary.full_name like 'foo%');

(Note that this time it *can* be an INNER JOIN.)

1|0|1|SEARCH TABLE folder_id_email_list AS email_list USING INDEX INDEX_email_folder_id (value>? AND value<?)
1|1|0|SEARCH TABLE folder_id AS summary USING INDEX sqlite_autoindex_folder_id_1 (uid=?)
2|0|0|SEARCH TABLE folder_id AS summary USING INDEX INDEX_full_name_folder_id (full_name>? AND full_name<?)
0|0|0|COMPOUND SUBQUERIES 1 AND 2 USING TEMP B-TREE (UNION)

FWIW on my data set of about 240,000 records from the company address
book, the top query runs in about 1400ms while the last one takes about
6ms. Since this is the query that happens when we start typing into the
To: or Cc: field of an email composer window, we want it to be fast :)

Tristan's original query was about using nested SELECT to implement
logical 'AND'. I don't care much about that, but I sure as hell want him
to use UNION to implement logical 'OR'.

Unless there's a better way? Should we be asking for sqlite to actually
spot the optimisation for itself?

--
David Woodhouse                            Open Source Technology Centre
[hidden email]                              Intel Corporation

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
12