why is count of rows in LEFT JOIN more than the rows in the left table

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

why is count of rows in LEFT JOIN more than the rows in the left table

Puneet Kishor-2
Could someone please shed some light on this mystery for me?

        1. SELECT Count(*) FROM a;
                329686

        2. SELECT Count(*) FROM a WHERE a.col IS NULL;
                3

        3. SELECT Count(*) FROM b;
                6268

        4. SELECT Count(*) FROM b WHERE b.col IS NULL;
                0

        5. SELECT Count(*) FROM a LEFT JOIN b ON a.col = b.col;
                329743

        6. SELECT Count(*) FROM a LEFT JOIN b ON a.col = b.col WHERE a.col IS NOT NULL;
                329740

Of course, given #5, #6 is as expected because of #3.

But, why is #5 329743 and not 329686? Where are the extra 57 rows coming from? I thought a LEFT JOIN was supposed to include *all* the rows from the left table with NULLs for the columns of the right table where there was no match. So, at most, the resulting table would have as many rows as the rows in the left table. Certainly not more.

Seems like my understanding is wrong.


--
Puneet Kishor


_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: why is count of rows in LEFT JOIN more than the rows in the left table

Dan Kennedy-4
On 02/20/2012 10:16 PM, Puneet Kishor wrote:

> Could someone please shed some light on this mystery for me?
>
> 1. SELECT Count(*) FROM a;
> 329686
>
> 2. SELECT Count(*) FROM a WHERE a.col IS NULL;
> 3
>
> 3. SELECT Count(*) FROM b;
> 6268
>
> 4. SELECT Count(*) FROM b WHERE b.col IS NULL;
> 0
>
> 5. SELECT Count(*) FROM a LEFT JOIN b ON a.col = b.col;
> 329743
>
> 6. SELECT Count(*) FROM a LEFT JOIN b ON a.col = b.col WHERE a.col IS NOT NULL;
> 329740
>
> Of course, given #5, #6 is as expected because of #3.
>
> But, why is #5 329743 and not 329686? Where are the extra 57 rows coming from? I thought a LEFT JOIN was supposed to include *all* the rows from the left table with NULLs for the columns of the right table where there was no match. So, at most, the resulting table would have as many rows as the rows in the left table. Certainly not more.
>
> Seems like my understanding is wrong.

Perhaps there exist some a.col values for which there is
more than one row in b where (a.col=b.col).
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: why is count of rows in LEFT JOIN more than the rows inthe left table

Igor Tandetnik
In reply to this post by Puneet Kishor-2
Puneet Kishor <[hidden email]> wrote:
> But, why is #5 329743 and not 329686? Where are the extra 57 rows coming from? I thought a LEFT JOIN was supposed to include
> *all* the rows from the left table with NULLs for the columns of the right table where there was no match. So, at most, the
> resulting table would have as many rows as the rows in the left table. Certainly not more.  

The conclusion doesn't follow from the premise. Yes, a row from the left table is included in the resultset even if there are no matching rows in the right table. Which means that the resultset would contain *at least* as many rows as there are in the left table - but it certainly may contain more. For each row in the left table that has N matching rows in the right table (with N > 0), there will be N rows in the resultset.
--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: why is count of rows in LEFT JOIN more than the rows inthe left table

Puneet Kishor-2

On Feb 20, 2012, at 9:34 AM, Igor Tandetnik wrote:

> Puneet Kishor <[hidden email]> wrote:
>> But, why is #5 329743 and not 329686? Where are the extra 57 rows coming from? I thought a LEFT JOIN was supposed to include
>> *all* the rows from the left table with NULLs for the columns of the right table where there was no match. So, at most, the
>> resulting table would have as many rows as the rows in the left table. Certainly not more.  
>
> The conclusion doesn't follow from the premise. Yes, a row from the left table is included in the resultset even if there are no matching rows in the right table. Which means that the resultset would contain *at least* as many rows as there are in the left table - but it certainly may contain more. For each row in the left table that has N matching rows in the right table (with N > 0), there will be N rows in the resultset.

Thanks for the great explanation. Now, the follow up question --

is there are way to force the query to return only the "first" match so the resulting query has only as many rows as the left table?



--
Puneet Kishor

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: why is count of rows in LEFT JOIN more than the rows inthe left table

Simon Slavin-3

On 20 Feb 2012, at 4:11pm, Puneet Kishor <[hidden email]> wrote:

> is there are way to force the query to return only the "first" match so the resulting query has only as many rows as the left table?

You have two problems:

A) You are JOINing two columns which contain some null values.  null is interpreted as 'I don't know', so the joins tend to match many different rows.

B) You are using a LEFT JOIN which is explicitly defined to include many matches, not just one match.  Instead you might try NATURAL JOIN or CROSS JOIN to see what difference they make.

And you have a meta-problem too: the dataset you're experimenting with is so big you can't see why it's doing what it's doing.  It might be useful for you to make up a little test database with perhaps 5 rows in each table, just so you can look through every row and see what's going on.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: why is count of rows in LEFT JOIN more than the rows inthe left table

Jay Kreibich
On Mon, Feb 20, 2012 at 05:01:17PM +0000, Simon Slavin scratched on the wall:

>
> On 20 Feb 2012, at 4:11pm, Puneet Kishor <[hidden email]> wrote:
>
> > is there are way to force the query to return only the "first" match
> > so the resulting query has only as many rows as the left table?
>
> You have two problems:
>
> A) You are JOINing two columns which contain some null values.  null is
> interpreted as 'I don't know', so the joins tend to match many
> different rows.

  No, that's not how it works.  If you have a join condition like
  (A.col == B.col), and either A.col or B.col (or both) is NULL, then
  the result of the comparison itself is NULL ("Are these values
  equal?"  "I don't know.").  JOIN operation will only match rows where
  the join condition is TRUE.  NULL is not "good enough" to match rows.

  Think about NATURAL JOINs where the two columns are reduced to only
  one column.  This reduction can only take place if the values are
  exactly the same.

> B) You are using a LEFT JOIN which is explicitly defined to include
> many matches, not just one match.  Instead you might try NATURAL JOIN
> or CROSS JOIN to see what difference they make.

  That's not the issue.  Or rather, it is, but that isn't going to show
  us anything we don't already know.  NATURAL JOIN should return the
  exact same results as example #5: 329743  (assuming "col" is the only
  column name shared between tables), except it will merge the two
  "col" columns into a single column.

  A CROSS JOIN has no join condition and is going to return
  2,066,471,848 rows.  That tells us nothing.

> And you have a meta-problem too: the dataset you're experimenting with
> is so big you can't see why it's doing what it's doing.

  Databases are designed to work with sets.  If you want to debug the
  issue, you need to think in sets as well.  Use the database to answer
  your questions, rather than just scanning the results and doing it in
  your head.

  The statement...

    SELECT COUNT(*) as ct, col FROM a GROUP BY 2 HAVING ct > 1

  ...will tell you which rows have duplicate columns.  You can run this
  on either table.

  If you want to have unique values in the JOIN, the best approach is
  to put a unique index across each "col" column, so that the columns
  have unique values.

  Failing that, replace the table reference in example #6 with
  something like "(SELECT * FROM a GROUP BY col) as a".  That's going
  to give you random rows (there is no concept of a "first" row), but
  it will get you the row count you're looking for.

  You might need to do it with both tables, depending on which table(s)
  has the duplicate "col" values.


   -j

--
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Intelligence is like underwear: it is important that you have it,
 but showing it to the wrong people has the tendency to make them
 feel uncomfortable." -- Angela Johnson
_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: why is count of rows in LEFT JOIN more than the rows inthe left table

Puneet Kishor-2

On Feb 20, 2012, at 12:46 PM, Jay A. Kreibich wrote:

> On Mon, Feb 20, 2012 at 05:01:17PM +0000, Simon Slavin scratched on the wall:
>>
>> On 20 Feb 2012, at 4:11pm, Puneet Kishor <[hidden email]> wrote:
>>
>>> is there are way to force the query to return only the "first" match
>>> so the resulting query has only as many rows as the left table?
>>
>> You have two problems:
>>
>> A) You are JOINing two columns which contain some null values.  null is
>> interpreted as 'I don't know', so the joins tend to match many
>> different rows.
>
>  No, that's not how it works.  If you have a join condition like
>  (A.col == B.col), and either A.col or B.col (or both) is NULL, then
>  the result of the comparison itself is NULL ("Are these values
>  equal?"  "I don't know.").  JOIN operation will only match rows where
>  the join condition is TRUE.  NULL is not "good enough" to match rows.
>
>  Think about NATURAL JOINs where the two columns are reduced to only
>  one column.  This reduction can only take place if the values are
>  exactly the same.
>
>> B) You are using a LEFT JOIN which is explicitly defined to include
>> many matches, not just one match.  Instead you might try NATURAL JOIN
>> or CROSS JOIN to see what difference they make.
>
>  That's not the issue.  Or rather, it is, but that isn't going to show
>  us anything we don't already know.  NATURAL JOIN should return the
>  exact same results as example #5: 329743  (assuming "col" is the only
>  column name shared between tables), except it will merge the two
>  "col" columns into a single column.
>
>  A CROSS JOIN has no join condition and is going to return
>  2,066,471,848 rows.  That tells us nothing.
>
>> And you have a meta-problem too: the dataset you're experimenting with
>> is so big you can't see why it's doing what it's doing.
>
>  Databases are designed to work with sets.  If you want to debug the
>  issue, you need to think in sets as well.  Use the database to answer
>  your questions, rather than just scanning the results and doing it in
>  your head.
>
>  The statement...
>
>    SELECT COUNT(*) as ct, col FROM a GROUP BY 2 HAVING ct > 1
>
>  ...will tell you which rows have duplicate columns.  You can run this
>  on either table.
>
>  If you want to have unique values in the JOIN, the best approach is
>  to put a unique index across each "col" column, so that the columns
>  have unique values.
>
>  Failing that, replace the table reference in example #6 with
>  something like "(SELECT * FROM a GROUP BY col) as a".  That's going
>  to give you random rows (there is no concept of a "first" row), but
>  it will get you the row count you're looking for.


Thanks Jay... brilliant explanation. I ended up finding the dupes in table "b" and re-encoding the column so it had only unique values. You are right, "first" doesn't mean much in this case... I was referring to "the first matching row, and to hell with the subsequent matches." Fortunately, the problem was not too laborious to correct manually.


>
>  You might need to do it with both tables, depending on which table(s)
>  has the duplicate "col" values.

Yup, fortunately I know that; it was table "b". Table "a" has geographic features, and neither do I want to "create new" nor "lose existing" features. Table "b" has attributes to categorize features in table "a". Cleaning up table "b" cleared the problem.



--
Puneet Kishor

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: why is count of rows in LEFT JOIN more than the rowsinthe left table

Igor Tandetnik
In reply to this post by Puneet Kishor-2
Puneet Kishor <[hidden email]> wrote:
> is there are way to force the query to return only the "first" match so the resulting query has only as many rows as the left
> table?

SELECT COUNT(*) FROM (
    SELECT 1 FROM a LEFT JOIN b ON a.col = b.col GROUP BY a.rowid);

--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
[hidden email]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users