
How to get counts for all 3 components of two tables using SQL?

Suppose I want to find, for two tables A and B, the counts of all records that are in A but not B, all records that are in A and B, and all records that are in B but not in A. I don't want the actual records, just the counts of all 3 components (think of a Venn diagram).

When I say, for example, records in A and B, I mean a count of all records that have identical values for, say, four variables (like ID, Year, Month, Day).

Is there a snazzy query that will return these counts efficiently?

For all that are in A and B, it's a simple JOIN:

SELECT COUNT(*)
FROM A
JOIN B ON A.ID = B.ID AND A.Year = B.Year AND A.Month = B.Month AND A.Day = B.Day

Note that this assumes that the combination (ID, Year, Month, Day) is unique in each table; if there are duplicates, every matching pair of rows is counted, so a combination that appears m times in A and n times in B contributes m * n to the count. If ID is a unique key in both tables, that isn't a problem.
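If the combination might not be unique, one way to avoid the inflated count is to count distinct key combinations and test for a match with EXISTS instead of joining. This is only a sketch under that assumption; the alias a_keys is made up for the example.

```sql
-- Count each distinct (ID, Year, Month, Day) from A once,
-- no matter how many matching rows exist in A or B.
SELECT COUNT(*) AS A_and_B_count
FROM (SELECT DISTINCT ID, Year, Month, Day FROM A) AS a_keys  -- a_keys: hypothetical alias
WHERE EXISTS (
    SELECT 1
    FROM B
    WHERE B.ID = a_keys.ID
      AND B.Year = a_keys.Year
      AND B.Month = a_keys.Month
      AND B.Day = a_keys.Day
);
```

Swapping EXISTS for NOT EXISTS gives the corresponding "in A but not in B" count.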

For all A's not in B, use a LEFT JOIN:

SELECT COUNT(*)
FROM A
LEFT JOIN B ON A.ID = B.ID AND A.Year = B.Year AND A.Month = B.Month AND A.Day = B.Day
WHERE B.ID IS NULL

For all B's not in A, do the same thing but reverse the roles of A and B:

SELECT COUNT(*)
FROM B
LEFT JOIN A ON A.ID = B.ID AND A.Year = B.Year AND A.Month = B.Month AND A.Day = B.Day
WHERE A.ID IS NULL

You could combine the first two into a single query:

SELECT SUM(B.ID IS NOT NULL) AS A_and_B_count, SUM(B.ID IS NULL) AS A_not_B_count
FROM A
LEFT JOIN B ON A.ID = B.ID AND A.Year = B.Year AND A.Month = B.Month AND A.Day = B.Day

But I don't think it's possible to include the third query in this. That would require a FULL OUTER JOIN, which MySQL doesn't have.
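For comparison only, on a database that does support FULL OUTER JOIN (PostgreSQL, for example, not MySQL), all three counts could come from one join. This is a sketch that assumes the key combination is unique in each table and that ID is never NULL; it uses CASE expressions because SUM over a boolean is MySQL-specific.

```sql
-- Not valid in MySQL: relies on FULL OUTER JOIN.
SELECT SUM(CASE WHEN A.ID IS NOT NULL AND B.ID IS NOT NULL THEN 1 ELSE 0 END) AS A_and_B_count,
       SUM(CASE WHEN B.ID IS NULL THEN 1 ELSE 0 END) AS A_not_B_count,
       SUM(CASE WHEN A.ID IS NULL THEN 1 ELSE 0 END) AS B_not_A_count
FROM A
FULL OUTER JOIN B ON A.ID = B.ID AND A.Year = B.Year AND A.Month = B.Month AND A.Day = B.Day;
```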

For all these queries, make sure that at least one of the columns you're comparing is indexed, otherwise they will be very slow; the more of the compared columns that are indexed, the better. If any one of them is unique (e.g. the ID field), an index on that column should be sufficient.

The queries to obtain those counts will be most efficient if suitable indexes are available (preferably on both tables) with the columns you want to compare as the leading columns, e.g.

ON `table_A` (`id`, `year`, `month`, `day`)
ON `table_B` (`id`, `year`, `month`, `day`)
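Written out in full, the index definitions might look like the sketch below. The index names table_a_ix1 and table_b_ix1 are placeholders invented for this example; use whatever naming convention you already have.

```sql
-- Hypothetical index names; the column order matches the join predicates.
CREATE INDEX table_a_ix1 ON `table_A` (`id`, `year`, `month`, `day`);
CREATE INDEX table_b_ix1 ON `table_B` (`id`, `year`, `month`, `day`);

-- Inspect the plan of one of the count queries to see whether the indexes are used.
EXPLAIN
SELECT COUNT(1) AS cnt
  FROM `table_A` a
  LEFT JOIN `table_B` b
    ON b.id = a.id AND b.year = a.year AND b.month = a.month AND b.day = a.day
 WHERE b.id IS NULL;
```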

With these indexes available, MySQL can satisfy some queries entirely from the indexes (EXPLAIN output will show "Using index").

Assuming that the combination of these columns is UNIQUE in each table...

To get the count of rows in a that don't have a matching row in b, we can use an anti-join pattern: return all rows from a, along with any matching rows from b, and then exclude the rows that found matches, so we are left with the rows from a that didn't have a match. Note that this is an "outer" join, with a predicate in the WHERE clause that tests for a NULL value in a column of b that would be non-NULL on any matching row:

SELECT COUNT(1)  AS cnt
  FROM Table_A a
  LEFT
  JOIN Table_B b
    ON b.id    = a.id
   AND b.year  = a.year
   AND b.month = a.month
   AND b.day   = a.day 
 WHERE b.id IS NULL

To get the count of rows in b that don't have a matching row in a, it's the same query with the roles of the two tables reversed:

SELECT COUNT(1) AS cnt
  FROM Table_B b
  LEFT
  JOIN Table_A a
    ON a.id    = b.id
   AND a.year  = b.year
   AND a.month = b.month
   AND a.day   = b.day 
 WHERE a.id IS NULL

To get a count of rows that are in both a and b, we could use an inner join:

SELECT COUNT(1)  AS cnt
  FROM Table_A a
 INNER
  JOIN Table_B b
    ON b.id    = a.id
   AND b.year  = a.year
   AND b.month = a.month
   AND b.day   = a.day 

These queries can be combined into a single query using the UNION ALL set operator; we'd want to include a discriminator column in each query to tell us which query returned which row.

Or, they could be run as subqueries in the SELECT list, or as inline views.
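As a rough sketch of the UNION ALL variant mentioned above, each branch returns one row tagged with a literal. The discriminator column name src and its values are made up for this example.

```sql
-- One row per component; src says which count the row holds.
SELECT 'in_a_not_b' AS src, COUNT(1) AS cnt
  FROM Table_A a
  LEFT JOIN Table_B b
    ON b.id = a.id AND b.year = a.year AND b.month = a.month AND b.day = a.day
 WHERE b.id IS NULL
UNION ALL
SELECT 'in_b_not_a', COUNT(1)
  FROM Table_B b
  LEFT JOIN Table_A a
    ON a.id = b.id AND a.year = b.year AND a.month = b.month AND a.day = b.day
 WHERE a.id IS NULL
UNION ALL
SELECT 'in_a_and_b', COUNT(1)
  FROM Table_A a
 INNER JOIN Table_B b
    ON b.id = a.id AND b.year = a.year AND b.month = a.month AND b.day = a.day;
```

This still runs three separate joins, so it is mainly a convenience for getting everything back in one result set.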


For improved performance, we could combine two of the queries, getting the "in a and b" count and the "in a not b" count from a single query.

To get all three counts in a single query, I'd probably use two inline views, something like this:

SELECT c.in_a_and_b
     , c.in_a_not_b
     , d.in_b_not_a
  FROM ( SELECT IFNULL(SUM(b.id IS NOT NULL),0) AS `in_a_and_b`
              , IFNULL(SUM(b.id IS NULL),0)     AS `in_a_not_b`
           FROM Table_A a
           LEFT
           JOIN Table_B b
             ON b.id    = a.id
            AND b.year  = a.year
            AND b.month = a.month
            AND b.day   = a.day 
       ) c
 CROSS
  JOIN ( SELECT COUNT(1) AS `in_b_not_a`
           FROM Table_B b
           LEFT
           JOIN Table_A a
             ON a.id    = b.id
            AND a.year  = b.year
            AND a.month = b.month
            AND a.day   = b.day 
          WHERE a.id IS NULL
       ) d

You can use union (which automatically removes duplicates) to get a master table of all unique rows and left join that table to tables a and b to get your counts.

This assumes that tables a and b don't contain duplicates within the table (otherwise the left joins will produce inflated counts).

select 
    count(all_rows.id) total_unique_count,
    sum(a.id is not null and b.id is not null) in_both_count,
    sum(a.id is not null and b.id is null) only_in_a_count,
    sum(a.id is null and b.id is not null) only_in_b_count
from (
    select id, year, month, day from tablea
    union
    select id, year, month, day from tableb
) all_rows
left join tablea a 
    on a.id = all_rows.id
    and a.year = all_rows.year
    and a.month = all_rows.month
    and a.day = all_rows.day
left join tableb b
    on b.id = all_rows.id
    and b.year = all_rows.year
    and b.month = all_rows.month
    and b.day = all_rows.day
