How to select rows with exactly 2 values in a column fast within a table that has 10 million records?

I have a table (TestFI) with the following data for instance

FIID   Email
---------
null a@a.com
1    a@a.com   
null b@b.com    
2    b@b.com    
3    c@c.com    
4    c@c.com    
5    c@c.com    
null d@d.com    
null d@d.com

and I need the emails that appear exactly twice AND have one row where FIID is null and one where it is not. For the data above, only "a@a.com" and "b@b.com" fit the bill.

I was able to construct a multilevel query like so

Select
    FIID,
    Email
from
    TestFI
where
    Email in
    (
        Select
            Email
        from
        (
            Select
                Email
            from
                TestFI
            where
                Email in
                (
                    select
                        Email
                    from
                        TestFI
                    where
                        FIID is null or FIID is not null
                    group by Email
                    having
                        count(Email) = 2
                )
                and
                FIID is null
        ) as Temp1
        group by Email
        having count(Email) = 1
    )

However, it took nearly 10 minutes to go through 10 million records. Is there a better way to do this? I know I must be doing some dumb things here.

Thanks

I would try this query:

SELECT   Email, MAX(FIID)
FROM     TestFI
GROUP BY Email
HAVING   COUNT(*)=2 AND COUNT(FIID)=1

It returns the Email column and the non-null value of FIID. COUNT(*) counts every row while COUNT(FIID) skips NULLs, so the other row's FIID must be null.
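To check the aggregation against the sample data from the question, here is a minimal sketch that runs it in an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in (the question does not say which database is used), and the function name is made up for illustration.

```python
import sqlite3

# Sample rows copied from the question; None stands for SQL NULL.
ROWS = [
    (None, "a@a.com"), (1, "a@a.com"),
    (None, "b@b.com"), (2, "b@b.com"),
    (3, "c@c.com"), (4, "c@c.com"), (5, "c@c.com"),
    (None, "d@d.com"), (None, "d@d.com"),
]


def emails_with_one_null_pair(rows):
    """Return (Email, non-null FIID) for emails with exactly two rows, one of them NULL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TestFI (FIID INTEGER, Email TEXT)")
    conn.executemany("INSERT INTO TestFI (FIID, Email) VALUES (?, ?)", rows)
    # COUNT(*) counts every row in the group; COUNT(FIID) ignores NULLs.
    result = conn.execute(
        """
        SELECT   Email, MAX(FIID)
        FROM     TestFI
        GROUP BY Email
        HAVING   COUNT(*) = 2 AND COUNT(FIID) = 1
        ORDER BY Email
        """
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(emails_with_one_null_pair(ROWS))
```

On the sample rows only a@a.com and b@b.com survive: c@c.com has three rows and d@d.com has no non-null FIID.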

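With an index on (Email, FIID), I would be tempted to try:

select  tnull.*, tnotnull.*
from testfi tnull join
     testfi tnotnull
     on tnull.email = tnotnull.email left outer join
     testfi tnothing
     -- anti-join: look for a further row with a different non-null FIID;
     -- joining on email alone would match every row to itself
     on tnull.email = tnothing.email and
        tnothing.FIID <> tnotnull.FIID
where tnothing.email is null and
      tnull.FIID is null and
      tnotnull.FIID is not null;

Performance definitely depends on the database. This will keep all the accesses within the index. In some databases, an aggregation might be faster. Performance also depends on the selectivity of the queries. For instance, if there is only one NULL record and you have the index (FIID, Email), this should be much faster than an aggregation. Note that the anti-join only rules out a second non-null FIID; an email with two NULL rows plus one non-null row would still be returned, so the table needs a unique row key (or the aggregation) to exclude that case.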
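Whether a particular engine really answers from the index can be read off its query plan. The sketch below uses SQLite through Python's sqlite3 module as a stand-in, creates a composite index on (Email, FIID), and prints the plan for the aggregation form of the query. The index name `idx_testfi_email_fiid` and the helper names are hypothetical, and the plan wording differs between engines and versions, so no particular output is claimed.

```python
import sqlite3

# Sample rows copied from the question; None stands for SQL NULL.
ROWS = [
    (None, "a@a.com"), (1, "a@a.com"),
    (None, "b@b.com"), (2, "b@b.com"),
    (3, "c@c.com"), (4, "c@c.com"), (5, "c@c.com"),
    (None, "d@d.com"), (None, "d@d.com"),
]

QUERY = """
    SELECT   Email, MAX(FIID)
    FROM     TestFI
    GROUP BY Email
    HAVING   COUNT(*) = 2 AND COUNT(FIID) = 1
    ORDER BY Email
"""


def build_indexed_table(rows):
    """Create TestFI in memory, load the rows, and add the composite index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TestFI (FIID INTEGER, Email TEXT)")
    conn.executemany("INSERT INTO TestFI (FIID, Email) VALUES (?, ?)", rows)
    # Hypothetical index name; both columns the query touches are in the index.
    conn.execute("CREATE INDEX idx_testfi_email_fiid ON TestFI (Email, FIID)")
    return conn


def query_plan(conn, sql):
    """Return the human-readable plan steps (last column of each plan row)."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


if __name__ == "__main__":
    conn = build_indexed_table(ROWS)
    for step in query_plan(conn, QUERY):
        print(step)
    print(conn.execute(QUERY).fetchall())
    conn.close()
```

On a real 10-million-row table the same check would be done with the target database's own EXPLAIN facility, since the optimizer's choice depends on its statistics.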

Maybe something like ...

select
  a.FIID,
  a.Email

from
  TestFI a
  inner join TestFI b on (a.Email=b.Email)

where
  a.FIID is not null
  and b.FIID is null
;

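And make sure Email and FIID are indexed. Note that this join only pairs a null row with a non-null row; on its own it does not check that the email appears exactly twice.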
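A small experiment shows what this self-join does and does not check. The sketch runs it in SQLite via Python's sqlite3 module (a stand-in engine), first on the question's sample rows and then with three extra e@e.com rows that are hypothetical and not part of the question.

```python
import sqlite3

# Sample rows copied from the question; None stands for SQL NULL.
SAMPLE_ROWS = [
    (None, "a@a.com"), (1, "a@a.com"),
    (None, "b@b.com"), (2, "b@b.com"),
    (3, "c@c.com"), (4, "c@c.com"), (5, "c@c.com"),
    (None, "d@d.com"), (None, "d@d.com"),
]

# Hypothetical extra rows: one NULL and two non-NULL FIIDs for the same email.
EXTRA_ROWS = [(None, "e@e.com"), (6, "e@e.com"), (7, "e@e.com")]

JOIN_QUERY = """
    SELECT a.FIID, a.Email
    FROM   TestFI a
           INNER JOIN TestFI b ON a.Email = b.Email
    WHERE  a.FIID IS NOT NULL
      AND  b.FIID IS NULL
    ORDER BY a.Email, a.FIID
"""


def run_join(rows):
    """Load rows into an in-memory TestFI table and run the self-join."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TestFI (FIID INTEGER, Email TEXT)")
    conn.executemany("INSERT INTO TestFI (FIID, Email) VALUES (?, ?)", rows)
    result = conn.execute(JOIN_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(run_join(SAMPLE_ROWS))
    print(run_join(SAMPLE_ROWS + EXTRA_ROWS))
```

On the sample rows the join happens to give the wanted result, but with the extra rows e@e.com is returned as well, even though it appears three times. A row-count condition is still needed in the general case.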

I need records that appear exactly twice AND have 1 row with FIID is null and one is not

1

On the innermost select, group by email having count = 2 (only the grouped column can be selected here, so the coalesce is applied in the next step):

        select email from T
        group by email having count(email) = 2

2

        -- assumes real FIID values are never negative, so -1 can stand for null
        select email
        from
        (
          select email, coalesce(fiid,-1) as AdjustedFIID from T
          where email in
            (select email from T group by email having count(email) = 2)
        )  as X
        group by email
        having min(AdjustedFIID) = -1 and max(AdjustedFIID) > -1
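The two steps can be combined and tried on the question's sample data. This is a sketch under assumptions: it uses SQLite through Python's sqlite3 module as a stand-in engine, names the table TestFI where the answer writes T, assumes real FIID values are never negative so that -1 can mark a null, and keeps only grouped or aggregated columns in each select list. The function name is made up for illustration.

```python
import sqlite3

# Sample rows copied from the question; None stands for SQL NULL.
ROWS = [
    (None, "a@a.com"), (1, "a@a.com"),
    (None, "b@b.com"), (2, "b@b.com"),
    (3, "c@c.com"), (4, "c@c.com"), (5, "c@c.com"),
    (None, "d@d.com"), (None, "d@d.com"),
]

COALESCE_QUERY = """
    SELECT Email
    FROM (
        -- Step 2 input: replace NULL with the sentinel -1
        SELECT Email, COALESCE(FIID, -1) AS AdjustedFIID
        FROM   TestFI
        -- Step 1: keep only emails that occur exactly twice
        WHERE  Email IN (
            SELECT Email FROM TestFI GROUP BY Email HAVING COUNT(*) = 2
        )
    ) AS X
    GROUP BY Email
    -- one row was NULL (min is the sentinel) and one was not (max is above it)
    HAVING MIN(AdjustedFIID) = -1 AND MAX(AdjustedFIID) > -1
    ORDER BY Email
"""


def emails_via_sentinel(rows):
    """Return emails with exactly two rows, one NULL FIID and one non-NULL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE TestFI (FIID INTEGER, Email TEXT)")
    conn.executemany("INSERT INTO TestFI (FIID, Email) VALUES (?, ?)", rows)
    result = [row[0] for row in conn.execute(COALESCE_QUERY)]
    conn.close()
    return result


if __name__ == "__main__":
    print(emails_via_sentinel(ROWS))
```

The sentinel approach scans the table twice (once for the count, once for the min/max), which is worth keeping in mind at 10 million rows.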
