简体   繁体   中英

SQL WHERE NOT EXISTS query not returning results

I'm having trouble constrsucting a query and getting it to run on impala. I created the following working query to join two tables:

SELECT *
FROM illuminavariant as vcf, ensembl_genes as ens 
WHERE vcf.filter = "PASS" 
AND vcf.qual > 100 
AND vcf.chromosome = ens.chromosome
AND vcf.position BETWEEN ens.start AND ens.stop

Now I'm trying to write a query that finds all variants WHERE vcf.filter = "PASS" and vcf.qual > 100, but do not have a match by chromosome and position.

I've tried this:

SELECT * 
FROM p7dev.illumina_test, p7dev.ensembl_test
WHERE NOT EXISTS(
  SELECT * 
  FROM p7dev.illumina_test as vcf, p7dev.ensembl_test as ens
  WHERE vcf.chromosome = ens.chromosome  
  AND vcf.position BETWEEN ens.start AND ens.stop  
 )

But that's not returning any results. I think a WITH clause might do the trick, but I'd really appreciate if anybody could help me wrap my head around the logic of how that would work. Thanks so much!

Since you're looking for variants that are not correlated to any ensemble, it seems strange that you would form a cross join of variants and ensembles from which to filter out rows. If that's really what you want, though, then this ought to do it:

SELECT *
FROM illuminavariant as vcf, ensembl_genes as ens 
WHERE vcf.filter = "PASS" 
AND vcf.qual > 100 
AND (
    vcf.chromosome != ens.chromosome
    OR vcf.position < ens.start
    OR vcf.position > ens.stop
  )

That just negates the condition correlating variant rows with ensemble rows.

I suspect what you really want is more like this, though:

SELECT vcf.*
FROM
  illuminavariant as vcf
  LEFT JOIN ensembl_genes as ens 
    ON vcf.chromosome = ens.chromosome
    AND vcf.position BETWEEN ens.start AND ens.stop
WHERE
  vcf.filter = "PASS" 
  AND vcf.qual > 100
  AND ens.chromosome IS NULL

That performs the same join as your first query, but as a left join. The rows that actually represent matches are then filtered out by the ens.chromosome IS NULL condition. It returns only the columns of the variant table, as the whole point is to find variants that have no corresponding row in the ensemble table.

try this...

SELECT * 
FROM p7dev.illumina_test vcf
WHERE NOT EXISTS( SELECT 1
                  FROM p7dev.ensembl_test as ens
                  WHERE vcf.chromosome = ens.chromosome  
                  AND vcf.position BETWEEN ens.start AND ens.stop
                 )
AND vcf.filter = 'PASS' 
AND vcf.qual > 100

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM