简体   繁体   中英

Exclude a string starting with certain letter by a SQL query in Google BigQuery

I'm trying to exclude values from a larger public table in Googles BigQuery using the following SQL lines. The last line has the purpose to exclude entries, that start with a certain letter, eg 'C'. For some reason, when I add the line, the count increases. Logically the selected rows should decrease and I can't figure out why.

How can I make the exclusion work?

SELECT Count(*)
FROM `patents-public-data.patents.publications`,
unnest(description_localized) as description_unnest,
unnest(claims_localized) as claims_unnest,
unnest(cpc) as cpc_unnest
where description_unnest.language in ('en','EN')
and claims_unnest.language in ('en','EN')
and publication_date >19900101
and (SUBSTRING(cpc_unnest.code,1,1) <> 'C');

OK. I think I found the mistakes I made

  1. I compared the number of cases with and without this line #7. That was leading to the increased rows.

#7 unnest(cpc) as cpc_unnest

  1. THIS IS MOST IMPORTANT: I did not want to know the number of rows, but the number of unique entries. As the table is build up according to the publication numbers, I can use this number to search for unique entries. The SQL command Count(DISTINCT attribut) can be used:

SELECT Count(DISTINCT publication_number)

The whole solution is this

SELECT Count(DISTINCT publication_number)
FROM `patents-public-data.patents.publications`,
unnest(description_localized) as description_unnest,
unnest(claims_localized) as claims_unnest,
unnest(cpc) as cpc_unnest
where description_unnest.language in ('en','EN')
and claims_unnest.language in ('en','EN')
and publication_date >19900101
and (SUBSTRING(cpc_unnest.code,1,1) <> 'C')

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM