
SQL Query validation failure on GCP BigQuery with github_repos dataset

I would like to get a list of all unique repositories on GitHub by using the following command:

SELECT DISTINCT repo_name FROM `bigquery-public-data.github_repos.commits`

However, I get the following error:

Column repo_name of type ARRAY cannot be used in SELECT DISTINCT at [1:17]

In the schema it says repo_name is of type STRING. What am I doing wrong?

repo_name is defined as a "string" with mode "repeated" in the table schema, which roughly means an ARRAY of STRING in BigQuery.

https://cloud.google.com/bigquery/docs/nested-repeated

What does REPEATED field in Google Bigquery mean?
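To make the "repeated string" idea concrete without scanning the public table, here is a minimal sketch that uses a few inline rows. The CTE name sample_commits and the values in it are made up for illustration; only the column names commit and repo_name mirror the real table.

```sql
#standardSQL
-- Hypothetical inline rows standing in for the commits table.
-- repo_name is built as an ARRAY<STRING>, which is what a
-- STRING column with mode REPEATED corresponds to.
WITH sample_commits AS (
  SELECT 'commit_a' AS commit, ['owner1/project', 'owner2/project'] AS repo_name
  UNION ALL
  SELECT 'commit_b', ['owner1/project']
)
SELECT
  commit,
  ARRAY_LENGTH(repo_name) AS repo_count,  -- number of elements in the array
  repo_name                               -- the whole array, one per row
FROM sample_commits;
```

Each row carries a whole array in repo_name, which is why SELECT DISTINCT repo_name is rejected: DISTINCT would have to compare arrays, not strings.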

You can use the query below:

SELECT 
    commit
   , repo_name 
FROM 
   `bigquery-public-data.github_repos.commits`, 
    UNNEST(repo_name) as repo_name 
WHERE 
    commit = 'c87298e36356ac19519a93dee3dfac8ebffe45e8' 

This gives a result like the following:

Row |  commit                                  | repo_name
===================================================================
1   | c87298e36356ac19519a93dee3dfac8ebffe45e8 | noondaysun/sakai
2   | c87298e36356ac19519a93dee3dfac8ebffe45e8 | OpenCollabZA/sakai

As another user posted, the schema of the bigquery-public-data.github_repos.commits table defines the repo_name field as STRING REPEATED, which means that each row's repo_name value is an array of strings. You can see this with the following query:

#standardSQL
SELECT repo_name 
FROM `bigquery-public-data.github_repos.commits` 
LIMIT 100;

To find the distinct repo names, you can use the UNNEST operator to expand each repo_name array into its individual elements. The following query performs a CROSS JOIN with the unnested array, which yields one row per individual repository name in a new column, repo_name_unnest. DISTINCT can then be applied to that column.

#standardSQL
SELECT DISTINCT repo_name_unnest
FROM `bigquery-public-data.github_repos.commits` 
CROSS JOIN UNNEST(repo_name) AS repo_name_unnest;
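The same pattern can be tried on a handful of inline rows before running it against the full table. This is only a sketch: the CTE sample_commits and its values are hypothetical, and the ORDER BY is added just to make the output stable.

```sql
#standardSQL
-- Hypothetical inline rows; 'owner1/project' appears in both arrays.
WITH sample_commits AS (
  SELECT 'commit_a' AS commit, ['owner1/project', 'owner2/project'] AS repo_name
  UNION ALL
  SELECT 'commit_b', ['owner1/project']
)
-- UNNEST turns each array element into its own row (3 rows here),
-- and DISTINCT then removes the duplicate repository name.
SELECT DISTINCT repo_name_unnest
FROM sample_commits
CROSS JOIN UNNEST(repo_name) AS repo_name_unnest
ORDER BY repo_name_unnest;
```

With these sample rows the query returns two rows, owner1/project and owner2/project, because the duplicate coming from commit_b is collapsed after the arrays are flattened.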
