SQL query validation fails on GCP BigQuery with the github_repos dataset

Question

I would like to get a list of all unique repositories on GitHub using the following query:

SELECT DISTINCT repo_name FROM `bigquery-public-data.github_repos.commits`

However, I get the following error:

Column repo_name of type ARRAY cannot be used in SELECT DISTINCT at [1:17]

The schema says repo_name is of type STRING. What am I doing wrong?

Answer 1

In the table schema, repo_name is defined as a "string" with mode "repeated", which in BigQuery roughly means an ARRAY of STRING. That is why the error reports the column as type ARRAY even though the schema lists its type as STRING.

https://cloud.google.com/bigquery/docs/nested-repeated

See also: What does REPEATED field in Google Bigquery mean?
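
To see what a repeated STRING column looks like without touching the public table, here is a minimal sketch using inline data. The CTE name sample_commits, the column commit_id, and the repository names are hypothetical and only stand in for the real table; the point is that each repo_name value is an array, which is why SELECT DISTINCT rejects it.

```sql
-- Hypothetical inline rows standing in for the commits table.
-- repo_name is an ARRAY<STRING>, mirroring a STRING column with mode REPEATED.
WITH sample_commits AS (
  SELECT 'c1' AS commit_id, ['alice/project', 'bob/project'] AS repo_name
  UNION ALL
  SELECT 'c2', ['alice/project']
)
-- SELECT DISTINCT repo_name FROM sample_commits would fail here for the
-- same reason as in the question: repo_name is an ARRAY.
SELECT
  commit_id,
  repo_name,                               -- one array per row
  ARRAY_LENGTH(repo_name) AS repo_count    -- number of repositories per commit
FROM sample_commits;
```

Each row carries a whole array of repository names, so the column has to be flattened before its individual values can be compared.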

Answer 2

You can use the query below, which flattens repo_name with UNNEST:

SELECT
    commit,
    repo_name
FROM
    `bigquery-public-data.github_repos.commits`,
    UNNEST(repo_name) AS repo_name
WHERE
    commit = 'c87298e36356ac19519a93dee3dfac8ebffe45e8'

This gives a result like the following, with one row per repository that contains the commit:

Row |  commit                                  | repo_name
===================================================================
1   | c87298e36356ac19519a93dee3dfac8ebffe45e8 | noondaysun/sakai
2   | c87298e36356ac19519a93dee3dfac8ebffe45e8 | OpenCollabZA/sakai

Answer 3

As another user posted, the schema of the bigquery-public-data.github_repos.commits table defines the repo_name field as STRING REPEATED, which means that each value of repo_name is an array of strings. You can see this with the following query:

#standardSQL
SELECT repo_name 
FROM `bigquery-public-data.github_repos.commits` 
LIMIT 100;

To find the distinct repository names, use the UNNEST operator to expand each repo_name array into its elements. The following query performs a CROSS JOIN with UNNEST(repo_name), which yields one row per individual repository name in a new field, repo_name_unnest. Because that field is a plain STRING rather than an ARRAY, DISTINCT can be applied to it.

#standardSQL
SELECT DISTINCT repo_name_unnest
FROM `bigquery-public-data.github_repos.commits` 
CROSS JOIN UNNEST(repo_name) AS repo_name_unnest;
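
The same pattern can be tried on a few inline rows before running it against the full public table. This is only a sketch: the CTE sample_commits, the column commit_id, and the repository names are made up for illustration, and the ORDER BY is added just to make the output order predictable.

```sql
-- Hypothetical inline rows; 'alice/project' appears in two commits.
WITH sample_commits AS (
  SELECT 'c1' AS commit_id, ['alice/project', 'bob/project'] AS repo_name
  UNION ALL
  SELECT 'c2', ['alice/project']
)
-- UNNEST turns each array element into its own row,
-- so DISTINCT operates on plain STRING values.
SELECT DISTINCT repo_name_unnest
FROM sample_commits
CROSS JOIN UNNEST(repo_name) AS repo_name_unnest
ORDER BY repo_name_unnest;
```

This returns two rows, alice/project and bob/project: the three array elements are flattened into three rows, and DISTINCT collapses the duplicate.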

Answer 1: score 2, posted 2019-11-15 00:19:34
Answer 2: score 1, posted 2019-11-15 06:16:06
Answer 3: score 1, accepted, posted 2019-11-15 15:44:21
