SQL Query validation failure on GCP BigQuery with github_repos dataset
I would like to get a list of all unique repositories on GitHub using the following query:
SELECT DISTINCT repo_name FROM `bigquery-public-data.github_repos.commits`
However, I get the following error:
Column repo_name of type ARRAY cannot be used in SELECT DISTINCT at [1:17]
The schema says repo_name is of type STRING. What am I doing wrong?
repo_name is defined as a STRING with mode REPEATED in the table schema, which in BigQuery roughly means an ARRAY of STRING. That is why SELECT DISTINCT rejects it.
https://cloud.google.com/bigquery/docs/nested-repeated
See also: What does REPEATED field in Google Bigquery mean?
You can use the query below:
SELECT
commit
, repo_name
FROM
`bigquery-public-data.github_repos.commits`,
UNNEST(repo_name) as repo_name
WHERE
commit = 'c87298e36356ac19519a93dee3dfac8ebffe45e8'
This gives a result like the following:
Row | commit | repo_name
===================================================================
1 | c87298e36356ac19519a93dee3dfac8ebffe45e8 | noondaysun/sakai
2 | c87298e36356ac19519a93dee3dfac8ebffe45e8 | OpenCollabZA/sakai
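The flattening step above can be tried on a few inline rows, without scanning the public table. This is only a minimal sketch: the CTE name inline_commits, the column commit_sha, the alias single_repo_name, and the repository names are all made up for illustration.

```sql
#standardSQL
-- Hypothetical inline rows standing in for the commits table.
-- repo_name is an ARRAY<STRING>, like the REPEATED column in the real schema.
WITH inline_commits AS (
  SELECT 'c1' AS commit_sha, ['alice/project', 'bob/project'] AS repo_name
  UNION ALL
  SELECT 'c2' AS commit_sha, ['alice/project'] AS repo_name
)
SELECT
  commit_sha,
  single_repo_name
FROM
  inline_commits,
  UNNEST(repo_name) AS single_repo_name  -- the comma acts as an implicit CROSS JOIN
ORDER BY
  commit_sha, single_repo_name
```

Each array element becomes its own row, so the commit with two repository names appears twice and the query returns three rows in total.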
As another user posted, the schema of the bigquery-public-data.github_repos.commits table defines the repo_name field as STRING REPEATED, which means that each entry of repo_name is an array of string elements. You can see this with the following query:
#standardSQL
SELECT repo_name
FROM `bigquery-public-data.github_repos.commits`
LIMIT 100;
To find the distinct repo names, you can use the UNNEST operator to expand each of the repo_name elements. The following query performs a CROSS JOIN that adds a new field, repo_name_unnest, holding the individual repository names. Because that field is a plain STRING rather than an ARRAY, DISTINCT can be applied to it.
#standardSQL
SELECT DISTINCT repo_name_unnest
FROM `bigquery-public-data.github_repos.commits`
CROSS JOIN UNNEST(repo_name) AS repo_name_unnest;
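To check the deduplication without reading the full commits table, the same pattern can be run against inline data. The sketch below assumes a hypothetical CTE named inline_commits and made-up repository names; only the CROSS JOIN UNNEST and DISTINCT usage mirrors the query above.

```sql
#standardSQL
-- Hypothetical inline rows; each repo_name is an ARRAY<STRING>,
-- mimicking the REPEATED column of the real table.
WITH inline_commits AS (
  SELECT ['alice/project', 'bob/project'] AS repo_name
  UNION ALL
  SELECT ['alice/project']
)
SELECT DISTINCT repo_name_unnest
FROM inline_commits
CROSS JOIN UNNEST(repo_name) AS repo_name_unnest
ORDER BY repo_name_unnest;
```

The repository name that occurs in both arrays is returned only once, so the result has two rows.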