Google BigQuery SQL Statement
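I'm trying to use Google BigQuery to pull some data from the GitHub Archive. The amount of data I'm currently requesting is too large for BigQuery to handle (at least on the free tier), so I'm trying to narrow the scope of the request.
I want to limit the data so that historical data is returned only for repositories that currently have more than 1000 stars. This is more complicated than just adding "repository_watchers > 1000", because that filter would exclude the historical data for the first 1000 stars each repository received.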
SELECT repository_name, repository_owner, created_at, type, repository_url, repository_watchers
FROM [githubarchive:github.timeline]
WHERE type="WatchEvent"
ORDER BY created_at DESC
Edit: the solution I used (based on @Brian's answer)
select y.repository_name, y.repository_owner, y.created_at, y.type, y.repository_url, y.repository_watchers
from [githubarchive:github.timeline] y
join (select repository_url, max(repository_watchers)
from [githubarchive:github.timeline] x
where x.type = 'WatchEvent'
group by repository_url
having max(repository_watchers) > 1000) x
on y.repository_url = x.repository_url
where y.type = 'WatchEvent'
order by y.repository_name, y.repository_owner, y.created_at desc
Try:
select y.*
from [githubarchive:github.timeline] y
join (select repository_name, max(repository_watchers)
from [githubarchive:github.timeline]
where type = 'WatchEvent'
group by repository_name
having max(repository_watchers) > 1000) x
on y.repository_name = x.repository_name
order by y.created_at desc
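To see why the join keeps the early history, here is a minimal sketch that runs the same pattern against a toy table in SQLite through Python. It is not BigQuery: the table name `timeline`, the alias `max_watchers`, and all rows are made up for illustration, and the outer `type` filter mirrors the asker's edited query rather than this answer's.

```python
import sqlite3

# Toy stand-in for the GitHub timeline table; all rows are invented.
rows = [
    ("big", "WatchEvent", "2012-01-01", 10),
    ("big", "WatchEvent", "2012-06-01", 900),
    ("big", "WatchEvent", "2013-01-01", 1500),
    ("small", "WatchEvent", "2012-03-01", 5),
    ("small", "WatchEvent", "2013-02-01", 40),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timeline ("
    "repository_name TEXT, type TEXT, created_at TEXT, "
    "repository_watchers INTEGER)"
)
conn.executemany("INSERT INTO timeline VALUES (?, ?, ?, ?)", rows)

# The subquery picks repositories whose peak watcher count exceeds 1000;
# the join then returns every WatchEvent row for those repositories,
# including rows from before they reached 1000 watchers.
query = """
SELECT y.repository_name, y.created_at, y.repository_watchers
FROM timeline y
JOIN (SELECT repository_name, MAX(repository_watchers) AS max_watchers
      FROM timeline
      WHERE type = 'WatchEvent'
      GROUP BY repository_name
      HAVING MAX(repository_watchers) > 1000) x
  ON y.repository_name = x.repository_name
WHERE y.type = 'WatchEvent'
ORDER BY y.created_at DESC
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

A plain `repository_watchers > 1000` filter would return only the last "big" row. The join returns all three, while "small" is dropped entirely.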
If that syntax isn't supported, you can use the following three-step solution:
Step 1: Find the REPOSITORY_NAME values that have at least one record with REPOSITORY_WATCHERS > 1000
select repository_name, max(repository_watchers) as curr_watchers
from [githubarchive:github.timeline]
where type = 'WatchEvent'
group by repository_name
having max(repository_watchers) > 1000
Step 2: Store that result as a table and name it SUB
Step 3: Run the following against SUB (and your original table)
select y.*
from [githubarchive:github.timeline] y
join sub x
on y.repository_name = x.repository_name
order by y.created_at desc
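The three steps can be sketched the same way on toy data. This again uses SQLite through Python, not BigQuery. `CREATE TABLE ... AS` is SQLite's way of storing a result, so saving the step 1 result as SUB in BigQuery would have to use BigQuery's own mechanism. The `timeline` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timeline ("
    "repository_name TEXT, type TEXT, created_at TEXT, "
    "repository_watchers INTEGER)"
)
# Invented rows; one non-WatchEvent row shows what step 3 returns.
conn.executemany(
    "INSERT INTO timeline VALUES (?, ?, ?, ?)",
    [
        ("big", "WatchEvent", "2012-01-01", 10),
        ("big", "PushEvent", "2012-02-01", 20),
        ("big", "WatchEvent", "2012-06-01", 900),
        ("big", "WatchEvent", "2013-01-01", 1500),
        ("small", "WatchEvent", "2012-03-01", 5),
    ],
)

# Steps 1 and 2: find the qualifying repositories and store them as sub.
conn.execute("""
CREATE TABLE sub AS
SELECT repository_name, MAX(repository_watchers) AS curr_watchers
FROM timeline
WHERE type = 'WatchEvent'
GROUP BY repository_name
HAVING MAX(repository_watchers) > 1000
""")
sub_rows = conn.execute("SELECT * FROM sub").fetchall()

# Step 3: join the original table against sub to pull the full history.
history = conn.execute("""
SELECT y.*
FROM timeline y
JOIN sub x ON y.repository_name = x.repository_name
ORDER BY y.created_at DESC
""").fetchall()

print(sub_rows)
for row in history:
    print(row)
```

Because step 3 has no `type` filter, it also returns the PushEvent row for "big". Add `where y.type = 'WatchEvent'` to the final query if only star events are wanted.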