How to calculate median over multiple columns in Google BigQuery?
I'm creating a query to calculate the median number of visits per day for two different websites.
The output should look like the following:
+------------+---------+---------------+
| date | website | median_visits |
+------------+---------+---------------+
| 2019-04-01 | A | median_value |
| 2019-04-01 | B | median_value |
| 2019-04-02 | A | median_value |
| 2019-04-02 | B | median_value |
| 2019-04-03 | A | median_value |
| 2019-04-03 | B | median_value |
+------------+---------+---------------+
Here is what my table looks like (it has 20,000 rows):
+------------+---------+--------+
| date | website | visits |
+------------+---------+--------+
| 2019-04-01 | A | 10.0 |
| 2019-04-01 | B | 14.0 |
| 2019-04-02 | A | 85.0 |
| 2019-04-03 | A | 75.0 |
| 2019-04-02 | B | 3.0 |
| 2019-04-02 | B | 45.0 |
| 2019-04-01 | A | 12.0 |
| 2019-04-03 | A | 44.0 |
| 2019-04-01 | A | 99.0 |
+------------+---------+--------+
What would be the most efficient way to query for the desired output? I am currently using:
SELECT DISTINCT date, website, median_visits
FROM (
  SELECT date, website,
    PERCENTILE_CONT(visits, 0.5) OVER (PARTITION BY date, website) AS median_visits
  FROM table
)
Below is for BigQuery Standard SQL. I cannot claim it is the best. I cannot even guarantee that it is better, but based on my testing I see a better execution plan and lower slot usage. So you can try it and see how it behaves with your data:
#standardSQL
SELECT date, website,
(SELECT PERCENTILE_CONT(visit, 0.5) OVER()
FROM UNNEST(visits) visit LIMIT 1
) AS median_visits
FROM (
SELECT date, website, ARRAY_AGG(visits) visits
FROM `project.dataset.table`
GROUP BY date, website
)
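To try the query above without a real table, here is a minimal sketch that inlines the nine sample rows from the question as a CTE. The CTE name `sample_visits` is an assumption made for illustration, standing in for `project.dataset.table`; the rest is the same ARRAY_AGG plus PERCENTILE_CONT pattern, with an ORDER BY added only to make the output easy to read.

```sql
#standardSQL
-- sample_visits is a hypothetical stand-in for `project.dataset.table`,
-- filled with the sample rows shown in the question.
WITH sample_visits AS (
  SELECT DATE '2019-04-01' AS date, 'A' AS website, 10.0 AS visits UNION ALL
  SELECT DATE '2019-04-01', 'B', 14.0 UNION ALL
  SELECT DATE '2019-04-02', 'A', 85.0 UNION ALL
  SELECT DATE '2019-04-03', 'A', 75.0 UNION ALL
  SELECT DATE '2019-04-02', 'B', 3.0 UNION ALL
  SELECT DATE '2019-04-02', 'B', 45.0 UNION ALL
  SELECT DATE '2019-04-01', 'A', 12.0 UNION ALL
  SELECT DATE '2019-04-03', 'A', 44.0 UNION ALL
  SELECT DATE '2019-04-01', 'A', 99.0
)
SELECT date, website,
  -- PERCENTILE_CONT is an analytic function, so it runs over the
  -- unnested array and LIMIT 1 keeps a single copy of the result.
  (SELECT PERCENTILE_CONT(visit, 0.5) OVER()
   FROM UNNEST(visits) visit LIMIT 1
  ) AS median_visits
FROM (
  -- Collect all visits for each (date, website) pair into one array.
  SELECT date, website, ARRAY_AGG(visits) visits
  FROM sample_visits
  GROUP BY date, website
)
ORDER BY date, website
```

Because PERCENTILE_CONT interpolates linearly, the sample rows should give 12.0 for 2019-04-01 A, 14.0 for 2019-04-01 B, 85.0 for 2019-04-02 A, 24.0 for 2019-04-02 B (midway between 3 and 45), and 59.5 for 2019-04-03 A (midway between 44 and 75). No row appears for 2019-04-03 B, since the sample data has no visits for that pair.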