Athena Presto - multiple columns from long to wide
I am new to Athena and I am trying to understand how to turn multiple columns from long to wide format. It seems like Presto is what is needed, but I have only been able to apply map_agg successfully to one variable. I think the final outcome below can be achieved with multimap_agg, but I cannot quite get it to work.
Below I walk through my steps and data. If you have any suggestions or questions, please let me know!
First, the data starts like this:
id | letter | number | value
------------------------------------
123 | a | 1 | 62
123 | a | 2 | 38
123 | a | 3 | 44
123 | b | 1 | 74
123 | b | 2 | 91
123 | b | 3 | 97
123 | c | 1 | 38
123 | c | 2 | 98
123 | c | 3 | 22
456 | a | 1 | 99
456 | a | 2 | 33
456 | a | 3 | 81
456 | b | 1 | 34
456 | b | 2 | 79
456 | b | 3 | 43
456 | c | 1 | 86
456 | c | 2 | 60
456 | c | 3 | 59
Then I transform the data into the below by filtering with the where clause and then joining:
id | letter | 1 | 2 | 3
----------------------------
123 | a | 62 | 38 | 44
123 | b | 74 | 91 | 97
123 | c | 38 | 98 | 22
456 | a | 99 | 33 | 81
456 | b | 34 | 79 | 43
456 | c | 86 | 60 | 59
For the final outcome, I would like to transform it into the below:
id | a_1 | a_2 | a_3 | b_1 | b_2 | b_3 | c_1 | c_2 | c_3
--------------------------------------------------------------------------
123 | 62 | 38 | 44 | 74 | 91 | 97 | 38 | 98 | 22
456 | 99 | 33 | 81 | 34 | 79 | 43 | 86 | 60 | 59
You can use window functions and conditional aggregation. This requires that you know in advance the possible letters and the maximum number of rows per id/letter tuple:
select
    id,
    max(case when letter = 'a' and rn = 1 then value end) a_1,
    max(case when letter = 'a' and rn = 2 then value end) a_2,
    max(case when letter = 'a' and rn = 3 then value end) a_3,
    max(case when letter = 'b' and rn = 1 then value end) b_1,
    max(case when letter = 'b' and rn = 2 then value end) b_2,
    max(case when letter = 'b' and rn = 3 then value end) b_3,
    max(case when letter = 'c' and rn = 1 then value end) c_1,
    max(case when letter = 'c' and rn = 2 then value end) c_2,
    max(case when letter = 'c' and rn = 3 then value end) c_3
from (
    select
        t.*,
        row_number() over(partition by id, letter order by number) rn
    from mytable t
) t
group by id
Actually, if the numbers are always 1, 2, 3, then you don't even need the window function:
select
    id,
    max(case when letter = 'a' and number = 1 then value end) a_1,
    max(case when letter = 'a' and number = 2 then value end) a_2,
    max(case when letter = 'a' and number = 3 then value end) a_3,
    max(case when letter = 'b' and number = 1 then value end) b_1,
    max(case when letter = 'b' and number = 2 then value end) b_2,
    max(case when letter = 'b' and number = 3 then value end) b_3,
    max(case when letter = 'c' and number = 1 then value end) c_1,
    max(case when letter = 'c' and number = 2 then value end) c_2,
    max(case when letter = 'c' and number = 3 then value end) c_3
from mytable t
group by id
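Since both queries above need every letter/number combination written out by hand, the column list can be generated instead of typed. The following is only a sketch in Python, assuming the table and column names used in the answer (mytable, id, letter, number, value); build_pivot_query is a hypothetical helper name, and it follows the simpler second query (no window function).

```python
def build_pivot_query(letters, numbers, table="mytable"):
    """Return the conditional-aggregation pivot query as a string.

    The letters and numbers are interpolated directly into the SQL text,
    so this assumes they come from a trusted source (for example a prior
    SELECT DISTINCT), not from user input.
    """
    columns = [
        f"  max(case when letter = '{letter}' and number = {number} then value end) as {letter}_{number}"
        for letter in letters
        for number in numbers
    ]
    # "id" is the grouping key and always comes first in the select list.
    select_list = ",\n".join(["  id"] + columns)
    return f"select\n{select_list}\nfrom {table}\ngroup by id"


query = build_pivot_query(["a", "b", "c"], [1, 2, 3])
print(query)
```

The generated text can then be submitted to Athena like any other query. The set of output columns is still fixed at the time the query string is built, which is the same limitation the hand-written version has.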
Athena needs the columns to be known at query time, but the next best thing is using a map, as you hint at in your question.
One way to achieve the results you are after is this query (the_table refers to the first table in your question, the one with id, letter, number, and value columns):
SELECT
    id,
    map_agg(letter || '_' || CAST(number AS varchar), value) AS letter_number_value
FROM the_table
GROUP BY id
Which gives this result:
id | letter_number_value
----+-------------------------------------------------------------------------
123 | {a_1=62, a_2=38, a_3=44, b_1=74, b_2=91, b_3=97, c_1=38, c_2=98, c_3=22}
456 | {a_1=99, a_2=33, a_3=81, b_1=34, b_2=79, b_3=43, c_1=86, c_2=60, c_3=59}
I cheated slightly by manually sorting the map keys. If you run the query they will end up in arbitrary order, but I figured that this way it is easier to see that the result is the desired one.
Please note that this assumes there are no duplicate letter/number combinations. If there are, I think it is undefined which value will end up in the result.
Also note that Athena's output format for maps is ambiguous and that there are situations where you can end up with unparseable results (for example when keys or values include equal signs or commas). Therefore I would recommend casting the map as JSON and using a JSON parser in your application code, e.g. CAST(map_agg(…) AS JSON).
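On the application side, the JSON-cast column can be turned back into the wide shape from the question. This is a minimal sketch in Python, assuming each result row reaches the application as an id plus the map serialized as a JSON string; the rows literal and the to_wide helper are hypothetical and simply reuse the values from the result shown above.

```python
import json

# Assumed shape of the rows the application receives: (id, JSON string).
# The second row has its keys shuffled on purpose, since the query does
# not guarantee any key order.
rows = [
    ("123", '{"a_1":62,"a_2":38,"a_3":44,"b_1":74,"b_2":91,"b_3":97,"c_1":38,"c_2":98,"c_3":22}'),
    ("456", '{"c_3":59,"b_2":79,"a_1":99,"a_3":81,"c_1":86,"b_1":34,"a_2":33,"b_3":43,"c_2":60}'),
]


def to_wide(rows):
    """Parse each JSON map and flatten it into one dict per id."""
    wide = []
    for row_id, map_json in rows:
        values = json.loads(map_json)
        record = {"id": row_id}
        # Sort the keys so every record has the same column order.
        for key in sorted(values):
            record[key] = values[key]
        wide.append(record)
    return wide


wide = to_wide(rows)
for record in wide:
    print(record)
```

Sorting the keys in the application gives a stable column order regardless of the order in which the map entries come back.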