athena presto - multiple columns from long to wide

I am new to Athena and I am trying to understand how to turn multiple columns from long to wide format. It seems like Presto is what is needed, but I have only been able to apply map_agg to one variable. I think the final outcome below can be achieved with multimap_agg, but I cannot quite get it to work.

Below I walk through my steps and data. If you have any suggestions or questions, please let me know!

First, the data starts like this:

id  | letter    | number   | value
------------------------------------
123 | a         | 1        | 62
123 | a         | 2        | 38
123 | a         | 3        | 44
123 | b         | 1        | 74
123 | b         | 2        | 91
123 | b         | 3        | 97
123 | c         | 1        | 38
123 | c         | 2        | 98
123 | c         | 3        | 22
456 | a         | 1        | 99
456 | a         | 2        | 33
456 | a         | 3        | 81
456 | b         | 1        | 34
456 | b         | 2        | 79
456 | b         | 3        | 43
456 | c         | 1        | 86
456 | c         | 2        | 60
456 | c         | 3        | 59

Then I transform the data into the below by filtering with the where clause and then joining:

id  | letter  | 1  | 2  | 3
----------------------------
123 | a       | 62 | 38 | 44
123 | b       | 74 | 91 | 97
123 | c       | 38 | 98 | 22
456 | a       | 99 | 33 | 81
456 | b       | 34 | 79 | 43
456 | c       | 86 | 60 | 59

For the final outcome, I would like to transform it into the below:

id  | a_1   | a_2   | a_3   | b_1   | b_2   | b_3   | c_1   | c_2   | c_3
--------------------------------------------------------------------------
123 | 62    | 38    | 44    | 74    | 91    | 97    | 38    | 98    | 22
456 | 99    | 33    | 81    | 34    | 79    | 43    | 86    | 60    | 59

You can use window functions and conditional aggregation. This requires that you know in advance the possible letters, and the maximum number of rows per id/letter tuple:

select
    id,
    max(case when letter = 'a' and rn = 1 then value end) a_1,
    max(case when letter = 'a' and rn = 2 then value end) a_2,
    max(case when letter = 'a' and rn = 3 then value end) a_3,
    max(case when letter = 'b' and rn = 1 then value end) b_1,
    max(case when letter = 'b' and rn = 2 then value end) b_2,
    max(case when letter = 'b' and rn = 3 then value end) b_3,
    max(case when letter = 'c' and rn = 1 then value end) c_1,
    max(case when letter = 'c' and rn = 2 then value end) c_2,
    max(case when letter = 'c' and rn = 3 then value end) c_3
from (
    select 
        t.*, 
        row_number() over(partition by id, letter order by number) rn
    from mytable t
) t
group by id

Actually, if the numbers are always 1, 2, 3, then you don't even need the window function:

select
    id,
    max(case when letter = 'a' and number = 1 then value end) a_1,
    max(case when letter = 'a' and number = 2 then value end) a_2,
    max(case when letter = 'a' and number = 3 then value end) a_3,
    max(case when letter = 'b' and number = 1 then value end) b_1,
    max(case when letter = 'b' and number = 2 then value end) b_2,
    max(case when letter = 'b' and number = 3 then value end) b_3,
    max(case when letter = 'c' and number = 1 then value end) c_1,
    max(case when letter = 'c' and number = 2 then value end) c_2,
    max(case when letter = 'c' and number = 3 then value end) c_3
from mytable t
group by id
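
Because the column list has to be spelled out by hand, it can be convenient to generate the query text in application code once the letters and numbers are known. The following is only a sketch in Python: build_pivot_query is a hypothetical helper, not part of Athena or any library, and it assumes the table and column names from the example above.

```python
# Hypothetical helper that builds the conditional-aggregation query as text.
# Assumptions: the table is called "mytable" and has the columns id, letter,
# number and value, and the letters and numbers are known in advance.
# The values are interpolated directly into the SQL, so only pass trusted input.

def build_pivot_query(letters, numbers, table="mytable"):
    columns = [
        f"    max(case when letter = '{letter}' and number = {number} then value end) {letter}_{number}"
        for letter in letters
        for number in numbers
    ]
    select_list = ",\n".join(columns)
    return f"select\n    id,\n{select_list}\nfrom {table} t\ngroup by id"


query = build_pivot_query(["a", "b", "c"], [1, 2, 3])
print(query)
```

The generated text has the same shape as the query above, with one max(case ...) column per letter/number pair.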

Athena needs the columns to be known at query time, but the next best thing is using a map, as you hint at in your question.

One way to achieve the results you are after is this query (the_table refers to the first table in your question, the one with id, letter, number, and value columns):

SELECT
  id,
  map_agg(letter || '_' || CAST(number AS varchar), value) AS letter_number_value
FROM the_table
GROUP BY id

Which gives this result:

id  | letter_number_value
----+-------------------------------------------------------------------------
123 | {a_1=62, a_2=38, a_3=44, b_1=74, b_2=91, b_3=97, c_1=38, c_2=98, c_3=22}
456 | {a_1=99, a_2=33, a_3=81, b_1=34, b_2=79, b_3=43, c_1=86, c_2=60, c_3=59}

I cheated slightly by manually sorting the map keys. If you run the query they will end up in arbitrary order, but I figured that this way it is easier to see that the result is the desired one.

Please note that this assumes there are no duplicate letter/number combinations within an id. If there are, I think it's undefined which value will end up in the result.

Also note that Athena's output format for maps is ambiguous and that there are situations where you can end up with unparseable results (for example when keys or values include equal signs or commas). Therefore I would recommend casting the map to JSON and using a JSON parser in your application code, e.g. CAST(map_agg(…) AS JSON).
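
As a rough illustration of the application side, here is a minimal Python sketch that turns one JSON-cast map back into a wide row. The raw string is typed by hand as an assumed example of what the column might contain for id 123, and parse_wide_row is a hypothetical helper name.

```python
import json

# Assumed example of the JSON-cast map for id 123, with keys deliberately
# out of order to mimic the arbitrary ordering of map keys.
raw = '{"c_3": 22, "a_1": 62, "b_2": 91, "a_2": 38, "a_3": 44, "b_1": 74, "b_3": 97, "c_1": 38, "c_2": 98}'


def parse_wide_row(row_id, raw_json):
    """Parse the JSON map and return one wide row with keys in sorted order."""
    values = json.loads(raw_json)
    # sorted() is lexicographic, which is fine for single-digit suffixes;
    # with suffixes like "a_10" a custom sort key would be needed.
    return {"id": row_id, **{key: values[key] for key in sorted(values)}}


row = parse_wide_row(123, raw)
print(row)
```

Sorting the keys in the application avoids relying on the order in which the map happens to be returned.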
