pyspark hive sql: convert array(map(varchar, varchar)) to string, row by row

Question

I want to convert a column of type

   array(map(varchar, varchar))

into strings, one per row, for a table on Presto DB, programmatically via pyspark hive sql from a Jupyter notebook (python3).

Example

user_id     sport_ids
 'aca'       [ {'sport_id': '5818'}, {'sport_id': '6712'}, {'sport_id': '1065'} ]

Expected result

  user_id     sport_ids
  'aca'       '5818'
  'aca'       '6712'
  'aca'       '1065'

I have tried

     sql_q= """
            select distinct, user_id, transform(sport_ids, x -> element_at(x, 'sport_id')
            from tab """
            
     spark.sql(sql_q)

But I got an error:

   '->' cannot be resolved

I also tried

  sql_q= """
            select distinct, user_id, sport_ids
            from tab"""
            
     spark.sql(sql_q)

But I got an error:

    org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column request_features[0] is map<string,string>;;

What am I missing?

I have already tried these, but they did not help: "hive convert array<map<string, string>> to string" and "Extract map(varchar, array(varchar)) - Hive SQL".

Thanks

Answer 1

Let's use a higher-order function to pull the value out of each map, then explode the resulting array into separate rows:

from pyspark.sql.functions import explode, expr
df.withColumn('sport_ids', explode(expr("transform(sport_ids, x -> map_values(x)[0])"))).show()


+-------+---------+
|user_id|sport_ids|
+-------+---------+
|    aca|     5818|
|    aca|     6712|
|    aca|     1065|
+-------+---------+
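For readers who want to stay in spark.sql, as in the question, the same flattening can probably be written with LATERAL VIEW explode, which does not rely on the `->` lambda syntax. This is only a sketch and has not been run against the asker's cluster. It assumes the table is called tab, as in the question, and that every map has a 'sport_id' key. The names QUERY, explode_sport_ids, t and sport are made up for illustration. Because the exploded column is a plain string, DISTINCT should no longer be applied to a map-typed column.

```python
# Hypothetical Spark SQL variant of the DataFrame answer above.
# Assumptions: a table named `tab` with columns user_id and
# sport_ids (array<map<string,string>>), each map keyed by 'sport_id'.
QUERY = """
    SELECT DISTINCT user_id, sport['sport_id'] AS sport_id
    FROM tab
    LATERAL VIEW explode(sport_ids) t AS sport
"""


def explode_sport_ids(spark):
    """Run QUERY on an existing SparkSession and return the resulting DataFrame."""
    return spark.sql(QUERY)
```

With a live session this would be called as explode_sport_ids(spark).show().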

Answer 2

You can treat the data as JSON (json_parse, a cast to an array of json, and json_extract_scalar; more JSON functions are listed in the Presto documentation) and flatten it (unnest) on the Presto side:

-- sample data
WITH dataset(user_id, sport_ids) AS (
    VALUES 
        ('aca', '[ {"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"} ]')
) 

-- query
select user_id,
    json_extract_scalar(record, '$.sport_id') sport_id
from dataset,
    unnest(cast(json_parse(sport_ids) as array(json))) as t(record)

Output:

user_id	sport_id
aca	5818
aca	6712
aca	1065
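To check the expected shape without a Presto connection, the query's logic can be modelled in plain Python. This is an illustrative stand-in rather than Presto itself: json.loads plays the role of json_parse, the inner loop plays the role of unnest, and dict.get approximates json_extract_scalar by returning None where Presto would return NULL. The names unnest_sport_ids, dataset and result are hypothetical.

```python
import json


def unnest_sport_ids(rows):
    """Yield one (user_id, sport_id) pair per element of the JSON array."""
    for user_id, sport_ids_json in rows:
        # Parse the JSON text into a list of dicts (stand-in for json_parse + cast).
        for record in json.loads(sport_ids_json):
            # One output row per array element (stand-in for unnest).
            yield user_id, record.get("sport_id")


# Same sample row as in the SQL above.
dataset = [
    ("aca", '[ {"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"} ]'),
]

result = list(unnest_sport_ids(dataset))
for user_id, sport_id in result:
    print(user_id, sport_id)
```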

Answer 1
0 2022-07-06 20:57:54

Answer 2
0 2022-07-07 15:32:14
