
Convert String to Map in Spark

The data below is in a CSV file with the delimiter |. I want to convert the string in the PersonalInfo column to a Map so that I can extract the required information.

I tried to convert the CSV below to Parquet format, turning the String column into a Map with cast, but I got a data type mismatch error.

Below is the data for your reference. Your help is much appreciated.

Empcode EmpName PersonalInfo
1       abc     """email"":""abc@gmail.com"",""Location"":""India"",""Gender"":""Male"""
2       xyz     """email"":""xyz@gmail.com"",""Location"":""US"""
3       pqr     """email"":""abc@gmail.com"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234"""

Thanks

One simple way is to use the str_to_map function after removing the double quotes from the PersonalInfo column:

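import org.apache.spark.sql.functions.expr

// Strip every double quote, then let str_to_map split on its default
// delimiters: "," between pairs and ":" between key and value.
val df1 = df.withColumn(
  "PersonalInfo",
  expr("str_to_map(regexp_replace(PersonalInfo, '\"', ''))")
)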

df1.show(false)

//+-------+-------+------------------------------------------------------------------------------+
//|Empcode|EmpName|PersonalInfo                                                                  |
//+-------+-------+------------------------------------------------------------------------------+
//|1      |abc    |{email -> abc@gmail.com, Location -> India, Gender -> Male}                   |
//|2      |xyz    |{email -> xyz@gmail.com, Location -> US}                                      |
//|3      |pqr    |{email -> abc@gmail.com, Gender -> Female, Location -> Europe, Mobile -> 1234}|
//+-------+-------+------------------------------------------------------------------------------+
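
Since the goal in the question is to extract specific fields, here is a minimal sketch of reading individual keys out of the resulting map column with getItem. The local SparkSession, the two sample rows, and the name extracted are assumptions made for illustration; they are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

// Assumption: a throwaway local session; in spark-shell the existing one is reused.
val spark = SparkSession.builder().master("local[1]").appName("str-to-map-sketch").getOrCreate()
import spark.implicits._

// Sample rows with single (already unescaped) double quotes, as in the answer above.
val df = Seq(
  (1, "abc", "\"email\":\"abc@gmail.com\",\"Location\":\"India\",\"Gender\":\"Male\""),
  (2, "xyz", "\"email\":\"xyz@gmail.com\",\"Location\":\"US\"")
).toDF("Empcode", "EmpName", "PersonalInfo")

// Same conversion as above: drop the quotes, then parse with str_to_map.
val df1 = df.withColumn(
  "PersonalInfo",
  expr("str_to_map(regexp_replace(PersonalInfo, '\"', ''))")
)

// Pull selected keys out of the map into ordinary string columns.
val extracted = df1.select(
  col("Empcode"),
  col("PersonalInfo").getItem("email").as("email"),
  col("PersonalInfo").getItem("Gender").as("Gender")
)

extracted.show(false)
```

For a row that lacks a key (Gender for Empcode 2 here), the lookup generally yields null rather than failing, although this may depend on the Spark version and ANSI settings. Note also that this approach relies on the values themselves containing no "," or ":" characters.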

If you want to create a map from the PersonalInfo column, from Spark 3.0 onward you can proceed as follows:

  • Split your string on "","" using the split function
  • For each element of the resulting string array, create a sub-array by splitting on "":"" using the split function
  • Remove all "" from the elements of the sub-arrays using the regexp_replace function
  • Build map entries using the struct function
  • Use map_from_entries to build the map from your array of entries
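
To see what each of the steps above produces, they can be traced on a single value in plain Scala, without Spark. This is only an illustration: the names raw, pairs, entries and personalInfo are made up for the sketch, and String.split / replace stand in for the Spark functions of the same purpose.

```scala
// One PersonalInfo value with the doubled quotes still in place.
// A single quote is used as a stand-in for "" to keep the literal readable.
val raw: String = "'email':'xyz@gmail.com','Location':'US'".replace("'", "\"\"")

// Step 1: split on "","" to get one string per key/value pair.
val pairs: Array[String] = raw.split("\"\",\"\"")

// Steps 2 and 3: split each pair on "":"" and strip the leftover "" from both halves.
val entries: Array[(String, String)] = pairs.map { item =>
  val kv = item.split("\"\":\"\"")
  (kv(0).replace("\"\"", ""), kv(1).replace("\"\"", ""))
}

// Steps 4 and 5: the tuples play the role of struct entries, and toMap mirrors map_from_entries.
val personalInfo: Map[String, String] = entries.toMap

println(personalInfo) // Map(email -> xyz@gmail.com, Location -> US)
```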

Complete code is as follows:

import org.apache.spark.sql.functions.{col, map_from_entries, regexp_replace, split, struct, transform}

val result = data.withColumn("PersonalInfo",
  // Turn the array of (key, value) structs into a map
  map_from_entries(
    transform(
      // One array element per key/value pair
      split(col("PersonalInfo"), "\"\",\"\""),
      // Split each pair into key and value, removing any remaining ""
      item => struct(
        regexp_replace(split(item, "\"\":\"\"")(0), "\"\"", ""),
        regexp_replace(split(item, "\"\":\"\"")(1), "\"\"", "")
      )
    )
  )
)

With the following input dataframe (data in the code above):

+-------+-------+---------------------------------------------------------------------------------------------+
|Empcode|EmpName|PersonalInfo                                                                                 |
+-------+-------+---------------------------------------------------------------------------------------------+
|1      |abc    |""email"":""abc@gmail.com"",""Location"":""India"",""Gender"":""Male""                       |
|2      |xyz    |""email"":""xyz@gmail.com"",""Location"":""US""                                              |
|3      |pqr    |""email"":""abc@gmail.com"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234""|
+-------+-------+---------------------------------------------------------------------------------------------+

You get the following result dataframe:

+-------+-------+------------------------------------------------------------------------------+
|Empcode|EmpName|PersonalInfo                                                                  |
+-------+-------+------------------------------------------------------------------------------+
|1      |abc    |{email -> abc@gmail.com, Location -> India, Gender -> Male}                   |
|2      |xyz    |{email -> xyz@gmail.com, Location -> US}                                      |
|3      |pqr    |{email -> abc@gmail.com, Gender -> Female, Location -> Europe, Mobile -> 1234}|
+-------+-------+------------------------------------------------------------------------------+
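
Because the original goal was a Parquet file, here is a hedged sketch of writing the map-typed result out and reading it back to confirm the column type. The local SparkSession, the helper dq, the two sample rows, and the temporary output path are all assumptions made for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, map_from_entries, regexp_replace, split, struct, transform}
import org.apache.spark.sql.types.MapType

// Assumption: a throwaway local session; in spark-shell the existing one is reused.
val spark = SparkSession.builder().master("local[1]").appName("map-to-parquet-sketch").getOrCreate()
import spark.implicits._

// Hypothetical helper: lets the sample rows use ' where the real data has "".
def dq(s: String): String = s.replace("'", "\"\"")

val data = Seq(
  (1, "abc", dq("'email':'abc@gmail.com','Location':'India','Gender':'Male'")),
  (2, "xyz", dq("'email':'xyz@gmail.com','Location':'US'"))
).toDF("Empcode", "EmpName", "PersonalInfo")

// Same transformation as in the answer above.
val result = data.withColumn("PersonalInfo",
  map_from_entries(
    transform(
      split(col("PersonalInfo"), "\"\",\"\""),
      item => struct(
        regexp_replace(split(item, "\"\":\"\"")(0), "\"\"", ""),
        regexp_replace(split(item, "\"\":\"\"")(1), "\"\"", "")
      )
    )
  )
)

// Hypothetical output location inside a fresh temporary directory.
val outPath: String = Files.createTempDirectory("personal-info-").resolve("emp.parquet").toString
result.write.parquet(outPath)

// Read the file back and inspect the schema; PersonalInfo should be a map column.
val reloaded = spark.read.parquet(outPath)
reloaded.printSchema()
```

In a real job the output path would point at the target storage location instead of a temporary directory.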
