Convert String to Map in Spark
Below is data in a csv file with delimiter |. I want to convert the string in the PersonalInfo column to a Map so that I can extract the required information.
I tried to convert the csv below to parquet format, converting the String to a Map using cast, but I got a datatype mismatch error. Below is the data for your reference. Your help is much appreciated.
Empcode EmpName PersonalInfo
1 abc """email"":""abc@gmail.com"",""Location"":""India"",""Gender"":""Male"""
2 xyz """email"":""xyz@gmail.com"",""Location"":""US"""
3 pqr """email"":""abc@gmail.com"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234"""
Thanks
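For reference, the target transformation can be sketched in plain Scala on a single value, without Spark. `parsePersonalInfo` is a hypothetical helper (not part of the question's code), and it assumes the `""key"":""value""` layout shown in the sample data:

```scala
// Sketch: parse one PersonalInfo string (as it appears in the csv sample)
// into a Scala Map. `parsePersonalInfo` is a hypothetical helper name,
// assuming the ""key"":""value"",""key"":""value"" layout from the question.
def parsePersonalInfo(raw: String): Map[String, String] =
  raw
    .replace("\"", "")                        // drop all double quotes
    .split(",")                               // one element per key:value pair
    .map { entry =>
      val Array(k, v) = entry.split(":", 2)   // split on the first ':' only
      k -> v
    }
    .toMap

val sample =
  "\"\"email\"\":\"\"abc@gmail.com\"\",\"\"Location\"\":\"\"India\"\",\"\"Gender\"\":\"\"Male\"\""
println(parsePersonalInfo(sample))
```

Spark cannot `cast` a plain string column to `MapType` directly, which is why the attempt above fails; the answers below do the parsing explicitly.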
One simple way is to use the str_to_map function after you get rid of the double quotes in the PersonalInfo column:
import org.apache.spark.sql.functions.expr

val df1 = df.withColumn(
  "PersonalInfo",
  expr("str_to_map(regexp_replace(PersonalInfo, '\"', ''))")
)
df1.show(false)
//+-------+-------+------------------------------------------------------------------------------+
//|Empcode|EmpName|PersonalInfo |
//+-------+-------+------------------------------------------------------------------------------+
//|1 |abc |{email -> abc@gmail.com, Location -> India, Gender -> Male} |
//|2 |xyz |{email -> xyz@gmail.com, Location -> US} |
//|3 |pqr |{email -> abc@gmail.com, Gender -> Female, Location -> Europe, Mobile -> 1234}|
//+-------+-------+------------------------------------------------------------------------------+
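As a sanity check of what `str_to_map` does with its default delimiters (`,` between pairs, `:` between key and value), the same transformation can be mimicked in plain Scala. This is a sketch only, not Spark's implementation; note that Spark treats the delimiters as regular expressions, while this mimic treats them literally:

```scala
import java.util.regex.Pattern

// Plain-Scala mimic of SQL str_to_map(text, pairDelim, keyValueDelim).
// Defaults match Spark's (',' and ':'). Delimiters are treated literally
// here, whereas Spark interprets them as regexes.
def strToMap(text: String,
             pairDelim: String = ",",
             keyValueDelim: String = ":"): Map[String, String] =
  text.split(Pattern.quote(pairDelim))
    .map { pair =>
      val parts = pair.split(Pattern.quote(keyValueDelim), 2)
      parts(0) -> (if (parts.length > 1) parts(1) else null)
    }
    .toMap

// Quotes already stripped, as in the answer above:
val cleaned = "email:abc@gmail.com,Location:India,Gender:Male"
println(strToMap(cleaned))
```

Custom delimiters work the same way, e.g. `strToMap("a=1;b=2", ";", "=")`.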
If you want to create a map from the PersonalInfo column, from Spark 3.0 you can proceed as follows:

- Split the string on "","" using the split function
- Create sub-arrays by splitting each element on "":"" using the split function
- Remove the remaining "" from the elements of the sub-arrays using the regexp_replace function
- Build map entries from the sub-arrays using the struct function
- Use map_from_entries to build the map from your array of entries

Complete code is as follows:
import org.apache.spark.sql.functions.{col, map_from_entries, regexp_replace, split, struct, transform}
val result = data.withColumn("PersonalInfo",
  map_from_entries(
    transform(
      split(col("PersonalInfo"), "\"\",\"\""),
      item => struct(
        regexp_replace(split(item, "\"\":\"\"")(0), "\"\"", ""),
        regexp_replace(split(item, "\"\":\"\"")(1), "\"\"", "")
      )
    )
  )
)
With the following input_dataframe:
+-------+-------+---------------------------------------------------------------------------------------------+
|Empcode|EmpName|PersonalInfo |
+-------+-------+---------------------------------------------------------------------------------------------+
|1 |abc |""email"":""abc@gmail.com"",""Location"":""India"",""Gender"":""Male"" |
|2 |xyz |""email"":""xyz@gmail.com"",""Location"":""US"" |
|3 |pqr |""email"":""abc@gmail.com"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234""|
+-------+-------+---------------------------------------------------------------------------------------------+
You get the following result dataframe:
+-------+-------+------------------------------------------------------------------------------+
|Empcode|EmpName|PersonalInfo |
+-------+-------+------------------------------------------------------------------------------+
|1 |abc |{email -> abc@gmail.com, Location -> India, Gender -> Male} |
|2 |xyz |{email -> xyz@gmail.com, Location -> US} |
|3 |pqr |{email -> abc@gmail.com, Gender -> Female, Location -> Europe, Mobile -> 1234}|
+-------+-------+------------------------------------------------------------------------------+
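The listed steps can be traced in plain Scala on one sample value. This is a sketch of the logic only, using `Array` as a stand-in for the Spark array column and tuples for the struct entries; in the real pipeline everything runs column-wise in Spark:

```scala
// Trace the answer's pipeline on one PersonalInfo value, step by step.
val raw = "\"\"email\"\":\"\"abc@gmail.com\"\",\"\"Location\"\":\"\"US\"\""

// 1. split on "","" -> one element per key/value pair
val pairs = raw.split("\"\",\"\"")

// 2. split each element on "":"" -> (key, value) sub-array
// 3. strip the leftover "" at the edges (regexp_replace in Spark)
// 4. struct(key, value) ~ a tuple here
val entries = pairs.map { item =>
  val kv = item.split("\"\":\"\"")
  (kv(0).replace("\"\"", ""), kv(1).replace("\"\"", ""))
}

// 5. map_from_entries ~ toMap
val personalInfo = entries.toMap
println(personalInfo)
```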