Returning multiple columns from a single pyspark dataframe
I am trying to parse a single column of a pyspark dataframe and get a dataframe with multiple columns. My dataframe is as follows:
a b dic
0 1 2 {'d': 1, 'e': 2}
1 3 4 {'d': 7, 'e': 0}
2 5 6 {'d': 5, 'e': 4}
I want to parse the dic column and get a dataframe as follows. I would like to use a pandas UDF if possible. My intended output is as follows:
a b d e
0 1 2 1 2
1 3 4 7 0
2 5 6 5 4
Here is my attempt at a solution:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("c", IntegerType()),
    StructField("d", IntegerType())])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def do_something(dic_col):
    return pd.DataFrame(dic_col)

df.apply(do_something).show(10)
But this gives the error: 'DataFrame' object has no attribute 'apply'.
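For context, a GROUPED_MAP pandas UDF is applied through df.groupBy(...).apply(...), not through df.apply(...), which is why the attribute error above appears. Below is a minimal sketch of that calling pattern, assuming dic is a string column of dict literals and that grouping by a is acceptable; the name expand_dic and the output schema are illustrative, not from the original post:

import ast
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType

# Illustrative output schema: keep a and b, expand the dict keys d and e.
out_schema = StructType([
    StructField("a", IntegerType()),
    StructField("b", IntegerType()),
    StructField("d", IntegerType()),
    StructField("e", IntegerType())])

@pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
def expand_dic(pdf):
    # pdf holds all rows of one group as a pandas DataFrame
    parsed = pdf["dic"].apply(ast.literal_eval)   # "{'d': 1, 'e': 2}" -> dict
    expanded = pd.DataFrame(list(parsed))         # one column per dict key
    return pd.concat([pdf[["a", "b"]].reset_index(drop=True), expanded], axis=1)

df.groupBy("a").apply(expand_dic).show(10)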
You can first transform it into a valid JSON string by replacing the single quotes with double quotes, then use from_json to convert it into a struct or map column.
If you know the schema of the dict, you can do it like this:
from pyspark.sql.functions import col, from_json, regexp_replace
from pyspark.sql.types import StructType, StructField, StringType

data = [
    (1, 2, "{'c': 1, 'd': 2}"),
    (3, 4, "{'c': 7, 'd': 0}"),
    (5, 6, "{'c': 5, 'd': 4}")
]
df = spark.createDataFrame(data, ["a", "b", "dic"])

schema = StructType([
    StructField("c", StringType(), True),
    StructField("d", StringType(), True)
])

df = df.withColumn("dic", from_json(regexp_replace(col("dic"), "'", "\""), schema))
df.select("a", "b", "dic.*").show(truncate=False)
#+---+---+---+---+
#|a |b |c |d |
#+---+---+---+---+
#|1 |2 |1 |2 |
#|3 |4 |7 |0 |
#|5 |6 |5 |4 |
#+---+---+---+---+
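If numeric columns are preferred, the same approach works with IntegerType in the schema, since from_json parses the JSON numbers directly; a small variation on the snippet above:

from pyspark.sql.types import IntegerType

schema = StructType([
    StructField("c", IntegerType(), True),
    StructField("d", IntegerType(), True)
])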
If you don't know all the keys, you can convert it to a map instead of a struct, then explode it and pivot to get the keys as columns:
from pyspark.sql.functions import explode, first
from pyspark.sql.types import MapType

df = df.withColumn("dic", from_json(regexp_replace(col("dic"), "'", "\""), MapType(StringType(), StringType())))\
    .select("a", "b", explode("dic"))\
    .groupBy("a", "b")\
    .pivot("key")\
    .agg(first("value"))
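With the same sample data this should produce the same four columns as the struct version, though the pivoted values come back as strings (cast them afterwards if you need integers) and the row order is not guaranteed:

df.show(truncate=False)
#+---+---+---+---+
#|a |b |c |d |
#+---+---+---+---+
#|1 |2 |1 |2 |
#|3 |4 |7 |0 |
#|5 |6 |5 |4 |
#+---+---+---+---+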
Try:
# Convert the pyspark DataFrame to pandas first:
df = df.toPandas()

# .str.get pulls the value for a key out of each dict in the column
df["d"] = df["dic"].str.get("d")
df["e"] = df["dic"].str.get("e")
df = df.drop(columns=["dic"])
Returns:
a b d e
0 1 2 1 2
1 3 4 7 0
2 5 6 5 4
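Keep in mind that toPandas() collects the whole DataFrame onto the driver, so this only suits data that fits in memory. If you need a Spark DataFrame again afterwards, you can convert the result back; a minimal sketch, assuming the default schema inference is acceptable:

# Hypothetical follow-up: bring the flattened pandas frame back into Spark.
spark_df = spark.createDataFrame(df)
spark_df.show()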