
Returning multiple columns from a single pyspark dataframe

I am trying to parse a single column of a pyspark dataframe and get a dataframe with multiple columns. My dataframe is as follows:

   a  b               dic
0  1  2  {'d': 1, 'e': 2}
1  3  4  {'d': 7, 'e': 0}
2  5  6  {'d': 5, 'e': 4}

I want to parse the dic column and get a dataframe as follows. I would like to use a pandas UDF if possible. My intended output is:

   a  b  c  d
0  1  2  1  2
1  3  4  7  0
2  5  6  5  4

Here is my attempted solution:

schema = StructType([
    StructField("c", IntegerType()),
    StructField("d", IntegerType())])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def do_something(dic_col):
    return pd.DataFrame(dic_col)

df.apply(do_something).show(10)

But this gives the error: 'DataFrame' object has no attribute 'apply'
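The error arises because a PySpark DataFrame has no `apply` method; a GROUPED_MAP pandas UDF is applied through `df.groupBy(...).apply(...)` instead. Independently of how it is wired into Spark, the UDF body itself must return a pandas DataFrame matching the declared schema. A minimal pure-pandas sketch of that expansion logic, using hypothetical sample data where the dic column already holds Python dicts:

```python
import pandas as pd

# Hypothetical group, as pandas would receive it inside a GROUPED_MAP UDF.
pdf = pd.DataFrame({
    "a": [1, 3, 5],
    "b": [2, 4, 6],
    "dic": [{"d": 1, "e": 2}, {"d": 7, "e": 0}, {"d": 5, "e": 4}],
})

def expand_dic(pdf):
    # Turn the dict column into its own columns, then drop the original.
    expanded = pd.DataFrame(pdf["dic"].tolist(), index=pdf.index)
    return pdf.drop(columns=["dic"]).join(expanded)

result = expand_dic(pdf)
print(result.columns.tolist())  # ['a', 'b', 'd', 'e']
```

This is only the transformation step; running it on Spark still requires registering it as a UDF and applying it via `groupBy(...).apply(...)`.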

You can first convert the column to a JSON string by replacing single quotes with double quotes, then use from_json to convert it into a struct or map column.

If you know the schema of the dict, you can do it like this:

from pyspark.sql.functions import col, from_json, regexp_replace
from pyspark.sql.types import StructType, StructField, StringType

data = [
    (1,   2,  "{'c': 1, 'd': 2}"),
    (3,   4,  "{'c': 7, 'd': 0}"),
    (5,   6,  "{'c': 5, 'd': 4}")
]

df = spark.createDataFrame(data, ["a", "b", "dic"])

schema = StructType([
    StructField("c", StringType(), True),
    StructField("d", StringType(), True)
])

df = df.withColumn("dic", from_json(regexp_replace(col("dic"), "'", "\""), schema))

df.select("a", "b", "dic.*").show(truncate=False)

#+---+---+---+---+
#|a  |b  |c  |d  |
#+---+---+---+---+
#|1  |2  |1  |2  |
#|3  |4  |7  |0  |
#|5  |6  |5  |4  |
#+---+---+---+---+

If you don't know all the keys, you can convert it to a map instead of a struct, then explode it and pivot to get the keys as columns:

from pyspark.sql.functions import col, explode, first, from_json, regexp_replace
from pyspark.sql.types import MapType, StringType

df = df.withColumn("dic", from_json(regexp_replace(col("dic"), "'", "\""), MapType(StringType(), StringType())))\
       .select("a", "b", explode("dic"))\
       .groupBy("a", "b")\
       .pivot("key")\
       .agg(first("value"))
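The explode-then-pivot reshaping can be illustrated outside Spark. A plain-Python sketch on the same sample data (string values, as the MapType version above produces): explode emits one record per map entry, and the pivot groups records back by (a, b) with the keys becoming columns.

```python
rows = [
    (1, 2, {"c": "1", "d": "2"}),
    (3, 4, {"c": "7", "d": "0"}),
    (5, 6, {"c": "5", "d": "4"}),
]

# "explode": one (a, b, key, value) record per map entry
exploded = [(a, b, k, v) for a, b, m in rows for k, v in m.items()]

# "pivot": regroup by (a, b), turning keys into columns
keys = sorted({k for _, _, k, _ in exploded})
grouped = {}
for a, b, k, v in exploded:
    grouped.setdefault((a, b), {})[k] = v
result = [(a, b) + tuple(d.get(k) for k in keys) for (a, b), d in grouped.items()]
print(result)  # [(1, 2, '1', '2'), (3, 4, '7', '0'), (5, 6, '5', '4')]
```

Spark does the same reshaping distributively; note that pivot on a large key set can be expensive, since it requires discovering the distinct keys first.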

Try:

# convert the pyspark df into pandas
df = df.toPandas()

df["d"] = df["dic"].str.get("d")
df["e"] = df["dic"].str.get("e")
df = df.drop(columns=["dic"])
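This assumes the dic column holds actual Python dicts after toPandas(). If it instead arrives as a single-quoted string (as in the Spark examples above), `ast.literal_eval` from the standard library can parse it directly, with no need to swap quote styles. A sketch on hypothetical sample data:

```python
import ast
import pandas as pd

# Hypothetical sample where dic is still a Python-literal string.
pdf = pd.DataFrame({
    "a": [1, 3, 5],
    "b": [2, 4, 6],
    "dic": ["{'d': 1, 'e': 2}", "{'d': 7, 'e': 0}", "{'d': 5, 'e': 4}"],
})

# literal_eval safely parses Python-literal dict strings (unlike eval).
pdf["dic"] = pdf["dic"].apply(ast.literal_eval)
pdf["d"] = pdf["dic"].str.get("d")
pdf["e"] = pdf["dic"].str.get("e")
pdf = pdf.drop(columns=["dic"])
```

Bear in mind that toPandas() collects the whole dataframe to the driver, so this route only suits data that fits in driver memory.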

Returns:

   a  b  d  e
0  1  2  1  2
1  3  4  7  0
2  5  6  5  4
