
Convert a list of dictionaries into pyspark dataframe

I have a list of dictionaries that looks like the following. Every dictionary is a list item.

my_list = [{"_id": 1, "name": "xxx"},
           {"_id": 2, "name": "yyy"},
           {"_id": 3, "_name": "zzz"}]

I am trying to convert the list into a pyspark dataframe, with every dictionary being a row.

from pyspark.sql.types import StringType

df = spark.createDataFrame(my_list, StringType())

df.show()

My ideal result is the following:

+-----------------------+
|                    dic|
+-----------------------+
| {"_id":1,"name":"xxx"}|
| {"_id":2,"name":"yyy"}|
|{"_id":3,"_name":"zzz"}|
+-----------------------+

But I got this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 25.0 failed 4 times, most recent failure: Lost task 0.3 in stage 25.0 (TID 95, 10.0.16.11, executor 0): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):

What's wrong with my code?

Spark might have difficulty casting the Python dictionaries to strings. You can convert the dictionaries to strings yourself before creating the dataframe:

df = spark.createDataFrame([str(i) for i in my_list], StringType())
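One caveat with this approach, sketched below in plain Python (no Spark session needed): `str()` on a dict produces Python's repr, which uses single quotes and is not valid JSON, so the stored strings are awkward to parse later.

```python
import json

my_list = [{"_id": 1, "name": "xxx"},
           {"_id": 2, "name": "yyy"},
           {"_id": 3, "_name": "zzz"}]

# str() gives Python's repr, which uses single quotes:
s = str(my_list[0])
print(s)  # {'_id': 1, 'name': 'xxx'}

# Repr strings are not valid JSON, so a JSON parser rejects them:
try:
    json.loads(s)
    is_valid_json = True
except json.JSONDecodeError:
    is_valid_json = False
print(is_valid_json)  # False
```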

You need to convert the dicts into strings before creating the dataframe. However, I'd suggest not storing the values as stringified dicts: they wouldn't be easy to parse for further transformations later. Use JSON strings instead:

import json

df = spark.createDataFrame([[json.dumps(d)] for d in my_list], ["dict"])

df.show(truncate=False)

#+--------------------------+
#|dict                      |
#+--------------------------+
#|{"_id": 1, "name": "xxx"} |
#|{"_id": 2, "name": "yyy"} |
#|{"_id": 3, "_name": "zzz"}|
#+--------------------------+
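As a plain-Python sketch of why the JSON route is friendlier: strings produced by `json.dumps` round-trip cleanly back into dicts, so later transformations (in Spark itself, e.g. via `pyspark.sql.functions.from_json` with a schema) stay straightforward.

```python
import json

my_list = [{"_id": 1, "name": "xxx"},
           {"_id": 2, "name": "yyy"},
           {"_id": 3, "_name": "zzz"}]

# json.dumps() produces double-quoted, standards-compliant JSON:
rows = [json.dumps(d) for d in my_list]
print(rows[0])  # {"_id": 1, "name": "xxx"}

# The stored strings parse back into the original dicts:
parsed = [json.loads(r) for r in rows]
print(parsed == my_list)  # True
```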
