
Write a pyspark.sql.dataframe.DataFrame without losing information

I am trying to save a pyspark.sql.dataframe.DataFrame in CSV format (it could also be another format, as long as it is easily readable).

So far I have found several examples of saving a DataFrame, but every time I write it, information is lost.

Example dataset:

# Create an example Pyspark DataFrame

from pyspark.sql import Row

Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee('A', 'AA', 'mail1', 100000)
employee2 = Employee('B', 'BB', 'mail2', 120000)
employee3 = Employee('C', None, 'mail3', 140000)
employee4 = Employee('D', 'DD', 'mail4', 160000)
employee5 = Employee('E', 'EE', 'mail5', 160000)

department1 = Row(id='123', name='HR')
department2 = Row(id='456', name='OPS')
department3 = Row(id='789', name='FN')
department4 = Row(id='101112', name='DEV')

departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2, employee5])
departmentWithEmployees2 = Row(department=department2, employees=[employee3, employee4])
departmentWithEmployees3 = Row(department=department3, employees=[employee1, employee4, employee3])
departmentWithEmployees4 = Row(department=department4, employees=[employee2, employee3])

# Only the first two departments are used in this example
departmentsWithEmployees_Seq = [departmentWithEmployees1, departmentWithEmployees2]
dframe = spark.createDataFrame(departmentsWithEmployees_Seq)

To save this as a CSV file, I first tried this solution:

type(dframe)
Out[]: pyspark.sql.dataframe.DataFrame
dframe.write.csv('junk_mycsv.csv')

Unfortunately, this resulted in the following error:

org.apache.spark.sql.AnalysisException: CSV data source does not support struct<id:string,name:string> data type.; 

That is why I tried another possibility: converting the Spark DataFrame to a Pandas DataFrame and then saving it.

pandas_df = dframe.toPandas()
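
The save step itself is not shown above; a minimal sketch of what it could look like, assuming the standard Pandas to_csv API (the file name junk_pandas.csv is just a placeholder):

pandas_df.to_csv('junk_pandas.csv', index=False)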

That works fine! However, when I display my data, information is missing:

print(pandas_df.head())

department                                          employees
0   (123, HR)  [(A, AA, mail1, 100000), (B, BB, mail2, 120000...
1  (456, OPS)  [(C, None, mail3, 140000), (D, DD, mail4, 1600...

As you can see in the snapshot below, information is missing, because the data should look like this:

department              employees
0  id:123, name:HR      firstName: A, lastName: AA, email: mail1, salary: 100000

# Info is missing, like 'id', 'name', 'firstName', 'lastName', 'email', etc.
# For the complete expected example, see the screenshot below.

Expected data format

FYI: I am working in Databricks, with Python.

So, how can I write my data (dframe from the example above) without losing information?

Many thanks in advance!

Edit: added a picture for Pault, to show the format of the csv (and the headers).

Edit2: replaced the picture with example csv output:

After running Pault's code:

from pyspark.sql.functions import to_json
dframe.select(*[to_json(c).alias(c) for c in dframe.columns])\
    .repartition(1).write.csv("junk_mycsv.csv", header= True)

The output is not tidy, since most of the column headers are empty (due to the nested format?). Copying only the first row:

department           employees              (empty ColName)     (empty ColName)   (and so on)
{\id\":\"123\"       \"name\":\"HR\"}"     [{\firstName\":\"A\"  \"lastName\":\"AA\"    (...)

Your dataframe has the following schema:

dframe.printSchema()
#root
# |-- department: struct (nullable = true)
# |    |-- id: string (nullable = true)
# |    |-- name: string (nullable = true)
# |-- employees: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- firstName: string (nullable = true)
# |    |    |-- lastName: string (nullable = true)
# |    |    |-- email: string (nullable = true)
# |    |    |-- salary: long (nullable = true)

So the department column is a StructType with two named fields, and the employees column is an array of structs with four named fields. It looks like what you want is to write the data in a format that preserves both the key and the value of each record.

One option is to write the file in JSON format instead of CSV:

dframe.write.json("junk.json")

This produces the following output:

{"department":{"id":"123","name":"HR"},"employees":[{"firstName":"A","lastName":"AA","email":"mail1","salary":100000},{"firstName":"B","lastName":"BB","email":"mail2","salary":120000},{"firstName":"E","lastName":"EE","email":"mail5","salary":160000}]}
{"department":{"id":"456","name":"OPS"},"employees":[{"firstName":"C","email":"mail3","salary":140000},{"firstName":"D","lastName":"DD","email":"mail4","salary":160000}]}

Alternatively, if you want to keep the CSV format, you can use to_json to convert each column to JSON before writing the CSV:

# looping over all columns
# but you can also just limit this to the columns you want to convert

from pyspark.sql.functions import to_json
dframe.select(*[to_json(c).alias(c) for c in dframe.columns])\
    .write.csv("junk_mycsv.csv")

This produces the following output:

"{\"id\":\"123\",\"name\":\"HR\"}","[{\"firstName\":\"A\",\"lastName\":\"AA\",\"email\":\"mail1\",\"salary\":100000},{\"firstName\":\"B\",\"lastName\":\"BB\",\"email\":\"mail2\",\"salary\":120000},{\"firstName\":\"E\",\"lastName\":\"EE\",\"email\":\"mail5\",\"salary\":160000}]"
"{\"id\":\"456\",\"name\":\"OPS\"}","[{\"firstName\":\"C\",\"email\":\"mail3\",\"salary\":140000},{\"firstName\":\"D\",\"lastName\":\"DD\",\"email\":\"mail4\",\"salary\":160000}]"

Note that the double quotes are escaped.
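
If you later need the original structure back from this CSV, the JSON strings can be parsed with from_json, reusing the schema of the original dataframe (a minimal sketch, assuming the same dframe and file path as above):

from pyspark.sql.functions import from_json

# map each column name to its original data type
schemas = {f.name: f.dataType for f in dframe.schema.fields}

restored = spark.read.csv("junk_mycsv.csv")\
    .toDF(*dframe.columns)\
    .select(*[from_json(c, schemas[c]).alias(c) for c in dframe.columns])
restored.printSchema()  # matches the schema of dframe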
