Write a pyspark.sql.dataframe.DataFrame without losing information
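I am trying to save a pyspark.sql.dataframe.DataFrame in CSV format (another format would also be fine, as long as it is easy to read).
So far I have found several examples of saving a DataFrame. However, every time I write it, information is lost.
Example dataset: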
# Create an example Pyspark DataFrame
from pyspark.sql import Row
Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee('A', 'AA', 'mail1', 100000)
employee2 = Employee('B', 'BB', 'mail2', 120000 )
employee3 = Employee('C', None, 'mail3', 140000 )
employee4 = Employee('D', 'DD', 'mail4', 160000 )
employee5 = Employee('E', 'EE', 'mail5', 160000 )
department1 = Row(id='123', name='HR')
department2 = Row(id='456', name='OPS')
department3 = Row(id='789', name='FN')
department4 = Row(id='101112', name='DEV')
departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2, employee5])
departmentWithEmployees2 = Row(department=department2, employees=[employee3, employee4])
departmentWithEmployees3 = Row(department=department3, employees=[employee1, employee4, employee3])
departmentWithEmployees4 = Row(department=department4, employees=[employee2, employee3])
departmentsWithEmployees_Seq = [departmentWithEmployees1, departmentWithEmployees2]
dframe = spark.createDataFrame(departmentsWithEmployees_Seq)
To save this as a CSV file, I first tried this solution:
type(dframe)
Out[]: pyspark.sql.dataframe.DataFrame
dframe.write.csv('junk_mycsv.csv')
Unfortunately, this results in the following error:
org.apache.spark.sql.AnalysisException: CSV data source does not support struct<id:string,name:string> data type.;
That is why I tried another possibility: converting the Spark DataFrame to a pandas DataFrame and then saving it, as described in this example.
pandas_df = dframe.toPandas()
That works fine! However, when I display my data, some of it is missing:
print(pandas_df.head())
department employees
0 (123, HR) [(A, AA, mail1, 100000), (B, BB, mail2, 120000...
1 (456, OPS) [(C, None, mail3, 140000), (D, DD, mail4, 1600...
As the snippet below shows, information is missing. The data should look like this:
department employees
0 id:123, name:HR firstName: A, lastName: AA, email: mail1, salary: 100000
# Info is missing like 'id', 'name', 'firstName', 'lastName', 'email' etc.
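# For the complete expected example, see screenshot below.
FYI: I am working in Databricks with Python.
So how can I write my data (dframe from the example above) without losing information?
Many thanks in advance!
Edit: added a picture for Pault, to show the format of the CSV (and its headers).
Edit 2: replaced the picture with example CSV output:
After running Pault's code: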
from pyspark.sql.functions import to_json
dframe.select(*[to_json(c).alias(c) for c in dframe.columns])\
.repartition(1).write.csv("junk_mycsv.csv", header= True)
The output is not tidy, because most column headers are empty (due to the nested format?). Copying only the first row:
department employees (empty ColName) (empty ColName) (and so on)
{\id\":\"123\" \"name\":\"HR\"}" [{\firstName\":\"A\" \"lastName\":\"AA\" (...)
Your DataFrame has the following schema:
dframe.printSchema()
#root
# |-- department: struct (nullable = true)
# | |-- id: string (nullable = true)
# | |-- name: string (nullable = true)
# |-- employees: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- firstName: string (nullable = true)
# | | |-- lastName: string (nullable = true)
# | | |-- email: string (nullable = true)
# | | |-- salary: long (nullable = true)
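So the department column is a StructType with two named fields, and the employees column is an array of structs with four named fields. It looks like what you want is to write the data in a format that preserves the key and value of each record.
One option is to write the file in JSON format instead of CSV: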
dframe.write.json("junk.json")
This produces the following output:
{"department":{"id":"123","name":"HR"},"employees":[{"firstName":"A","lastName":"AA","email":"mail1","salary":100000},{"firstName":"B","lastName":"BB","email":"mail2","salary":120000},{"firstName":"E","lastName":"EE","email":"mail5","salary":160000}]}
{"department":{"id":"456","name":"OPS"},"employees":[{"firstName":"C","email":"mail3","salary":140000},{"firstName":"D","lastName":"DD","email":"mail4","salary":160000}]}
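To check that nothing was lost, the output can be read back with Python's standard json module. Spark writes a directory of part files, each holding one JSON object per line, so every line parses on its own. The sketch below is only an illustration: it hard-codes the second output line instead of opening a part file, and the variable names are made up for the example. Note that the lastName key is absent for employee C, whose value was None, so it is read with .get().

```python
import json

# One line of the JSON Lines output shown above (the second record).
line = (
    '{"department":{"id":"456","name":"OPS"},'
    '"employees":[{"firstName":"C","email":"mail3","salary":140000},'
    '{"firstName":"D","lastName":"DD","email":"mail4","salary":160000}]}'
)

record = json.loads(line)

# Field names survive the round trip. A key that was left out because its
# value was null is read with .get(), which returns None when it is missing.
last_names = [emp.get("lastName") for emp in record["employees"]]
department_name = record["department"]["name"]

print(department_name, last_names)
```

Reading the missing key with record["employees"][0]["lastName"] would raise a KeyError, which is why .get() is used here.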
Alternatively, if you want to keep it in CSV format, you can use to_json to convert each column to JSON before writing to CSV.
# looping over all columns
# but you can also just limit this to the columns you want to convert
from pyspark.sql.functions import to_json
dframe.select(*[to_json(c).alias(c) for c in dframe.columns])\
.write.csv("junk_mycsv.csv")
This produces the following output:
"{\"id\":\"123\",\"name\":\"HR\"}","[{\"firstName\":\"A\",\"lastName\":\"AA\",\"email\":\"mail1\",\"salary\":100000},{\"firstName\":\"B\",\"lastName\":\"BB\",\"email\":\"mail2\",\"salary\":120000},{\"firstName\":\"E\",\"lastName\":\"EE\",\"email\":\"mail5\",\"salary\":160000}]"
"{\"id\":\"456\",\"name\":\"OPS\"}","[{\"firstName\":\"C\",\"email\":\"mail3\",\"salary\":140000},{\"firstName\":\"D\",\"lastName\":\"DD\",\"email\":\"mail4\",\"salary\":160000}]"
Note that the double quotes are escaped.
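Because the quotes are escaped with a backslash rather than doubled, a reader that expects doubled quotes will split the JSON cells at their inner commas, which may be what produces the extra empty-header columns. As a rough sketch, Python's standard csv module can be told about the backslash escaping and each cell can then be decoded with json. The sample line is shortened from the output above, and read_json_csv is a hypothetical helper name, not part of any library.

```python
import csv
import io
import json

# A shortened line in the same shape as the output above: fields are wrapped
# in double quotes and the quotes inside them are escaped with a backslash.
sample = r'"{\"id\":\"123\",\"name\":\"HR\"}","[{\"firstName\":\"A\",\"salary\":100000}]"' + "\n"

def read_json_csv(text):
    """Parse backslash-escaped CSV text and decode every cell as JSON."""
    reader = csv.reader(
        io.StringIO(text),
        quotechar='"',
        escapechar="\\",    # a backslash marks an escaped character
        doublequote=False,  # quotes are not escaped by doubling them
    )
    return [[json.loads(cell) for cell in row] for row in reader]

rows = read_json_csv(sample)
print(rows[0][0]["name"], rows[0][1][0]["salary"])
```

This assumes the JSON values themselves contain no backslashes; data with escaped characters inside the strings would need more careful handling.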