Save dictionary as a pyspark Dataframe and load it - Python, Databricks

I have a dictionary as follows:

my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}

I want to save this dictionary in Databricks so that I do not have to build it again every time I want to start working with it. Furthermore, I would like to know how to retrieve it and have it in its original form again.

I have tried doing the following:

from itertools import zip_longest 

# split the dict into a tuple of column names and a tuple of per-column value lists
column_names, data = zip(*my_dict.items())
# transpose the value lists into rows, padding with None where lengths differ
spark.createDataFrame(zip_longest(*data), column_names).show()

and

column_names, data = zip(*dict_brands.items())

# same idea, but zip truncates to the shortest value list instead of padding
spark.createDataFrame(zip(*data), column_names).show()

However, I get the following error:

zip_longest argument #10342 must support iteration
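For what it is worth, the zip_longest attempt above does run on the small sample dictionary; since the message points at argument #10342, the real dictionary presumably has thousands of keys and at least one value that is not iterable. A minimal sketch that reproduces the error under that assumption, with a hypothetical scalar value:

from itertools import zip_longest

# hypothetical reproduction: the value of 'b' is a scalar, not a list
bad_dict = {'a': [12, 15.2, 52.1], 'b': 2.5}
column_names, data = zip(*bad_dict.items())
rows = zip_longest(*data)  # raises TypeError: zip_longest argument #2 must support iteration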

I also do not know how to save it or reload it. I tried with a sample dataframe (not the same one), as follows:

df.write.format("tfrecords").mode("overwrite").save('/data/tmp/my_df')

And the error is:

Attribute name "my_column" contains invalid character(s) among ",;{}()\n\t=". Please use alias to rename it.
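Following the message's own suggestion, this kind of error can usually be cleared by aliasing every offending column to a cleaned-up name before writing. A minimal sketch, where the clean helper is hypothetical and assumes the real column names (unlike "my_column" shown here) contain one of the listed characters:

import re
from pyspark.sql.functions import col

def clean(name):
    # replace every character from the invalid set with '_'
    return re.sub(r'[ ,;{}()\n\t=]', '_', name)

df = df.select(*[col(c).alias(clean(c)) for c in df.columns])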

Finally, in order to obtain it, I thought about:

my_df = spark.table("my_df") # Get table
df = my_df.toPandas() # Make pd dataframe

and then turn it into a dictionary, but maybe there is an easier way than making it a dataframe, retrieving it as a dataframe, and converting it back into a dictionary again.
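Note that spark.table("my_df") only resolves names registered as tables in the metastore; a dataframe saved to a plain path has to be read back with spark.read instead. A minimal sketch of the two variants, using the sample df above:

df.write.format("parquet").mode("overwrite").saveAsTable("my_df")  # readable later via spark.table("my_df")

df.write.format("parquet").mode("overwrite").save('/data/tmp/my_df')  # readable later via spark.read.format("parquet").load('/data/tmp/my_df')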

I would also like to know the computational cost of the solutions, since the actual dataset is very large.

Here is my sample code for realizing your needs step by step; a combined end-to-end sketch follows the list.

  1. Convert a dictionary to a Pandas dataframe

     import pandas as pd

     my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
     pdf = pd.DataFrame(my_dict)


  2. Convert a Pandas dataframe to a PySpark dataframe

     df = spark.createDataFrame(pdf)


  3. Save the PySpark dataframe to a file using parquet format. The tfrecords format is not supported here.

     df.write.format("parquet").mode("overwrite").save('/data/tmp/my_df')


  4. Load the saved file above as a PySpark dataframe.

     df2 = spark.read.format("parquet").load('/data/tmp/my_df')


  5. Convert the PySpark dataframe back to a dictionary.

     my_dict2 = df2.toPandas().to_dict('list')  # orient='list' restores the original {column: [values]} shape

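Putting the steps above together, a minimal end-to-end sketch, assuming a Databricks notebook where spark is predefined:

import pandas as pd

my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}

# dict -> pandas -> PySpark, then persist as parquet
pdf = pd.DataFrame(my_dict)
df = spark.createDataFrame(pdf)
df.write.format("parquet").mode("overwrite").save('/data/tmp/my_df')

# later, read it back and rebuild the dictionary
df2 = spark.read.format("parquet").load('/data/tmp/my_df')
my_dict2 = df2.toPandas().to_dict('list')

# should hold for this small single-partition example; Spark does not
# guarantee row order in general
assert my_dict2 == my_dict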

The computational cost of the code above depends on the memory usage of your actual dataset. Note in particular that toPandas() collects the whole dataset onto the driver, so it is the step most likely to be expensive for a very large dataset, while the parquet read and write are distributed.
