Save dictionary as a pyspark Dataframe and load it - Python, Databricks

I have a dictionary as follows:

my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}

I want to save this dictionary in Databricks so that I do not have to build it again every time I want to start working with it. Furthermore, I would like to know how to retrieve it and have it in its original form again.

I have tried doing the following:

from itertools import zip_longest 

# split the dict into a tuple of column names and a tuple of per-column value lists
column_names, data = zip(*my_dict.items())
# transpose the value lists into rows, padding with None where lengths differ
spark.createDataFrame(zip_longest(*data), column_names).show()

and

column_names, data = zip(*dict_brands.items())

# same idea, but zip truncates to the shortest value list instead of padding
spark.createDataFrame(zip(*data), column_names).show()

However, I get the following error:

zip_longest argument #10342 must support iteration
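For what it is worth, the zip_longest attempt above does run on the small sample dictionary; since the message points at argument #10342, the real dictionary presumably has thousands of keys and at least one value that is not iterable. A minimal sketch that reproduces the error under that assumption, with a hypothetical scalar value:

from itertools import zip_longest

# hypothetical reproduction: the value of 'b' is a scalar, not a list
bad_dict = {'a': [12, 15.2, 52.1], 'b': 2.5}
column_names, data = zip(*bad_dict.items())
rows = zip_longest(*data)  # raises TypeError: zip_longest argument #2 must support iteration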

I also do not know how to save it or reload it. I tried with a sample dataframe (not the same one), as follows:

df.write.format("tfrecords").mode("overwrite").save('/data/tmp/my_df')

And the error is:

Attribute name "my_column" contains invalid character(s) among ",;{}()\n\t=". Please use alias to rename it.
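Following the message's own suggestion, this kind of error can usually be cleared by aliasing every offending column to a cleaned-up name before writing. A minimal sketch, where the clean helper is hypothetical and assumes the real column names (unlike "my_column" shown here) contain one of the listed characters:

import re
from pyspark.sql.functions import col

def clean(name):
    # replace every character from the invalid set with '_'
    return re.sub(r'[ ,;{}()\n\t=]', '_', name)

df = df.select(*[col(c).alias(clean(c)) for c in df.columns])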

Finally, in order to obtain it, I thought about:

my_df = spark.table("my_df") # Get table
df = my_df.toPandas() # Make pd dataframe

and then turn it into a dictionary, but maybe there is an easier way than making it a dataframe, retrieving it as a dataframe, and converting it back into a dictionary again.
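Note that spark.table("my_df") only resolves names registered as tables in the metastore; a dataframe saved to a plain path has to be read back with spark.read instead. A minimal sketch of the two variants, using the sample df above:

df.write.format("parquet").mode("overwrite").saveAsTable("my_df")  # readable later via spark.table("my_df")

df.write.format("parquet").mode("overwrite").save('/data/tmp/my_df')  # readable later via spark.read.format("parquet").load('/data/tmp/my_df')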

I would also like to know the computational cost of the solutions, since the actual dataset is very large.

Here is my sample code for realizing your needs step by step; a combined end-to-end sketch follows the list.

  1. Convert a dictionary to a Pandas dataframe

     import pandas as pd

     my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
     pdf = pd.DataFrame(my_dict)


  2. Convert a Pandas dataframe to a PySpark dataframe

     df = spark.createDataFrame(pdf)


  3. Save the PySpark dataframe to a file using parquet format. The tfrecords format is not supported here.

     df.write.format("parquet").mode("overwrite").save('/data/tmp/my_df')


  4. Load the saved file above as a PySpark dataframe.

     df2 = spark.read.format("parquet").load('/data/tmp/my_df')


  5. Convert the PySpark dataframe back to a dictionary.

     my_dict2 = df2.toPandas().to_dict('list')  # orient='list' restores the original {column: [values]} shape

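Putting the steps above together, a minimal end-to-end sketch, assuming a Databricks notebook where spark is predefined:

import pandas as pd

my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}

# dict -> pandas -> PySpark, then persist as parquet
pdf = pd.DataFrame(my_dict)
df = spark.createDataFrame(pdf)
df.write.format("parquet").mode("overwrite").save('/data/tmp/my_df')

# later, read it back and rebuild the dictionary
df2 = spark.read.format("parquet").load('/data/tmp/my_df')
my_dict2 = df2.toPandas().to_dict('list')

# should hold for this small single-partition example; Spark does not
# guarantee row order in general
assert my_dict2 == my_dict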

The computational cost of the code above depends on the memory usage of your actual dataset. Note in particular that toPandas() collects the whole dataset onto the driver, so it is the step most likely to be expensive for a very large dataset, while the parquet read and write are distributed.
