
Convert a pyspark.sql.dataframe.DataFrame to a Python dictionary

I have a pyspark DataFrame and I need to convert it into a Python dictionary.

The code below is reproducible:

from pyspark.sql import Row

# sc is the SparkContext that the pyspark shell provides
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),
                      Row(name='Alice', age=5, height=80),
                      Row(name='Alice', age=10, height=80)])
df = rdd.toDF()

Once I have this DataFrame, I need to convert it into a dictionary.

I tried this:

df.set_index('name').to_dict()

But it gives an error. How can I achieve this?

You need to first convert to a pandas.DataFrame using toPandas(), then you can use the to_dict() method on the transposed dataframe with orient='list':

df.toPandas().set_index('name').T.to_dict('list')
# Out[1]: {u'Alice': [10, 80]}
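Note that to_dict() keeps only one value per key, so the two earlier Alice rows are overwritten by the last one. If you want to preserve every row instead, a minimal sketch using a pandas groupby (the result variable name is illustrative, not from the original answer):

pdf = df.toPandas()
# collect the rows for each name into a list instead of overwriting duplicates
result = {name: group[['age', 'height']].values.tolist()
          for name, group in pdf.groupby('name')}
# e.g. {u'Alice': [[5, 80], [5, 80], [10, 80]]}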

Please see the example below:

>>> from pyspark.sql.functions import col
>>> df = (sc.textFile('data.txt')
            .map(lambda line: line.split(","))
            .toDF(['name','age','height'])
            .select(col('name'), col('age').cast('int'), col('height').cast('int')))

+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice|  5|    80|
|  Bob|  5|    80|
|Alice| 10|    80|
+-----+---+------+

>>> list_persons = map(lambda row: row.asDict(), df.collect())
>>> list_persons
[
    {'age': 5, 'name': u'Alice', 'height': 80}, 
    {'age': 5, 'name': u'Bob', 'height': 80}, 
    {'age': 10, 'name': u'Alice', 'height': 80}
]

>>> dict_persons = {person['name']: person for person in list_persons}
>>> dict_persons
{u'Bob': {'age': 5, 'name': u'Bob', 'height': 80}, u'Alice': {'age': 10, 'name': u'Alice', 'height': 80}}

The input data.txt that I'm using to test:

Alice,5,80
Bob,5,80
Alice,10,80

First we load the data with pyspark by reading the lines. Then we convert the lines to columns by splitting on the comma. Then we convert the native RDD to a DataFrame and add the column names. Finally we cast the columns to the appropriate types.

Then we collect everything to the driver and, using a Python dictionary comprehension, convert the data to the preferred form. We convert each Row object to a dictionary using the asDict() method. In the output we can observe that Alice appears only once, but this is of course because the key 'Alice' gets overwritten.
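If you need to keep both Alice entries rather than letting the later one win, a minimal sketch on the driver side, assuming the collected list_persons from above (persons_by_name is an illustrative name):

from collections import defaultdict

persons_by_name = defaultdict(list)
for person in list_persons:
    persons_by_name[person['name']].append(person)
# persons_by_name[u'Alice'] now holds both Alice dicts instead of just the last one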

Please keep in mind that you want to do all the processing and filtering inside pyspark before returning the result to the driver.
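For instance, a small sketch of filtering in Spark first (the age threshold is purely illustrative; the output assumes the data.txt example above):

>>> filtered = df.filter(col('age') > 5)   # runs distributed, before any collect()
>>> {row['name']: row.asDict() for row in filtered.collect()}
{u'Alice': {'age': 10, 'name': u'Alice', 'height': 80}}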

Hope this helps, cheers.

Row objects in an RDD have a built-in asDict() method that allows you to represent each row as a dict.

If you have a DataFrame df, you need to convert it to an RDD and apply asDict().

new_rdd = df.rdd.map(lambda row: row.asDict(True))

One can then use new_rdd to perform normal Python map operations, like:

# You can define normal python functions like below and plug them when needed
def transform(row):
    # Add a new key to each row
    row["new_key"] = "my_new_value"
    return row

new_rdd = new_rdd.map(transform)  # equivalent to map(lambda row: transform(row))
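Nothing runs until an action is triggered, so to materialize the result collect it back to the driver (a sketch, assuming the example DataFrame from the question):

result = new_rdd.collect()
# each dict in result now carries the extra 'new_key' entry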

If you have rows nested within rows, you can use row.asDict(recursive=True) to convert the nested rows as well.
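A minimal sketch of the recursive case (the nested address field is purely illustrative):

from pyspark.sql import Row

nested = Row(name='Alice', address=Row(city='Amsterdam', zip='1234'))
nested.asDict(recursive=True)
# {'name': 'Alice', 'address': {'city': 'Amsterdam', 'zip': '1234'}}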


