Convert a pyspark.sql.dataframe.DataFrame to a dictionary
I have a pyspark DataFrame and I need to convert it into a Python dictionary.
The code below is reproducible:
from pyspark.sql import Row
rdd = sc.parallelize([Row(name='Alice', age=5, height=80),Row(name='Alice', age=5, height=80),Row(name='Alice', age=10, height=80)])
df = rdd.toDF()
Once I have this dataframe, I need to convert it into a dictionary.
I tried this:
df.set_index('name').to_dict()
But it gives an error. How can I achieve this?
You need to first convert to a pandas.DataFrame using toPandas(), then you can use the to_dict() method on the transposed dataframe with orient='list':
df.toPandas().set_index('name').T.to_dict('list')
# Out[1]: {u'Alice': [10, 80]}
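To see what the pandas step does in isolation, here is a minimal sketch with a plain pandas DataFrame standing in for the result of toPandas(). The names are made unique here, because with duplicate names later columns overwrite earlier ones in the resulting dict (which is why only one Alice survives above):

```python
import pandas as pd

# Stand-in for df.toPandas(): a plain pandas DataFrame with unique names
pdf = pd.DataFrame([
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Bob", "age": 5, "height": 80},
    {"name": "Carol", "age": 10, "height": 80},
])

# Index by name, transpose, and emit each column as a list [age, height]
result = pdf.set_index("name").T.to_dict("list")
print(result)  # {'Alice': [5, 80], 'Bob': [5, 80], 'Carol': [10, 80]}
```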
Please see the example below:
>>> from pyspark.sql.functions import col
>>> df = (sc.textFile('data.txt')
.map(lambda line: line.split(","))
.toDF(['name','age','height'])
.select(col('name'), col('age').cast('int'), col('height').cast('int')))
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice| 5| 80|
| Bob| 5| 80|
|Alice| 10| 80|
+-----+---+------+
>>> list_persons = map(lambda row: row.asDict(), df.collect())
>>> list_persons
[
{'age': 5, 'name': u'Alice', 'height': 80},
{'age': 5, 'name': u'Bob', 'height': 80},
{'age': 10, 'name': u'Alice', 'height': 80}
]
>>> dict_persons = {person['name']: person for person in list_persons}
>>> dict_persons
{u'Bob': {'age': 5, 'name': u'Bob', 'height': 80}, u'Alice': {'age': 10, 'name': u'Alice', 'height': 80}}
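The collect-plus-comprehension step can be sketched with plain dicts in place of pyspark Row objects, so no Spark is needed to see the key-overwrite behavior:

```python
# Plain-dict stand-ins for the collected Row objects
list_persons = [
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Bob", "age": 5, "height": 80},
    {"name": "Alice", "age": 10, "height": 80},
]

# Keyed by name: a later row with the same name overwrites an earlier one
dict_persons = {person["name"]: person for person in list_persons}
print(dict_persons["Alice"]["age"])  # 10 (the second Alice row wins)
```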
The input data.txt that I'm using to test:
Alice,5,80
Bob,5,80
Alice,10,80
First we load the data with pyspark by reading the lines. Then we convert the lines to columns by splitting on the comma. Then we convert the native RDD to a DataFrame and add names to the columns. Finally we cast the columns to the appropriate types.
Then we collect everything to the driver, and using a Python list comprehension we convert the data to the preferred form. We convert each Row object to a dictionary using the asDict() method. In the output we can observe that Alice appears only once, but this is of course because the key 'Alice' gets overwritten.
Please keep in mind that you want to do all the processing and filtering inside pyspark before returning the result to the driver.
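If you need to keep all rows per name rather than letting later ones overwrite earlier ones, one option after collecting is to group them, shown here as a plain-Python sketch (in practice you would rather aggregate with groupBy inside pyspark before collecting):

```python
from collections import defaultdict

# Plain-dict stand-ins for the collected rows
rows = [
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Bob", "age": 5, "height": 80},
    {"name": "Alice", "age": 10, "height": 80},
]

# Group rows by name so duplicates are kept instead of overwritten
by_name = defaultdict(list)
for row in rows:
    by_name[row["name"]].append(row)

print(len(by_name["Alice"]))  # 2 (both Alice rows survive)
```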
Hope this helps, cheers.
Row objects in an RDD have a built-in asDict() method that represents each row as a dict. If you have a dataframe df, you need to convert it to an RDD and apply asDict():
new_rdd = df.rdd.map(lambda row: row.asDict(True))
One can then use new_rdd to perform normal Python map operations, like:
# You can define normal Python functions like the one below and plug them in when needed
def transform(row):
    # Add a new key to each row
    row["new_key"] = "my_new_value"
    return row

new_rdd = new_rdd.map(lambda row: transform(row))
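The same idea can be tried with plain dicts and the built-in map, restating transform locally so the sketch is self-contained (no Spark required):

```python
# Local restatement of the transform above, applied with the built-in map
def transform(row):
    # Add a new key to each row
    row["new_key"] = "my_new_value"
    return row

rows = [{"name": "Alice", "age": 5}, {"name": "Bob", "age": 5}]
transformed = list(map(transform, rows))
print(transformed[0]["new_key"])  # my_new_value
```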
If there are rows embedded within a row, you can do row.asDict(recursive=True).