
Pyspark Convert RDD of tuples to Dataframe

I have an RDD of tuples where the first two rows look like this:

[[('n', 12.012457082117459), ('s', 0.79112758892014912)],
[('t', 3.6243409329763652),('vn', 3.6243409329763652),('n', 52.743253562212828),('v', 11.644347760553064)]]

In each tuple, the first value, e.g. 'n', 's', 't', is the desired column name, and the second value, e.g. 12.012, 0.7911..., is the desired value for that column. However, not every column name appears in every list (row) of the RDD. For example, the first row contains only

'n', 's' 

while there is no

's' 

in the second row. So I want to convert this RDD to a dataframe in which the value is 0 for any column that does not appear in a row's original tuples. In other words, the first two rows might look like this:

n     s      t       vn     omitted.....
12    0.79   0       0      ..... 
52    0      3.62    3.62    .......

I tried the following:

row = Row('l','eng','q','g','j','b','nt','z','n','d','f','i','k','s','vn','nz','v','nrt','tg','nrfg','t','ng','zg','a')
df = tup_sum_data.map(row).toDF()

where the strings in Row() are my desired column names. But I got the following error:

TypeError                                 Traceback (most recent call last)
/Users/1/Documents/spark/python/pyspark/sql/types.py in _infer_type(obj)
968         try:
--> 969             return _infer_schema(obj)
970         except TypeError:

/Users/1/Documents/spark/python/pyspark/sql/types.py in _infer_schema(row)
991     else:
--> 992         raise TypeError("Can not infer schema for type: %s" % type(row))
993 

TypeError: Can not infer schema for type: <class 'numpy.float64'>

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/Users/1/Documents/spark/python/pyspark/sql/types.py in _infer_type(obj)
968         try:
--> 969             return _infer_schema(obj)
970         except TypeError:

/Users/1/Documents/spark/python/pyspark/sql/types.py in _infer_type(obj)
969             return _infer_schema(obj)
970         except TypeError:
--> 971             raise TypeError("not supported type: %s" % type(obj))
972 
973 

TypeError: not supported type: <class 'tuple'>

Some lines of the error output are omitted. Could anyone help me figure out how to deal with this? Thank you!

UPDATE: I converted the data types from np.float64 to float, and the error went away. However, the dataframe does not look like what I wanted; it looks like this:

+--------------------+
|                   l|
+--------------------+
|[[n,12.0124570821...|
|[[t,3.62434093297...|
|[[a,0.44628710262...|
|[[n,16.7534769832...|
|[[n,17.6017774340...|
+--------------------+
only showing top 5 rows
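
For reference, a minimal sketch of that np.float64-to-float conversion, assuming tup_sum_data is the RDD of tuples shown at the top:

tup_sum_data = tup_sum_data.map(lambda r: [(name, float(value)) for (name, value) in r])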

So can anyone help me get a correctly formatted dataframe? Thank you!
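
The two TypeErrors come from Spark's schema inference: it does not recognize numpy scalar types such as np.float64, and it cannot infer a schema from the bare tuples nested inside each list. Also, tup_sum_data.map(row) calls the Row with each whole list as a single argument, which is why the dataframe above ends up with one column, l, holding the entire list. One way to get the desired result is to turn each row's list of (name, value) tuples into a dict and build a Row that defaults missing columns to 0.0. The code below sketches this for the five columns that appear in the sample data: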

from pyspark.sql.types import StructType, StructField, FloatType
from pyspark.sql import Row

# Explicit schema: one FloatType column per expected column name
data_frame_schema = StructType([
    StructField("n", FloatType()),
    StructField("s", FloatType()),
    StructField("t", FloatType()),
    StructField("v", FloatType()),
    StructField("vn", FloatType())
])

raw_list = [[('n', 12.012457082117459), ('s', 0.79112758892014912)], \
[('t', 3.6243409329763652),('vn', 3.6243409329763652),('n', 52.743253562212828),('v', 11.644347760553064)]]

raw_rdd = sc.parallelize(raw_list)

# With plain d.get(...), missing columns would come through as null;
# the 0.0 default turns them into zeros instead:
# dict_to_row = lambda d: Row(n=d.get("n"), s=d.get("s"), t=d.get("t"), v=d.get("v"), vn=d.get("vn"))
dict_to_row = lambda d: Row(n=d.get("n", 0.0), s=d.get("s", 0.0), t=d.get("t", 0.0), v=d.get("v", 0.0), vn=d.get("vn", 0.0))

# Each element of raw_rdd is a list of (name, value) tuples; dict() turns it into a lookup
row_rdd = raw_rdd.map(lambda l: dict_to_row(dict(l)))
df = spark.createDataFrame(row_rdd, data_frame_schema)
df.show()

Pasting the above into the pyspark shell yields this output:

+---------+----------+--------+---------+--------+
|        n|         s|       t|        v|      vn|
+---------+----------+--------+---------+--------+
|12.012457|0.79112756|     0.0|      0.0|     0.0|
| 52.74325|       0.0|3.624341|11.644348|3.624341|
+---------+----------+--------+---------+--------+
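
Note that FloatType is a 32-bit float, which is why the displayed values (e.g. 0.79112756) lose a little precision relative to the input; DoubleType keeps the full 64-bit values. The sketch above also hard-codes the five columns present in the sample. For the full column list from the question, the same idea can be written generically, a sketch assuming raw_rdd as defined above:

cols = ['l','eng','q','g','j','b','nt','z','n','d','f','i','k','s','vn','nz','v','nrt','tg','nrfg','t','ng','zg','a']
row = Row(*cols)
# Look up each column in the row's dict, defaulting to 0.0, and let toDF() infer the schema
df = raw_rdd.map(lambda l: row(*[float(dict(l).get(c, 0.0)) for c in cols])).toDF()
df.show()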
