
How to create a DataFrame from an RDD in PySpark?

I have an RDD that looks like this:

[((0, Row(event_type_new=u'ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B109F|', day=u'Fri')), 0), 
 ((1, Row(event_type_new=u'ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B1115|HIGH MOUNTED STOP LAMP CONTROL', day=u'Sat')), 2)]

Each element has an index, a Row object (with fields event_type_new and day), followed by a prediction (an integer). How can I create a DataFrame with 3 columns: event_type_new, day, and Prediction?

I am using Spark 1.6.2 with the PySpark API.

Thanks!

Transform your list into an RDD first. Then map each element to a Row. You can easily convert an RDD of Row objects to a DataFrame with the .toDF() method.

from pyspark.sql import Row

ls = [((0, Row(event_type_new=u'ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B109F|', day=u'Fri')), 0),
      ((1, Row(event_type_new=u'ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B1115|HIGH MOUNTED STOP LAMP CONTROL', day=u'Sat')), 2)]
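# Assumes an existing SparkContext named sc (as in the pyspark shell)
ls_rdd = sc.parallelize(ls)
# Each element is ((index, Row), prediction); pull out the fields we need
ls_row = ls_rdd.map(lambda x: Row(**{'day': str(x[0][1].day), 'event_type': str(x[0][1].event_type_new), 'prediction': int(x[1])}))
df = ls_row.toDF()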

When you run df.show(), it will look like this:

+---+--------------------+----------+
|day|          event_type|prediction|
+---+--------------------+----------+
|Fri|ALERT|VEHICLE_HEA...|         0|
|Sat|ALERT|VEHICLE_HEA...|         2|
+---+--------------------+----------+
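
A Row built from keyword arguments sorts its field names alphabetically in Spark versions before 3.0, which is why day comes first in the output above. If the column order matters, one option is to map each record to a plain tuple and pass the column names to .toDF(). The following is only a minimal sketch, not part of the original answer: flatten, to_dataframe, COLUMNS, and the FakeRow stand-in are hypothetical names, and to_dataframe assumes a live SparkContext sc with an SQLContext already created (as in the pyspark shell).

```python
from collections import namedtuple

# Desired column order for the resulting DataFrame
COLUMNS = ['event_type_new', 'day', 'prediction']


def flatten(record):
    """Turn ((index, row), prediction) into a plain 3-tuple in COLUMNS order."""
    (_index, row), prediction = record
    return (row.event_type_new, row.day, int(prediction))


def to_dataframe(sc, records):
    """Hypothetical helper: requires a running SparkContext `sc`.

    Passing a list of names to toDF() labels the tuple positions,
    so the columns keep the order given in COLUMNS.
    """
    return sc.parallelize(records).map(flatten).toDF(COLUMNS)


# Stand-in for pyspark.sql.Row (an assumption for illustration only),
# so the mapping step can be checked without starting Spark.
FakeRow = namedtuple('FakeRow', ['event_type_new', 'day'])
sample = ((0, FakeRow(event_type_new=u'ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B109F|',
                      day=u'Fri')), 0)
flat = flatten(sample)
```

With Spark running, to_dataframe(sc, ls).show() should then list event_type_new, day, and prediction in that order.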

I assume that this is a collected RDD, because it looks like you have a list of tuples that combine Row and int objects. You can get your desired output with the following:

from pyspark.sql import Row


lst = [((0, Row(event_type_new=u'ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B109F|', day=u'Fri')), 0),
       ((1, Row(event_type_new=u'ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B1115|HIGH MOUNTED STOP LAMP CONTROL', day=u'Sat')), 2)]

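output = []
for record in lst:
    src_row = record[0][1]
    # Values come out in the same order as src_row.__fields__
    vals = tuple(src_row) + (record[1],)
    fields = src_row.__fields__ + ['prediction']
    # Row(*vals) carries no field names, so attach them by hand
    new_row = Row(*vals)
    new_row.__fields__ = fields
    output.append(new_row)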

df = sc.parallelize(output).toDF()
df.show()

You should get something like the following:

+---+--------------------+----------+
|day|      event_type_new|prediction|
+---+--------------------+----------+
|Fri|ALERT|VEHICLE_HEA...|         0|
|Sat|ALERT|VEHICLE_HEA...|         2|
+---+--------------------+----------+

I hope this helps.
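
If you would rather not assign to the private __fields__ attribute, a similar result can probably be reached through Row.asDict(), which returns the fields as a plain dict. This is a rough sketch under assumptions, not the answerer's code: with_prediction, to_dataframe, and the StubRow stand-in are hypothetical names, and to_dataframe needs a live SparkContext sc with an SQLContext already created.

```python
def with_prediction(record):
    """Merge the Row's fields and the prediction into one dict."""
    (_index, row), prediction = record
    fields = row.asDict()               # copy of the Row's fields as a dict
    fields['prediction'] = int(prediction)
    return fields


def to_dataframe(sc, records):
    """Hypothetical helper: requires a running SparkContext `sc`."""
    from pyspark.sql import Row         # imported lazily; only needed with Spark
    return (sc.parallelize(records)
              .map(lambda r: Row(**with_prediction(r)))
              .toDF())


class StubRow(object):
    """Minimal stand-in exposing asDict(), as pyspark.sql.Row does.

    Used only so the merge step can be checked without Spark.
    """
    def __init__(self, **kwargs):
        self._data = dict(kwargs)

    def asDict(self):
        return dict(self._data)


sample = ((1, StubRow(event_type_new=u'ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B109F|',
                      day=u'Fri')), 2)
merged = with_prediction(sample)
```

This version keeps whatever field names the original Row has, so it does not need to know them in advance.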


 