How to create a DataFrame from an RDD in PySpark?
I have an RDD that looks like this:
[((0, Row(event_type_new=u'ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B109F|', day=u'Fri')), 0),
((1, Row(event_type_new=u'ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B1115|HIGH MOUNTED STOP LAMP CONTROL', day=u'Sat')), 2)]
Each element has an index, a Row object (event_type_new and day), followed by a prediction (an integer). How can I create a DataFrame with 3 columns: event_type_new, day, and Prediction?
I am using Spark 1.6.2 with the PySpark API.
Thanks!
Transform your list into an RDD first. Then map each element to a Row. You can then convert the RDD of Row objects to a DataFrame easily using the .toDF() method:
from pyspark.sql import Row
ls = [((0, Row(event_type_new=u'ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B109F|', day=u'Fri')), 0),
((1, Row(event_type_new=u'ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B1115|HIGH MOUNTED STOP LAMP CONTROL', day=u'Sat')), 2)]
ls_rdd = sc.parallelize(ls)
ls_row = ls_rdd.map(lambda x: Row(**{'day': str(x[0][1].day), 'event_type': str(x[0][1].event_type_new), 'prediction': int(x[1])}))
df = ls_row.toDF()
When you run df.show(), it will look like this:
+---+--------------------+----------+
|day|          event_type|prediction|
+---+--------------------+----------+
|Fri|ALERT|VEHICLE_HEA...|         0|
|Sat|ALERT|VEHICLE_HEA...|         2|
+---+--------------------+----------+
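If the data is already an RDD rather than a list, the same idea can be applied with a small named function instead of a lambda. The following is only a sketch: `flatten`, `COLUMNS`, and `FakeRow` are names made up for illustration, and `FakeRow` is a namedtuple stand-in for pyspark.sql.Row so that the function can be checked without a running Spark context. It returns plain tuples in the column order the question asks for, and the column names are passed to .toDF() separately.

```python
from collections import namedtuple

# Column order requested in the question (illustrative constant).
COLUMNS = ["event_type_new", "day", "prediction"]


def flatten(record):
    """Turn ((index, row), prediction) into (event_type_new, day, prediction)."""
    (_index, row), prediction = record
    return (row.event_type_new, row.day, int(prediction))


# Stand-in for pyspark.sql.Row; a real Row exposes the same attribute access.
FakeRow = namedtuple("FakeRow", ["event_type_new", "day"])

sample = [
    ((0, FakeRow(u"ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B109F|", u"Fri")), 0),
    ((1, FakeRow(u"ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B1115|HIGH MOUNTED STOP LAMP CONTROL", u"Sat")), 2),
]

flat = [flatten(r) for r in sample]

# With a real RDD of the same shape (assumed here to be called `rdd`):
#     df = rdd.map(flatten).toDF(COLUMNS)
```

Passing the names explicitly keeps the column order under your control. When a Row is built from keyword arguments, as in the answer above, the resulting column order depends on the Spark version (older releases sort the field names alphabetically, which is why day comes first in the output shown).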
I assume that this is a collected RDD, because it looks like you have a list of tuples, each combining a Row and an int. You can get your desired output with the following:
from pyspark.sql import Row
lst = [((0, Row(event_type_new=u'ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B109F|', day=u'Fri')), 0),
((1, Row(event_type_new=u'ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B1115|HIGH MOUNTED STOP LAMP CONTROL', day=u'Sat')), 2)]
output = []
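for item in lst:
    src = item[0][1]  # the Row inside the (index, Row) pair
    # Row values followed by the prediction
    vals = tuple(src) + (item[1],)
    # Row field names followed by the new column name
    fields = list(src.__fields__) + ['prediction']
    new_row = Row(*vals)
    new_row.__fields__ = fields  # relies on Row internals
    output.append(new_row)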
df = sc.parallelize(output).toDF()
df.show()
You should get something like the following:
+---+--------------------+----------+
|day|      event_type_new|prediction|
+---+--------------------+----------+
|Fri|ALERT|VEHICLE_HEA...|         0|
|Sat|ALERT|VEHICLE_HEA...|         2|
+---+--------------------+----------+
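Setting __fields__ by hand depends on how Row is implemented internally. A possible alternative, sketched here under the assumption that going through a dictionary is acceptable, is to copy the Row's fields with asDict(), add the prediction, and build a new Row from the result. The helper name `with_prediction` is made up for illustration, and the runnable part uses a plain dict so it can be checked without Spark.

```python
def with_prediction(fields, prediction):
    """Return a copy of a field dict with a 'prediction' entry added."""
    merged = dict(fields)  # copy, so the input dict is not modified
    merged["prediction"] = int(prediction)
    return merged


# Plain-dict example; with Spark the dict would come from row.asDict().
merged = with_prediction(
    {"event_type_new": u"ALERT|VEHICLE_HEALTH_DATA|CHANGE_IN_HEALTH|DTC|B109F|", "day": u"Fri"},
    0,
)

# With PySpark (assumed usage, with `lst` and `sc` as in the answer above):
#     rows = [Row(**with_prediction(r.asDict(), pred)) for (_, r), pred in lst]
#     df = sc.parallelize(rows).toDF()
```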
I hope this helps.