pyspark: Convert DataFrame to RDD[String]
I'd like to convert a pyspark.sql.dataframe.DataFrame to a pyspark.rdd.RDD[String].

I converted a DataFrame df to an RDD data:
data = df.rdd
type(data)
## pyspark.rdd.RDD
The new RDD data contains Row objects:
first = data.first()
type(first)
## pyspark.sql.types.Row
data.first()
Row(_c0=u'aaa', _c1=u'bbb', _c2=u'ccc', _c3=u'ddd')
I'd like to convert each Row to a list of Strings, like the example below:

u'aaa', u'bbb', u'ccc', u'ddd'

Thanks
A PySpark Row is just a tuple and can be used as such. All you need here is a simple map (or flatMap if you want to flatten the rows as well) with list:
data.map(list)
or if you expect different types:
data.map(lambda row: [str(c) for c in row])
The accepted answer is old. With Spark 2.0, you must now explicitly state that you're converting to an RDD by adding .rdd to the statement. Therefore, the equivalent of this statement in Spark 1.0:
data.map(list)
should now be:
data.rdd.map(list)
in Spark 2.0. Related to the accepted answer in this post.
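Note that data.rdd.map(list) yields an RDD of lists, not a true RDD[String]. If the goal is one string per row, joining the fields with a delimiter is one option. A minimal sketch of the per-row function that data.rdd.map(row_to_string) would apply, demonstrated on plain tuples since that is how Rows behave inside map (the comma delimiter is an assumption):

```python
# Per-row function that data.rdd.map(row_to_string) would apply;
# a Row is a tuple, so joining its fields yields one string per row.
def row_to_string(row, sep=u','):
    return sep.join(str(c) for c in row)

# Simulate the map on plain tuples (Rows behave like tuples):
rows = [(u'aaa', u'bbb', u'ccc', u'ddd'), (u'x', u'y', u'z', u'w')]
print([row_to_string(r) for r in rows])  # ['aaa,bbb,ccc,ddd', 'x,y,z,w']
```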