
pyspark : Convert DataFrame to RDD[string]

I'd like to convert a pyspark.sql.dataframe.DataFrame to a pyspark.rdd.RDD[String].

I converted a DataFrame df to an RDD data:

data = df.rdd
type (data)
## pyspark.rdd.RDD 

The new RDD data contains Row objects:

first = data.first()
type(first)
## pyspark.sql.types.Row

data.first()
Row(_c0=u'aaa', _c1=u'bbb', _c2=u'ccc', _c3=u'ddd')

I'd like to convert each Row to a list of strings, like the example below:

u'aaa',u'bbb',u'ccc',u'ddd'

Thanks

PySpark Row is just a tuple and can be used as such. All you need here is a simple map (or flatMap if you want to flatten the rows as well) with list:

data.map(list)

or if you expect different types:

data.map(lambda row: [str(c) for c in row])

The accepted answer is old. With Spark 2.0, you must now explicitly state that you're converting to an RDD by adding .rdd to the statement. Therefore, the equivalent of this statement in Spark 1.0:

data.map(list)

Should now be:

data.rdd.map(list)

in Spark 2.0. Related to the accepted answer in this post.
