
pyspark : Convert DataFrame to RDD[string]

I'd like to convert pyspark.sql.dataframe.DataFrame to pyspark.rdd.RDD[String]

I converted a DataFrame df to an RDD data:

data = df.rdd
type(data)
## pyspark.rdd.RDD

The new RDD data contains Row objects:

first = data.first()
type(first)
## pyspark.sql.types.Row

data.first()
Row(_c0=u'aaa', _c1=u'bbb', _c2=u'ccc', _c3=u'ddd')

I'd like to convert each Row to a list of strings, like the example below:

u'aaa',u'bbb',u'ccc',u'ddd'

Thanks

A PySpark Row is just a tuple and can be used as such. All you need here is a simple map (or flatMap if you want to flatten the rows as well) with list:

data.map(list)

or if you expect different types:

data.map(lambda row: [str(c) for c in row])
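Since Row subclasses Python's tuple, the per-row behavior of both variants can be sketched locally without a Spark cluster (a plain tuple stands in for a Row here; the values echo the question):

```python
# pyspark.sql.Row subclasses tuple, so a plain tuple stands in for one Row
# in this local sketch (no Spark cluster needed).
row = (u'aaa', u'bbb', u'ccc', u'ddd')

# data.map(list) applies list() to each Row:
as_list = list(row)

# data.map(lambda row: [str(c) for c in row]) coerces every field to str:
as_strings = [str(c) for c in row]

print(as_list)  # ['aaa', 'bbb', 'ccc', 'ddd']
```

On the RDD, each of these functions is applied to every Row, yielding an RDD of lists.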

The accepted answer is old. As of Spark 2.0, you must explicitly state that you're converting to an RDD by adding .rdd to the statement. Therefore, the Spark 1.x equivalent of this statement:

data.map(list)

should now be:

data.rdd.map(list)

in Spark 2.0. This relates to the accepted answer in this post.
