
Scala DataFrame to RDD[Array[String]]

How do I convert a DataFrame with multiple columns? I can get an RDD[org.apache.spark.sql.Row], but I need something I can use with org.apache.spark.mllib.fpm.FPGrowth, i.e. an RDD[Array[String]]. How do I convert it?

df.head
org.apache.spark.sql.Row = [blabla,128323,23843,11.23,blabla,null,null,..]

df.printSchema
root
 |-- source: string (nullable = true)
 |-- b1: string (nullable = true)
 |-- b2: string (nullable = true)
 |-- b3: long (nullable = true)
 |-- amount: decimal(30,2) (nullable = true)
and so on

Thanks

The question is vague, but in general you can convert an RDD of Row into an RDD of Array by going through a Seq. The following code takes all columns from each Row, converts each value to a String, and returns them as an array.

df.first
res1: org.apache.spark.sql.Row = [blah1,blah2]
df.map { _.toSeq.map {_.toString}.toArray }.first
res2: Array[String] = Array(blah1, blah2)
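
One caveat: the sample row in the question contains nulls, and calling toString on a null throws a NullPointerException. Here is a minimal null-safe sketch, assuming you would rather drop null cells than keep them as empty strings (the name transactions is mine; going through df.rdd also keeps this working on newer Spark versions, where map on a DataFrame returns a Dataset rather than an RDD):

import org.apache.spark.rdd.RDD

// row.toSeq yields the column values of each Row in order;
// Option(v) turns nulls into None, so flatMap silently drops them.
val transactions: RDD[Array[String]] =
  df.rdd.map { row =>
    row.toSeq
      .flatMap(v => Option(v).map(_.toString))
      .toArray
  }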

This may not be enough to get it working with MLlib the way you want, since you didn't give much detail, but it's a start.
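
With an RDD[Array[String]] in hand, wiring it into FPGrowth looks roughly like the sketch below. The minSupport and numPartitions values are placeholders, not recommendations for your data; also note that FPGrowth requires the items within a single transaction to be unique, hence the distinct.

import org.apache.spark.mllib.fpm.FPGrowth

// FPGrowth rejects transactions containing duplicate items,
// so de-duplicate each array first.
val uniqueTransactions = transactions.map(_.distinct)

val fpg = new FPGrowth()
  .setMinSupport(0.2)   // placeholder support threshold
  .setNumPartitions(10) // placeholder partition count

val model = fpg.run(uniqueTransactions)

// Print each frequent itemset with its frequency.
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + " -> " + itemset.freq)
}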
