
How do I convert Array[Row] to RDD[Row]?

I have a scenario where I want to convert the result of a DataFrame, which is in the format Array[Row], to an RDD[Row]. I have tried parallelize, but I don't want to use it because it requires the entire data set to sit on a single machine, which is not feasible on a production box.

val Bid = spark.sql("select Distinct DeviceId, ButtonName  from stb").collect()
val bidrdd = sparkContext.parallelize(Bid)

How do I achieve this? I tried the approach given in the question "How to convert DataFrame to RDD in Scala?", but it didn't work for me.

val bidrdd1 = Bid.map(x => (x(0).toString, x(1).toString)).rdd

It gives the error: value rdd is not a member of Array[(String, String)]

The variable Bid that you've created here is not a DataFrame; it is an Array[Row], which is why you can't call .rdd on it. If you want an RDD[Row], simply call .rdd on the DataFrame (without calling collect):

val rdd = spark.sql("select Distinct DeviceId, ButtonName  from stb").rdd
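
To make this concrete, here is a minimal, self-contained sketch of the same idea, including the mapping to (String, String) pairs that the question attempted. The local SparkSession, the object and method names, and the sample rows registered as the stb view are assumptions added for illustration; they are not part of the original post.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object DataFrameToRdd {
  // Assumes a table or temp view named `stb` with string columns
  // DeviceId and ButtonName is already registered in `spark`.
  def distinctPairs(spark: SparkSession): RDD[(String, String)] = {
    // No collect(): the rows stay distributed, and .rdd yields an RDD[Row].
    val rowRdd: RDD[Row] =
      spark.sql("select distinct DeviceId, ButtonName from stb").rdd

    // map is applied to the RDD itself, so the result is already an RDD
    // and no further .rdd call is needed.
    rowRdd.map(row => (row.getString(0), row.getString(1)))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .appName("DataFrameToRdd")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the real `stb` table.
    Seq(("dev1", "OK"), ("dev1", "OK"), ("dev2", "Menu"))
      .toDF("DeviceId", "ButtonName")
      .createOrReplaceTempView("stb")

    distinctPairs(spark).collect().foreach(println)
    spark.stop()
  }
}
```

The collect() in main is there only to print a tiny sample result; in a real job the RDD would normally be processed further or written out without being brought back to the driver.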

Your post contains some misconceptions worth noting:

... a dataframe which is in the format Array[Row] ...

Not quite. The Array[Row] is the result of collecting the DataFrame's data into the driver's memory; it is not a DataFrame.

... I don't want to use it as it needs to contain entire data in a single system ...

Note that as soon as you call collect on the DataFrame, you have already pulled the entire data set into a single JVM's memory (the driver's). So parallelize is not the real issue; collect is.
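
The difference between the two paths can be sketched side by side. This is only an illustration under assumptions: the object and method names (CollectVsRdd, viaDriver, direct), the local session, and the sample rows are hypothetical and not from the original post.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object CollectVsRdd {
  // Round trip through the driver: every row is first materialised in one JVM.
  def viaDriver(spark: SparkSession, df: DataFrame): RDD[Row] = {
    val rows: Array[Row] = df.collect()   // the whole result now sits in driver memory
    spark.sparkContext.parallelize(rows)  // and is then shipped back out as an RDD
  }

  // Direct conversion: nothing is pulled to the driver.
  def direct(df: DataFrame): RDD[Row] = df.rdd

  def main(args: Array[String]): Unit = {
    // Assumption: a local session and made-up rows, for illustration only.
    val spark = SparkSession.builder()
      .appName("CollectVsRdd")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("dev1", "OK"), ("dev2", "Menu")).toDF("DeviceId", "ButtonName")

    // Both RDDs hold the same rows; only the path the data took differs.
    println(viaDriver(spark, df).count() == direct(df).count())
    spark.stop()
  }
}
```

Both methods return an RDD[Row] with the same contents, but viaDriver is limited by the driver's memory while direct is not, which is why dropping collect is the fix rather than avoiding parallelize.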
