
Tuple to data frame in spark scala

I have an array called arraylist, which looks like this:

arraylist: Array[(String, Any)] = Array((id,772914), (x4,2), (x5,24), (x6,1), (x7,77491.25), (x8,17911.77778), (x9,225711), (x10,17), (x12,6), (x14,5), (x16,5), (x18,5.0), (x19,8.0), (x20,7959.0), (x21,676.0), (x22,228.5068871), (x23,195.0), (x24,109.6015511), (x25,965.0), (x26,1017.79043), (x27,2.0), (Target,1), (x29,13), (x30,735255.5), (x31,332998.432), (x32,38168.75), (x33,107957.5278), (x34,13), (x35,13), (x36,13), (x37,13), (x38,13), (x39,13), (x40,13), (x41,7), (x42,13), (x43,13), (x44,13), (x45,13), (x46,13), (x47,13), (x48,13), (x49,14.0), (x50,2.588435821), (x51,617127.5), (x52,414663.9738), (x53,39900.0), (x54,16743.15781), (x55,105000.0), (x56,52842.29076), (x57,25750.46154), (x58,8532.045819), (x64,13), (x66,13), (x67,13), (x68,13), (x69,13), (x70,13), (x71,13), (x73,13), (...

I want to convert it to a DataFrame with two columns, "ID" and "value". The code I am using for this is:

val df = sc.parallelize(arraylist).toDF("Names","Values")

However, I am getting an error:

java.lang.UnsupportedOperationException: Schema for type Any is not supported

How can I overcome this problem?

The message tells you everything :) Any is not supported as a DataFrame column type. The second element of the tuple can end up typed as Any when some of the values are null.

Change the type of arraylist to Array[(String, Int)] (if you declare it manually; if it is inferred by Scala, check the second elements for nulls and invalid values), or create the schema manually:

import org.apache.spark.sql.types._
import org.apache.spark.sql._

val arraylist: Array[(String, Any)] = Array(("id",772914), ("x4",2.0), ("x5",24.0));

val schema = StructType(
    StructField("Names", StringType, false) ::
    StructField("Values", DoubleType, false) :: Nil)
val rdd = sc.parallelize(arraylist).map(x => Row(x._1, x._2.asInstanceOf[Number].doubleValue()))

val df = sqlContext.createDataFrame(rdd, schema)

df.show()

Note: createDataFrame requires an RDD[Row], so I'm converting the RDD of tuples to an RDD of Rows.
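The advice above to check the second elements for nulls and invalid values can be done in plain Scala before Spark is involved. The following is only a sketch: `ValueCheck` and `splitNumeric` are hypothetical names that do not appear in the question, and treating "anything that is not a `java.lang.Number`" as invalid is an assumption.

```scala
object ValueCheck {
  // Hypothetical helper: separates entries whose value is numeric from the rest.
  // A type pattern such as `_: Number` never matches null, so null values
  // end up in the second array together with strings and other non-numbers.
  def splitNumeric(pairs: Array[(String, Any)]): (Array[(String, Any)], Array[(String, Any)]) =
    pairs.partition {
      case (_, _: Number) => true
      case _              => false
    }
}
```

Calling `val (numeric, suspect) = ValueCheck.splitNumeric(arraylist)` and inspecting `suspect` should show which names carry a null or non-numeric value, before the `asInstanceOf[Number]` cast above has a chance to fail on them.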

The problem (as stated) is that Any is not a legal type for a DataFrame column. In general, the legal types are primitive types (byte, int, boolean, string, double, etc.), structs of legal types, arrays of legal types, and maps of legal types.

In your case it seems that you used both integers and doubles as the second value of the tuple. If you use only doubles instead, it should work properly.

You can do this in two ways:

1. Make sure the original array contains only doubles (e.g. by adding .0 at the end of each integer when you create it), or cast the values.
2. Enforce the schema:

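import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// StructType is immutable: add returns a new StructType, so chain the calls
val schema = new StructType()
  .add(StructField("names", StringType))
  .add(StructField("values", DoubleType))
// Convert each tuple to a Row, widening every numeric value to Double
val rdd = sc.parallelize(arraylist).map(x => Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
val df = spark.createDataFrame(rdd, schema)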
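The first option, casting the values so that the array holds only doubles, could look roughly like this. It is a sketch, not the only way to do it: `ToDoubles` and `toDoubles` are hypothetical names, and failing fast on a non-numeric value is an assumption about what you want.

```scala
object ToDoubles {
  // Hypothetical helper: widens every numeric value to Double so the result
  // has the concrete type Array[(String, Double)] instead of Array[(String, Any)].
  def toDoubles(pairs: Array[(String, Any)]): Array[(String, Double)] =
    pairs.map {
      case (name, n: Number) => (name, n.doubleValue())
      case (name, other) =>
        // Reached for null or any non-numeric value
        throw new IllegalArgumentException(s"Value for '$name' is not numeric: $other")
    }
}
```

Because (String, Double) is a type Spark can derive a schema for, `sc.parallelize(ToDoubles.toDoubles(arraylist)).toDF("Names", "Values")` should then work once the implicits are in scope (`import spark.implicits._`, or `import sqlContext.implicits._` on Spark 1.x).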


 