
Tuple to data frame in Spark Scala

I have an array called arraylist which looks like this:

arraylist: Array[(String, Any)] = Array((id,772914), (x4,2), (x5,24), (x6,1), (x7,77491.25), (x8,17911.77778), (x9,225711), (x10,17), (x12,6), (x14,5), (x16,5), (x18,5.0), (x19,8.0), (x20,7959.0), (x21,676.0), (x22,228.5068871), (x23,195.0), (x24,109.6015511), (x25,965.0), (x26,1017.79043), (x27,2.0), (Target,1), (x29,13), (x30,735255.5), (x31,332998.432), (x32,38168.75), (x33,107957.5278), (x34,13), (x35,13), (x36,13), (x37,13), (x38,13), (x39,13), (x40,13), (x41,7), (x42,13), (x43,13), (x44,13), (x45,13), (x46,13), (x47,13), (x48,13), (x49,14.0), (x50,2.588435821), (x51,617127.5), (x52,414663.9738), (x53,39900.0), (x54,16743.15781), (x55,105000.0), (x56,52842.29076), (x57,25750.46154), (x58,8532.045819), (x64,13), (x66,13), (x67,13), (x68,13), (x69,13), (x70,13), (x71,13), (x73,13), (...

I want to convert it to a dataframe with two columns, "ID" and value. For this, the code I am using is:

val df = sc.parallelize(arraylist).toDF("Names","Values")

However, I am getting an error:

java.lang.UnsupportedOperationException: Schema for type Any is not supported

How can I overcome this problem?

The message tells you everything :) Any is not supported as a DataFrame column type. An Any type can be caused by a null as the second element of a tuple.
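As a minimal sketch of how this happens (hypothetical values, not the question's data), a single null in the second position is enough for Scala to widen the value type to Any:

// Int and Null share no common supertype other than Any,
// so the whole array is inferred as Array[(String, Any)]
val withNull = Array(("id", 772914), ("x4", null))
// sc.parallelize(withNull).toDF("Names", "Values")  // would fail with the same error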

Change the arraylist type to Array[(String, Int)] (if you can do it manually; if it is inferred by Scala, then check for nulls and invalid values in the second element — a sketch of this typed-array option follows the note below) or create the schema manually:

import org.apache.spark.sql.types._
import org.apache.spark.sql._

val arraylist: Array[(String, Any)] = Array(("id", 772914), ("x4", 2.0), ("x5", 24.0))

// Declare the schema explicitly, so Spark never has to infer a type for Any
val schema = StructType(
    StructField("Names", StringType, false) ::
    StructField("Values", DoubleType, false) :: Nil)

// Convert every value through Number to Double, so both Int and Double fit the schema
val rdd = sc.parallelize(arraylist).map(x => Row(x._1, x._2.asInstanceOf[Number].doubleValue()))

val df = sqlContext.createDataFrame(rdd, schema)

df.show()

Note: createDataFrame requires an RDD[Row], so I'm converting the RDD of tuples to an RDD of Row.
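For completeness, a sketch of the first option mentioned above (illustrative values): if the array can be declared with a uniform value type, Spark infers the schema and toDF works directly (here in spark-shell, where the SQL implicits are already imported):

// No Any in sight: the value type is Double everywhere
val typed: Array[(String, Double)] = Array(("id", 772914.0), ("x4", 2.0), ("x5", 24.0))
val df2 = sc.parallelize(typed).toDF("Names", "Values")
df2.show()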

The problem (as stated) is that Any is not a legal type for a dataframe. In general, the legal types are primitive types (byte, int, boolean, string, double, etc.), structs of legal types, arrays of legal types, and maps of legal types.
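As an illustration (hypothetical fields, not tied to the question's data), all of the following column types are legal:

import org.apache.spark.sql.types._

// Primitives plus nested structs, arrays and maps of legal types
val legal = StructType(Seq(
  StructField("name", StringType),
  StructField("scores", ArrayType(DoubleType)),            // array of a legal type
  StructField("attrs", MapType(StringType, IntegerType)),  // map of legal types
  StructField("address", StructType(Seq(                   // struct of legal types
    StructField("city", StringType),
    StructField("zip", IntegerType))))))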

In your case, it seems you used both integers and doubles in the second value of the tuple. If you instead use just doubles, it should work properly.

You can do this in two ways:

1. Make sure the original array contains only doubles (e.g. by adding .0 to the end of each integer when you create it, or by doing a cast).
2. Enforce the schema:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

// StructType is immutable: add() returns a new StructType rather than
// mutating in place, so the calls must be chained
val schema = new StructType()
  .add(StructField("names", StringType))
  .add(StructField("values", DoubleType))
val rdd = sc.parallelize(arraylist).map(x => Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
val df = spark.createDataFrame(rdd, schema)
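As a quick check (with the Spark 2.x spark session used above), the resulting schema should show one string and one double column:

df.printSchema()
// root
//  |-- names: string (nullable = true)
//  |-- values: double (nullable = true)
df.show(5)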
