
How to convert a Dataframe into a List (Scala)?

I want to convert a Dataframe containing Double values into a List, so that I can use it for calculations. What do you suggest, so that I end up with a List of the correct element type (i.e. Double)?

My approach is this:

var newList = myDataFrame.collect().toList 

but it returns a List[org.apache.spark.sql.Row], and I don't know what that type is exactly!

Is it possible to skip that step and simply pass my Dataframe into a function to do the calculations from it? (For example, I want to compare the third element of its second column with a specific double. Is that possible directly from my Dataframe?)

In any case, I have to understand how to create a List of the right type each time!

EDIT:

Input Dataframe:

+---+---+ 
|_c1|_c2|
+---+---+ 
|0  |0  | 
|8  |2  | 
|9  |1  | 
|2  |9  | 
|2  |4  | 
|4  |6  | 
|3  |5  | 
|5  |3  | 
|5  |9  | 
|0  |1  | 
|8  |9  | 
|1  |0  | 
|3  |4  |
|8  |7  | 
|4  |9  | 
|2  |5  | 
|1  |9  | 
|3  |6  |
+---+---+

Result after conversion:

List((0,0), (8,2), (9,1), (2,9), (2,4), (4,6), (3,5), (5,3), (5,9), (0,1), (8,9), (1,0), (3,4), (8,7), (4,9), (2,5), (1,9), (3,6))

But every element in the List has to be of type Double.

You can cast the columns you need to Double, convert the Dataframe to an RDD, and collect it.

If you have data that cannot be parsed, you can use a udf to clean it before casting to Double:

import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.DoubleType
import scala.util.{Try, Success, Failure}

val stringToDouble = udf((data: String) => {
  Try(data.toDouble) match {
    case Success(value)     => value
    case Failure(exception) => Double.NaN
  }
})

val df = Seq(
  ("0.000", "0"),
  ("0.000008", "24"),
  ("9.00000", "1"),
  ("-2", "xyz"),
  ("2adsfas", "1.1.1")
).toDF("a", "b")
  .withColumn("a", stringToDouble($"a").cast(DoubleType))
  .withColumn("b", stringToDouble($"b").cast(DoubleType))

After this, you will get the following output:

+------+----+
|a     |b   |
+------+----+
|0.0   |0.0 |
|8.0E-6|24.0|
|9.0   |1.0 |
|-2.0  |NaN |
|NaN   |NaN |
+------+----+

To get an Array[(Double, Double)]:

val result = df.rdd.map(row => (row.getDouble(0), row.getDouble(1))).collect()

The result will be of type Array[(Double, Double)].
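The Try-based fallback used in the udf above can be exercised as plain Scala, without a Spark session; a minimal sketch of the same parsing logic (the helper name is hypothetical):

```scala
import scala.util.{Try, Success, Failure}

// Same behaviour as the udf body: parse to Double, fall back to NaN.
def toDoubleOrNaN(data: String): Double =
  Try(data.toDouble) match {
    case Success(value) => value
    case Failure(_)     => Double.NaN
  }

toDoubleOrNaN("9.00000")   // 9.0
toDoubleOrNaN("0.000008")  // 8.0E-6
toDoubleOrNaN("xyz")       // NaN
```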

Convert the DataFrame to a Dataset using a case class, then convert that to a list.

This returns a list of objects of your class. All the fields of the case class (mapped to the columns of your table) are already typed, so you won't need to cast every time.

A sample to check and verify (Scala):

val wa = Array("one", "two", "two")
val wr = sc.parallelize(wa, 3).map(x => (x, "x", 1))
val wdf = wr.toDF("a", "b", "c")
case class wc(a: String, b: String, c: Int)
val wds = wdf.as[wc]              // typed Dataset[wc]
val myList = wds.collect.toList   // List[wc]
myList.foreach(x => println(x))
myList.foreach(x => println(x.a.getClass, x.b.getClass, x.c.getClass))

Alternatively, select the columns and extract typed values from each Row with getAs:

myDataFrame.select("_c1", "_c2").collect().map(each => (each.getAs[Double]("_c1"), each.getAs[Double]("_c2"))).toList
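Once the data is a List[(Double, Double)], the calculations from the question are ordinary Scala collection operations; a sketch using the sample values above (variable names are hypothetical):

```scala
// The question's sample data, assumed already converted to Doubles.
val data: List[(Double, Double)] = List(
  (0, 0), (8, 2), (9, 1), (2, 9), (2, 4), (4, 6), (3, 5), (5, 3), (5, 9),
  (0, 1), (8, 9), (1, 0), (3, 4), (8, 7), (4, 9), (2, 5), (1, 9), (3, 6)
).map { case (a, b) => (a.toDouble, b.toDouble) }

// Third element of the second column, compared with a specific double.
val thirdOfC2 = data(2)._2        // 1.0
val isLarger  = thirdOfC2 > 0.5   // true

// Any other aggregation works the same way.
val sumC1 = data.map(_._1).sum    // 68.0
```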
