
How to convert a file with arrays of doubles to a DataFrame in Spark?

I am new to Scala as well as Apache Spark. My text file contains entries like:

[-0.9704851405656525,1.0286638765434661]
[-0.9704851405656525,1.0286638765434661]
[-1.0353873234576638,-0.001849782262230898]
[-0.9704851405656525,1.0286638765434661]
[-0.9704851405656525,1.0286638765434661]
....

I want to create DataFrames from this so that I can use SQL queries. My code looks something like this:

def processr(str:String) = str.replaceAll("\\[", "").replaceAll("\\]","")
case class Result(a:Double, b:Double)
val filemat = sc.textFile("mat.txt")
val result = filemat.map(s => s.split(',').map(r=>Result(processr(r[0]).toDouble, processr(r[1]).toDouble)).toDF.cache

And I get an error like:

<console>:1: error: identifier expected but integer literal found.
       val result = filemat.map(s => s.split(',').map(r=>Result(processr(r[0]).toDouble, processr(r[1]).toDouble)).toDF.cache

I am not sure what mistake I am making in my code. It seems my split method is not correct. Could anyone suggest a way to convert this into a DataFrame? Thanks in advance.

You should use round brackets, not square ones. Extracting an element from an array in Scala is simply an apply method call:

scala> val r = "[-0.9704851405656525,1.0286638765434661]".split(",")
r: Array[String] = Array([-0.9704851405656525, 1.0286638765434661])

scala> r.apply(0)
res4: String = [-0.9704851405656525

and with some syntactic sugar:

scala> r(0)
res5: String = [-0.9704851405656525

Next, your map looks wrong. When you call s.split you get an Array[String], so r is actually a String, and r(0) gives you a single character: either - or the first digit. You probably want something like this:

filemat.map(_.split(',') match {
  case Array(s1, s2) => Result(processr(s1).toDouble, processr(s2).toDouble)
})
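For completeness, here is a minimal end-to-end sketch of that approach (assuming Spark 1.x with a sqlContext in scope, as in the original code; the implicits import is what enables the toDF conversion):

import sqlContext.implicits._

def processr(str: String) = str.replaceAll("\\[", "").replaceAll("\\]", "")
case class Result(a: Double, b: Double)

// Parse each line into a Result and convert the RDD to a DataFrame.
val df = sc.textFile("mat.txt")
  .map(_.split(',') match {
    case Array(s1, s2) => Result(processr(s1).toDouble, processr(s2).toDouble)
  })
  .toDF()
  .cache()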

This can be simplified either by using pattern matching with a regex:

val p =  "^\\[(-?[0-9]+\\.[0-9]+),(-?[0-9]+\\.[0-9]+)\\]$".r

filemat.map{
   case p(s1, s2) => Result(s1.toDouble, s2.toDouble)
}
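Note that any line that does not match the pattern will throw a scala.MatchError at runtime. If malformed lines are possible, one option (a sketch, not part of the original answer) is RDD.collect with a partial function, which silently skips non-matching lines:

// collect(PartialFunction) keeps only lines where the regex matches.
val results = filemat.collect {
  case p(s1, s2) => Result(s1.toDouble, s2.toDouble)
}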

or by using the Row.fromSeq method:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, DoubleType}

val schema = StructType(Seq(
  StructField("a", DoubleType, false),
  StructField("b", DoubleType, false)))

val p1 = "(-?[0-9]+\\.[0-9]+)".r

sqlContext.createDataFrame(filemat.map(s => 
   Row.fromSeq(p1.findAllMatchIn(s).map(_.matched.toDouble).toSeq)), 
   schema)
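Since the stated goal was to run SQL queries, here is a short usage sketch on top of either DataFrame (the table name results is just an example, and registerTempTable is the Spark 1.x API):

df.registerTempTable("results")
sqlContext.sql("SELECT a, b FROM results WHERE a < 0").show()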
