
How to convert a file with arrays of doubles to a DataFrame in Spark?

I am new to Scala as well as Apache Spark. My text file contains entries like:

[-0.9704851405656525,1.0286638765434661]
[-0.9704851405656525,1.0286638765434661]
[-1.0353873234576638,-0.001849782262230898]
[-0.9704851405656525,1.0286638765434661]
[-0.9704851405656525,1.0286638765434661]
....

I want to create DataFrames from this so that I can use SQL queries. My code looks something like this:

def processr(str:String) = str.replaceAll("\\[", "").replaceAll("\\]","")
case class Result(a:Double, b:Double)
val filemat = sc.textFile("mat.txt")
val result = filemat.map(s => s.split(',').map(r=>Result(processr(r[0]).toDouble, processr(r[1]).toDouble)).toDF.cache

And I get an error like:

<console>:1: error: identifier expected but integer literal found.
       val result = filemat.map(s => s.split(',').map(r=>Result(processr(r[0]).toDouble, processr(r[1]).toDouble)).toDF.cache

I am not sure what mistake I am making in my code. It seems my split method is not correct. Could anyone suggest a way to convert this into a DataFrame? Thanks in advance.

You should use round brackets, not square ones. Extracting an element from an array in Scala is simply an apply method call:

scala> val r = "[-0.9704851405656525,1.0286638765434661]".split(",")
r: Array[String] = Array([-0.9704851405656525, 1.0286638765434661])

scala> r.apply(0)
res4: String = [-0.9704851405656525

and with some syntactic sugar:

scala> r(0)
res5: String = [-0.9704851405656525

Next, your map looks wrong. When you call s.split you get an Array[String], so r is actually a String, and r(0) gives you a single character: either - or the first digit. You probably want something like this:

filemat.map(_.split(',') match {
  case Array(s1, s2) => Result(processr(s1).toDouble, processr(s2).toDouble)
})
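For completeness, here is a minimal end-to-end sketch of that approach (assuming Spark 1.x with a sqlContext in scope, as in the original code; the implicits import is what enables the toDF conversion):

import sqlContext.implicits._

def processr(str: String) = str.replaceAll("\\[", "").replaceAll("\\]", "")
case class Result(a: Double, b: Double)

// Parse each line into a Result and convert the RDD to a DataFrame.
val df = sc.textFile("mat.txt")
  .map(_.split(',') match {
    case Array(s1, s2) => Result(processr(s1).toDouble, processr(s2).toDouble)
  })
  .toDF()
  .cache()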

This can be simplified either by using pattern matching with a regex:

val p =  "^\\[(-?[0-9]+\\.[0-9]+),(-?[0-9]+\\.[0-9]+)\\]$".r

filemat.map{
   case p(s1, s2) => Result(s1.toDouble, s2.toDouble)
}
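Note that any line that does not match the pattern will throw a scala.MatchError at runtime. If malformed lines are possible, one option (a sketch, not part of the original answer) is RDD.collect with a partial function, which silently skips non-matching lines:

// collect(PartialFunction) keeps only lines where the regex matches.
val results = filemat.collect {
  case p(s1, s2) => Result(s1.toDouble, s2.toDouble)
}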

or by using the Row.fromSeq method:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, DoubleType}

val schema = StructType(Seq(
  StructField("a", DoubleType, false),
  StructField("b", DoubleType, false)))

val p1 = "(-?[0-9]+\\.[0-9]+)".r

sqlContext.createDataFrame(filemat.map(s => 
   Row.fromSeq(p1.findAllMatchIn(s).map(_.matched.toDouble).toSeq)), 
   schema)
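Since the stated goal was to run SQL queries, here is a short usage sketch on top of either DataFrame (the table name results is just an example, and registerTempTable is the Spark 1.x API):

df.registerTempTable("results")
sqlContext.sql("SELECT a, b FROM results WHERE a < 0").show()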
