How to convert a file containing arrays of doubles to a DataFrame in Spark?
I am new to Scala as well as Apache Spark. My text file contains entries like:
[-0.9704851405656525,1.0286638765434661]
[-0.9704851405656525,1.0286638765434661]
[-1.0353873234576638,-0.001849782262230898]
[-0.9704851405656525,1.0286638765434661]
[-0.9704851405656525,1.0286638765434661]
....
I want to create a DataFrame from this so that I can run SQL queries on it. My code looks something like this:
def processr(str:String) = str.replaceAll("\\[", "").replaceAll("\\]","")
case class Result(a:Double, b:Double)
val filemat = sc.textFile("mat.txt")
val result = filemat.map(s => s.split(',').map(r=>Result(processr(r[0]).toDouble, processr(r[1]).toDouble)).toDF.cache
And I get an error like:
<console>:1: error: identifier expected but integer literal found.
val result = filemat.map(s => s.split(',').map(r=>Result(processr(r[0]).toDouble, processr(r[1]).toDouble)).toDF.cache
I am not sure what mistake I am making in my code. It seems my split logic is not correct. Could anyone suggest a way to convert this into a DataFrame? Thanks in advance.
You should use round brackets, not square ones. Extracting an element from an array in Scala is simply an apply method call:
scala> val r = "[-0.9704851405656525,1.0286638765434661]".split(",")
r: Array[String] = Array([-0.9704851405656525, 1.0286638765434661])
scala> r.apply(0)
res4: String = [-0.9704851405656525
and with some syntactic sugar:
scala> r(0)
res5: String = [-0.9704851405656525
Next, your map looks wrong. When you call s.split you get an Array[String], so r is actually a String, and r(0) gives you either the minus sign or the first digit. You probably want something like this:
filemat.map(_.split(',') match {
  case Array(s1, s2) => Result(processr(s1).toDouble, processr(s2).toDouble)
})
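As a quick sanity check outside Spark, the same per-line logic can be exercised on one sample entry (a sketch, using processr and Result as defined in the question):

```scala
// Sketch: apply the split-and-match parsing to a single sample line,
// without Spark, using the question's processr and Result definitions.
def processr(str: String): String = str.replaceAll("\\[", "").replaceAll("\\]", "")
case class Result(a: Double, b: Double)

val line = "[-0.9704851405656525,1.0286638765434661]"
val parsed = line.split(',') match {
  case Array(s1, s2) => Result(processr(s1).toDouble, processr(s2).toDouble)
}
// parsed.a is -0.9704851405656525, parsed.b is 1.0286638765434661
```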
It can be simplified by using pattern matching with a regex:
val p = "^\\[(-?[0-9]+\\.[0-9]+),(-?[0-9]+\\.[0-9]+)\\]$".r
filemat.map {
  case p(s1, s2) => Result(s1.toDouble, s2.toDouble)
}
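Because p is a regex with two capture groups, it can be used directly as an extractor pattern. A small sketch of the extraction on one sample line:

```scala
// Sketch: the regex with two capture groups acts as an extractor,
// binding each captured group to a variable.
val p = "^\\[(-?[0-9]+\\.[0-9]+),(-?[0-9]+\\.[0-9]+)\\]$".r
val p(a, b) = "[-1.0353873234576638,-0.001849782262230898]"
// a is the string "-1.0353873234576638", b is "-0.001849782262230898"
```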
or by using the Row.fromSeq method:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, DoubleType}

val schema = StructType(Seq(
  StructField("a", DoubleType, false),
  StructField("b", DoubleType, false)))

val p1 = "(-?[0-9]+\\.[0-9]+)".r

sqlContext.createDataFrame(filemat.map(s =>
  Row.fromSeq(p1.findAllMatchIn(s).map(_.matched.toDouble).toSeq)),
  schema)
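The findAllMatchIn part can be checked on its own, without Spark. A sketch of extracting the doubles from one sample line:

```scala
// Sketch: findAllMatchIn returns an Iterator[Regex.Match]; .matched gives
// the matched substring, which we convert to Double.
val p1 = "(-?[0-9]+\\.[0-9]+)".r
val values = p1.findAllMatchIn("[-0.9704851405656525,1.0286638765434661]")
  .map(_.matched.toDouble)
  .toSeq
// values contains the two doubles in order
```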