Spark Scala: reading a text file with map and filter
I have a text file with the following format (id,f1,f2,f3,...,fn):
12345,0,0,1,2,...,3
23456,0,0,1,2,...,0
33333,0,1,1,0,...,0
56789,1,0,0,0,...,4
a_123,0,0,0,6,...,3
I want to read the file (ignoring lines like a_123,0,0,0,6,...,3) to create an RDD[(Long, Vector)]. Here's my solution:
def readDataset(path: String, sparkSession: SparkSession): RDD[(ItemId, Vector)] = {
  val sc = sparkSession.sparkContext
  sc.textFile(path)
    .map { line =>
      val values = line.split(",")
      (
        values(0).toLong,
        //util.Try(values(0).toLong).getOrElse(0L),
        Vectors.dense(values.slice(1, values.length).map { x => x.toDouble }).toSparse
      )
    }
    .filter(x => x._1 > 0)
}
However, this code does not compile:
[ERROR] found : org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.SparseVector)]
[ERROR] required: org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] (which expands to) org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] Note: (Long, org.apache.spark.ml.linalg.SparseVector) <: (Long, org.apache.spark.ml.linalg.Vector), but class RDD is invariant in type T.
[ERROR] You may wish to define T as +T instead. (SLS 4.5)
[ERROR] .filter(x => x._1 > 0)
[ERROR] ^
[ERROR] one error found
But if I remove either .toSparse or .filter(x => x._1 > 0), the code compiles successfully.
Does anyone know why this happens, and what should I do to fix it?
Also, is there a better way to read the file into an RDD while ignoring the lines with non-numeric ids?
The code compiles successfully if you remove toSparse, because then the type of your PairRDD is (ItemId, Vector).
The org.apache.spark.ml.linalg.Vector class/type represents the dense vector that you are generating with Vectors.dense, and when you call toSparse it is converted to org.apache.spark.ml.linalg.SparseVector, which is not the type your PairRDD expects. Because RDD is invariant in its element type (as the error message says), RDD[(Long, SparseVector)] does not conform to RDD[(Long, Vector)]. You can fix this by up-casting the element back to Vector, for example by annotating the expression as Vectors.dense(...).toSparse: Vector, so that the inferred element type of the RDD is (Long, Vector).
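The failure has nothing to do with Spark specifically; it is plain Scala invariance. A minimal pure-Scala sketch, using hypothetical Vec/SparseVec types standing in for Spark's Vector/SparseVector and an invariant Box standing in for RDD, reproduces the same error and the same fix:

```scala
// Stand-ins for org.apache.spark.ml.linalg.Vector and SparseVector.
trait Vec { def size: Int }
final case class DenseVec(values: Array[Double]) extends Vec {
  def size: Int = values.length
  // Keep only the non-zero entries, like Spark's toSparse.
  def toSparseVec: SparseVec =
    SparseVec(values.length,
              values.indices.filter(values(_) != 0.0).toArray,
              values.filter(_ != 0.0))
}
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

// Invariant container, like RDD[T]: a Box[SparseVec] is NOT a Box[Vec],
// even though SparseVec <: Vec.
class Box[T](val value: T)

object InvarianceDemo {
  def main(args: Array[String]): Unit = {
    val sparse = DenseVec(Array(0.0, 1.0, 2.0)).toSparseVec

    // Does not compile: inference gives Box[SparseVec], and Box is invariant.
    // val bad: Box[Vec] = new Box(sparse)

    // Compiles: up-cast the element when the container is created,
    // just like annotating `.toSparse: Vector` in the Spark code.
    val ok: Box[Vec] = new Box[Vec](sparse)
    println(ok.value.size)
  }
}
```

In the original code the up-cast never happens, so map infers RDD[(Long, SparseVector)], filter preserves that type, and it fails to conform to the declared return type.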
As for filtering out the non-integer ids, I would say your method (the commented-out util.Try line plus the filter) is a good way to do that.
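If you would rather skip bad lines in one pass instead of mapping them to 0L and filtering afterwards, one option is to parse each line into an Option and use flatMap, so non-numeric ids are dropped without ever throwing. A sketch, where parseLine is a hypothetical helper and the Array[Double] stands in for the values you would hand to Vectors.dense:

```scala
import scala.util.Try

object LineParser {
  // Parse one CSV line; returns None when the id field is not a valid Long.
  def parseLine(line: String): Option[(Long, Array[Double])] = {
    val values = line.split(",")
    Try(values(0).toLong).toOption.map { id =>
      (id, values.drop(1).map(_.toDouble))
    }
  }
}
```

With this helper, sc.textFile(path).flatMap(LineParser.parseLine) yields only the lines with numeric ids, and no separate filter step is needed.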