
Spark Scala reading text file with map and filter

I have a text file with the following format ( id,f1,f2,f3,...,fn ):

12345,0,0,1,2,...,3
23456,0,0,1,2,...,0
33333,0,1,1,0,...,0
56789,1,0,0,0,...,4
a_123,0,0,0,6,...,3

I want to read the file (ignoring lines like a_123,0,0,0,6,...,3 ) to create an RDD[(Long, Vector)] . Here's my solution:

  def readDataset(path: String, sparkSession: SparkSession): RDD[(ItemId, Vector)] = {
    val sc = sparkSession.sparkContext
    sc.textFile(path)
      .map { line =>
        val values = line.split(",")
        (
          values(0).toLong,
          //util.Try(values(0).toLong).getOrElse(0L),
          Vectors.dense(values.slice(1, values.length).map { x => x.toDouble }).toSparse
        )
      }
      .filter(x => x._1 > 0)
  }

However, this code does not compile:

[ERROR]  found   : org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.SparseVector)]
[ERROR]  required: org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR]     (which expands to)  org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)]
[ERROR] Note: (Long, org.apache.spark.ml.linalg.SparseVector) <: (Long, org.apache.spark.ml.linalg.Vector), but class RDD is invariant in type T.
[ERROR] You may wish to define T as +T instead. (SLS 4.5)
[ERROR]       .filter(x => x._1 > 0)
[ERROR]              ^
[ERROR] one error found

But if I remove either the .toSparse or the .filter(x => x._1 > 0) , the code compiles successfully.

Does someone know why this happens, and what I should do to fix it?

Also, is there a better way to read the file into an RDD while ignoring the lines with non-numeric ids?

The code compiles successfully if you remove toSparse because then the element type of your pair RDD is (ItemId, Vector) , which matches the declared return type.

org.apache.spark.ml.linalg.Vector is the abstract supertype of both dense and sparse vectors. Vectors.dense returns a value statically typed as Vector , but calling toSparse on it yields an org.apache.spark.ml.linalg.SparseVector , so the map produces an RDD[(Long, SparseVector)] . Because RDD is invariant in its element type, RDD[(Long, SparseVector)] is not a subtype of RDD[(Long, Vector)] , which is exactly what the compiler error says. Removing the filter also makes it compile because the map is then the last call: its type parameter is inferred from the declared return type, so the tuple is upcast to (Long, Vector) at the point where it is built. With the filter in between, the map 's element type is fixed to (Long, SparseVector) before the filter runs, and the mismatch surfaces at the filter call.
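Given that diagnosis, one minimal fix that keeps the structure of the question's code is to ascribe the sparse vector back to the supertype Vector inside the tuple, so the map is inferred as RDD[(Long, Vector)] and the filter preserves that type. This is a sketch of that one fix, not the only possible one:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

def readDataset(path: String, sparkSession: SparkSession): RDD[(Long, Vector)] = {
  sparkSession.sparkContext.textFile(path)
    .map { line =>
      val values = line.split(",")
      // The ascription `: Vector` upcasts the SparseVector when the tuple
      // is built, so the RDD's element type stays (Long, Vector).
      (
        values(0).toLong,
        Vectors.dense(values.slice(1, values.length).map(_.toDouble)).toSparse: Vector
      )
    }
    .filter(_._1 > 0)
}
```

Annotating the call explicitly as map[(Long, Vector)] { ... } achieves the same effect.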

As for filtering out the non-numeric ids, your method (parse, then filter) is a reasonable way to do it.
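That said, the commented-out util.Try line in the question already points at a slightly cleaner option: parse each line into an Option and flatMap , so lines with a non-numeric id (or a malformed feature value) are dropped in one pass instead of being mapped to a sentinel and filtered afterwards. A sketch, where parseLine is a hypothetical helper name not taken from the original post:

```scala
import scala.util.Try

// Hypothetical helper: returns None when the id (or any feature value)
// fails to parse, Some((id, features)) otherwise.
def parseLine(line: String): Option[(Long, Array[Double])] = {
  val values = line.split(",")
  for {
    id    <- Try(values(0).toLong).toOption
    feats <- Try(values.slice(1, values.length).map(_.toDouble)).toOption
  } yield (id, feats)
}

// On the RDD, bad lines simply disappear in the flatMap:
// sc.textFile(path)
//   .flatMap(parseLine)
//   .mapValues(feats => Vectors.dense(feats).toSparse: Vector)
```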

Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; please credit this site or the original source when reposting.

© 2020-2024 STACKOOM.COM