RDD filter in scala spark
I have a dataset and I want to extract the (review/text) records whose (review/time) falls between x and y, for example (1183334400 < time < 1185926400).
Here is part of my data:
product/productId: B000278ADA
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: A17KXW1PCUAIIN
review/profileName: Mark Anthony "Mark"
review/helpfulness: 4/4
review/score: 5.0
review/time: 1174435200
review/summary: Jobst UltraSheer Knee High Stockings
review/text: Does a very good job of relieving fatigue.
product/productId: B000278ADB
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: A9Q3932GX4FX8
review/profileName: Trina Wehle
review/helpfulness: 1/1
review/score: 3.0
review/time: 1352505600
review/summary: Delivery was very long wait.....
review/text: It took almost 3 weeks to recieve the two pairs of stockings .
product/productId: B000278ADB
product/title: Jobst Ultrasheer 15-20 Knee-High Silky Beige Large
product/price: 46.34
review/userId: AUIZ1GNBTG5OB
review/profileName: dgodoy
review/helpfulness: 1/1
review/score: 2.0
review/time: 1287014400
review/summary: sizes recomended in the size chart are not real
review/text: sizes are much smaller than what is recomended in the chart. I tried to put it and sheer it!.
My Spark-Scala code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}
object test1 {
  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("golabi1").setMaster("local")
    val sc = new SparkContext(conf1)
    val conf: Configuration = new Configuration
    conf.set("textinputformat.record.delimiter", "product/title:")
    val input1 = sc.newAPIHadoopFile("data/Electronics.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
    val lines = input1.map { text => text._2 }
    val filt = lines.filter(text => (text.toString.contains(tt => tt in (startdate until enddate))))
    filt.saveAsTextFile("data/filter1")
  }
}
But my code does not work. How can I filter these lines?
It is much simpler than that. Try this:
object test1 {
  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("golabi1").setMaster("local")
    val sc = new SparkContext(conf1)

    // startDate and endDate are assumed to be defined somewhere in scope
    def extractDateAndCompare(line: String): Boolean = {
      val from = line.indexOf("/time: ") + 7
      val to = line.indexOf("review/text: ") - 1
      val date = line.substring(from, to).toLong
      date > startDate && date < endDate
    }

    sc.textFile("data/Electronics.txt")
      .filter(extractDateAndCompare)
      .saveAsTextFile("data/filter1")
  }
}
I usually find that small intermediate helper methods like this make things much clearer. Of course, this assumes the boundary dates are defined somewhere and that the input file contains no format issues. I did that on purpose to keep this simple, but wrapping the parse in a Try, returning an Option, and using flatMap() would help you avoid errors if you have any.
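To make the last point concrete, here is a minimal sketch of that Try/Option/flatMap pattern in plain Scala (no Spark needed to show the idea; on an RDD you would call `lines.flatMap(extractDateSafe)` the same way). The `startDate`/`endDate` bounds are assumed values taken from the question's example window:

```scala
import scala.util.Try

object SafeFilterSketch {
  // Assumed bounds from the question's example (1183334400 < time < 1185926400)
  val startDate = 1183334400L
  val endDate   = 1185926400L

  // Returns Some(line) when the line holds a review/time inside the window,
  // None when parsing fails (missing fields, bad number) or the date is out of range.
  def extractDateSafe(line: String): Option[String] =
    Try {
      val from = line.indexOf("/time: ") + 7
      val to   = line.indexOf("review/text: ") - 1
      line.substring(from, to).trim.toLong
    }.toOption
      .filter(date => date > startDate && date < endDate)
      .map(_ => line)

  def main(args: Array[String]): Unit = {
    val sample = List(
      "review/time: 1184000000 review/text: in range",
      "review/time: 1352505600 review/text: out of range",
      "no time field at all"
    )
    // flatMap drops the malformed and out-of-range lines instead of throwing
    val kept = sample.flatMap(extractDateSafe)
    kept.foreach(println)
  }
}
```

Malformed lines simply become `None` and disappear in the `flatMap`, so a single bad record no longer kills the whole job.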
Also, your raw text format is a bit cumbersome; you might want to explore JSON, TSV files, or some other simpler alternative format.