
How to filter the data in spark-shell using Scala?

I have the data below, which needs to be filtered with Spark (Scala) so that I get only the ids of people who visited "Walmart" but not "Bestbuy". A store may appear multiple times for the same id, because a person can visit a store any number of times.

Input data:

id, store
1, Walmart
1, Walmart
1, Bestbuy
2, Target
3, Walmart
4, Bestbuy

Expected output: 3, Walmart

I have already got the output using DataFrames and running SQL queries on the Spark context. But is there any way to do this using groupByKey / reduceByKey etc., without DataFrames? Can someone help me with the code? After map -> groupByKey, a ShuffledRDD has been formed, and I am having difficulty filtering the CompactBuffer!

The code with which I got it using sqlContext is below:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Person(id: Int, store: String)

val people = sc.textFile("examples/src/main/resources/people.txt")
               .map(_.split(","))
               .map(p => Person(p(0).trim.toInt, p(1).trim))
people.registerTempTable("people")

val result = sqlContext.sql("select id, store from people left semi join (select id from people where store in('Walmart','Bestbuy') group by id having count(distinct store)=1) sample on people.id=sample.id and people.store='Walmart'")

The code which I am trying now is this, but I am stuck after the third step:

val data = sc.textFile("examples/src/main/resources/people.txt")
             .filter(!_.startsWith("id"))  // skip the header line
             .map(x => (x.split(",")(0), x.split(",")(1).trim))
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.map{case (x,y) =>
    val url = y.flatMap(x=> x.split(",")).toList
    if (!url.contains("Bestbuy") && url.contains("Walmart")){
        x.map(x=> (x,y))}}

If I do dataFiltered.collect(), I get Array[Any] = Array(Vector((3,Walmart)), (), ())

Please help me extract the output after this step.

To filter an RDD, just use RDD.filter:

val dataGroup = data.groupByKey()

val dataFiltered = dataGroup.filter {
  // keep only lists that contain Walmart but do not contain Bestbuy:
  case (x, y) =>
    val l = y.toList
    l.contains("Walmart") && !l.contains("Bestbuy")
}

dataFiltered.foreach(println) // prints: (3,CompactBuffer(Walmart))

// if you want to flatten this back to tuples of (id, store):
val result = dataFiltered.flatMap { case (id, stores) => stores.map(store => (id, store)) }

result.foreach(println) // prints: (3, Walmart)
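Since the question explicitly asks about reduceByKey-style solutions, here is a hedged sketch using aggregateByKey, which folds each id's stores directly into a Set instead of materializing a CompactBuffer of duplicates. It assumes the same people.txt file with an "id, store" header line, and only requires the spark-shell's built-in `sc`:

```scala
// Build (id, store) pairs from the raw file, skipping the header.
val pairs = sc.textFile("examples/src/main/resources/people.txt")
  .filter(!_.startsWith("id"))
  .map { line =>
    val parts = line.split(",")
    (parts(0).trim, parts(1).trim)   // (id, store)
  }

// Aggregate each id's stores into a Set: within a partition add each store
// to the set, across partitions union the sets. Duplicates collapse for free.
val storeSets = pairs.aggregateByKey(Set.empty[String])(_ + _, _ ++ _)

// Keep only ids whose set contains Walmart but not Bestbuy.
val walmartOnly = storeSets
  .filter { case (_, stores) => stores("Walmart") && !stores("Bestbuy") }
  .keys

walmartOnly.foreach(println) // prints: 3
```

Because the per-id state is a small Set rather than a buffer of all visits, this also avoids the Array[Any] problem: filter keeps the pair type intact, whereas map with an if and no else silently widens the result to Any.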

I also tried it another way and it worked out:

val data = sc.textFile("examples/src/main/resources/people.txt")
             .filter(!_.startsWith("id"))  // skip the header line
             .map(x => (x.split(",")(0), x.split(",")(1)))
data.cache()
val dataWalmart = data.filter{case (x,y) => y.contains("Walmart")}.distinct()
val dataBestbuy = data.filter{case (x,y) => y.contains("Bestbuy")}.distinct()
val result = dataWalmart.subtractByKey(dataBestbuy)
data.unpersist()
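To trace how subtractByKey resolves on the sample input (variable names follow the snippet above): after distinct, dataWalmart holds (1, Walmart) and (3, Walmart), and dataBestbuy holds (1, Bestbuy) and (4, Bestbuy). subtractByKey drops every pair from dataWalmart whose key also appears in dataBestbuy, so id 1 is removed and only id 3 survives:

```scala
// result = dataWalmart.subtractByKey(dataBestbuy)
// keys of dataBestbuy = {1, 4}; removing those keys from dataWalmart
// leaves exactly one pair:
result.collect().foreach(println) // prints: (3, Walmart)
```

Note that subtractByKey only compares keys, which is exactly what the requirement needs here: any Bestbuy visit disqualifies the id regardless of its Walmart visits.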
