How do I filter data in spark-shell using Scala?

Question

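I have the data below and need to process it with Spark (Scala) so that I get only the IDs of people who visited "Walmart" but not "Bestbuy". A store may appear more than once for the same ID, because a person can visit a store any number of times.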

Input data:

ID,store
1,Walmart
1,Bestbuy
2,Target
3,Walmart
4,Bestbuy

Expected output: 3,Walmart

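I already get this output using DataFrames, by running a SQL query on the Spark context. But is there a way to do it without DataFrames, using groupByKey / reduceByKey and so on? Could someone help me with the code? After map -> groupByKey a ShuffledRDD has been formed, and I am having trouble filtering the CompactBuffer!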

The code that works for me with sqlContext is below:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Person(id: Int, store: String)

val people = sc.textFile("examples/src/main/resources/people.txt")
               .map(_.split(","))
               .map(p => Person(p(0).trim.toInt, p(1).trim)) // column 0 is the id, column 1 is the store
people.registerTempTable("people")

val result = sqlContext.sql("select id, store from people left semi join (select id from people where store in('Walmart','Bestbuy') group by id having count(distinct store)=1) sample on people.id=sample.id and people.store='Walmart'")

The code I am trying now is this, but I am stuck after the third step:

val data = sc.textFile("examples/src/main/resources/people.txt")
             .map(x=> (x.split(",")(0),x.split(",")(1)))
             .filter { case (id, _) => id.trim.toLowerCase != "id" } // drop the header row
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.map{case (x,y) =>
    val url = y.flatMap(x=> x.split(",")).toList
    if (!url.contains("Bestbuy") && url.contains("Walmart")){
        x.map(x=> (x,y))}}

If I do dataFiltered.collect(), I get Array[Any] = Array(Vector((3,Walmart)), (), ())

Please help me with how to extract the output after this step.

Answer 1

To filter an RDD, just use RDD.filter:

val dataGroup = data.groupByKey()

val dataFiltered = dataGroup.filter {
  // keep only lists that contain Walmart but do not contain Bestbuy:
  case (x, y) => val l = y.toList; l.contains("Walmart") && !l.contains("Bestbuy")
}

dataFiltered.foreach(println) // prints: (3,CompactBuffer(Walmart))

// if you want to flatten this back to tuples of (id, store):
val result = dataFiltered.flatMap { case (id, stores) => stores.map(store => (id, store)) }

result.foreach(println) // prints: (3,Walmart)
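
Since the question also asks about reduceByKey, here is a minimal sketch of that variant. It assumes it is pasted into spark-shell, where `sc` is already defined, and it builds the sample rows from the question with sc.parallelize instead of reading the file. The names `visits`, `flags` and `walmartOnly` are only illustrative. Each visit is reduced to a pair of flags per id, so no CompactBuffer has to be built or filtered:

```scala
// Assumption: running inside spark-shell, so `sc` (the SparkContext) already exists.
// Sample rows from the question, as (id, store) pairs.
val visits = sc.parallelize(Seq(
  ("1", "Walmart"), ("1", "Bestbuy"), ("2", "Target"),
  ("3", "Walmart"), ("4", "Bestbuy")))

// Turn each visit into (sawWalmart, sawBestbuy) and OR the flags together per id.
val flags = visits
  .mapValues(store => (store == "Walmart", store == "Bestbuy"))
  .reduceByKey { case ((w1, b1), (w2, b2)) => (w1 || w2, b1 || b2) }

// Keep only the ids that visited Walmart and never visited Bestbuy.
val walmartOnly = flags
  .filter { case (_, (sawWalmart, sawBestbuy)) => sawWalmart && !sawBestbuy }
  .keys

walmartOnly.collect().foreach(println) // prints: 3
```

Compared with groupByKey, this combines values on each partition before the shuffle, which may matter when a single id has many visits.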

Answer 2

I also tried another approach, and it worked:

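val data = sc.textFile("examples/src/main/resources/people.txt")
             .filter(line => !line.toLowerCase.startsWith("id")) // skip the header row
             .map(_.split(","))
             .map(fields => (fields(0).trim, fields(1).trim))
data.cache()
// (id, store) pairs per store, de-duplicated because a person may visit repeatedly
val dataWalmart = data.filter { case (_, store) => store.contains("Walmart") }.distinct()
val dataBestbuy = data.filter { case (_, store) => store.contains("Bestbuy") }.distinct()
// keep Walmart visitors whose id never appears among the Bestbuy visitors
val result = dataWalmart.subtractByKey(dataBestbuy)
result.collect().foreach(println) // run the job while `data` is still cached
data.unpersist() // RDDs have no uncache(); unpersist() releases the cached data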
