How to filter the data in spark-shell using Scala?
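I have the data below and need to filter it with Spark (Scala) so that I get only the ids of people who visited "Walmart" but not "Bestbuy". A store can appear more than once, because a person can visit the same store any number of times.
Input data:
id,store
1,Walmart
1,Walmart
1,Bestbuy
2,Target
3,Walmart
4,Bestbuy
Expected output: 3,Walmart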
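I already got this output using DataFrames and running SQL queries on the Spark context. But is there a way to do it without DataFrames, using groupByKey / reduceByKey and so on? Can someone help me with the code? After map -> groupByKey a ShuffledRDD has been formed, and I am having trouble filtering the CompactBuffer!
The code I used to get the result with sqlContext is below: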
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Person(id: Int, store: String)
val people = sc.textFile("examples/src/main/resources/people.txt")
.map(_.split(","))
.map(p => Person(p(0).trim.toInt, p(1).trim)) // column 0 is the id, column 1 is the store
people.registerTempTable("people")
val result = sqlContext.sql("select id, store from people left semi join (select id from people where store in('Walmart','Bestbuy') group by id having count(distinct store)=1) sample on people.id=sample.id and people.store='Walmart'")
The code I am trying now is below, but I am stuck after the third step:
val data = sc.textFile("examples/src/main/resources/people.txt")
.map(x=> (x.split(",")(0),x.split(",")(1)))
.filter { case (id, _) => id != "id" } // drop the header row
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.map{case (x,y) =>
val url = y.flatMap(x=> x.split(",")).toList
if (!url.contains("Bestbuy") && url.contains("Walmart")){
x.map(x=> (x,y))}}
If I call dataFiltered.collect(), I get Array[Any] = Array(Vector((3,Walmart)), (), ())
Please help me extract the output after this step.
To filter an RDD, just use RDD.filter:
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.filter {
// keep only lists that contain Walmart but do not contain Bestbuy:
case (x, y) => val l = y.toList; l.contains("Walmart") && !l.contains("Bestbuy")
}
dataFiltered.foreach(println) // prints: (3,CompactBuffer(Walmart))
// if you want to flatten this back to tuples of (id, store):
val result = dataFiltered.flatMap { case (id, stores) => stores.map(store => (id, store)) }
result.foreach(println) // prints: (3,Walmart)
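A side note on why the attempt in the question produced () entries: in Scala 2 an if with no else behaves as if else () had been written, so map emits a Unit value for every id that fails the test and the element type widens to Any. The sketch below reproduces this with plain Scala collections standing in for the grouped RDD (the grouped value and its contents are illustrative, not taken from a real run) and writes the else branch out explicitly. It then shows collect with a guarded pattern, which filters and maps in one step; RDD has a collect overload that takes a partial function in the same way.

```scala
// Plain Scala collections stand in for the result of groupByKey (illustrative data).
val grouped: Seq[(String, Seq[String])] = Seq(
  "1" -> Seq("Walmart", "Walmart", "Bestbuy"),
  "2" -> Seq("Target"),
  "3" -> Seq("Walmart"),
  "4" -> Seq("Bestbuy")
)

// The else branch is written out explicitly: ids that fail the test become (),
// and the element type widens to Any.
val withUnits: Seq[Any] = grouped.map { case (id, stores) =>
  if (stores.contains("Walmart") && !stores.contains("Bestbuy")) (id, stores) else ()
}

// collect with a guarded pattern keeps only the matching ids and maps them in one step.
val kept: Seq[(String, Seq[String])] = grouped.collect {
  case (id, stores) if stores.contains("Walmart") && !stores.contains("Bestbuy") => (id, stores)
}

println(withUnits)
println(kept)
```

Here withUnits holds three () entries plus the tuple for id 3, while kept holds only the entry for id 3, which is why filter (or collect with a partial function) is the better fit than map with a one-armed if.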
I also tried another approach, and it solved the problem:
val data = sc.textFile("examples/src/main/resources/people.txt")
.filter(!_.contains("id")) // drop the header row
.map(x=> (x.split(",")(0),x.split(",")(1)))
data.cache()
val dataWalmart = data.filter{case (x,y) => y.contains("Walmart")}.distinct()
val dataBestbuy = data.filter{case (x,y) => y.contains("Bestbuy")}.distinct()
// keep the Walmart ids that never appear among the Bestbuy ids
val result = dataWalmart.subtractByKey(dataBestbuy)
data.unpersist() // release the cache once result has been computed
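Since the question also asks about reduceByKey, here is a minimal sketch of that route. It assumes the spark-shell's predefined sc and inlines the sample rows instead of reading people.txt; the names rows, flags, perId and walmartOnly are illustrative and do not come from the thread. Each visit is turned into a pair of Boolean flags, and the flags are OR-ed per id, so no per-id list of stores has to be built.

```scala
// Assumes the spark-shell's predefined SparkContext `sc`; sample rows are inlined.
val rows = sc.parallelize(Seq(
  "1,Walmart", "1,Walmart", "1,Bestbuy", "2,Target", "3,Walmart", "4,Bestbuy"
))

// Map every visit to (id, (sawWalmart, sawBestbuy)).
val flags = rows.map { line =>
  val Array(id, store) = line.split(",").map(_.trim)
  (id, (store == "Walmart", store == "Bestbuy"))
}

// OR the flags per id; reduceByKey merges values pairwise instead of collecting them.
val perId = flags.reduceByKey { case ((w1, b1), (w2, b2)) => (w1 || w2, b1 || b2) }

// Keep the ids that saw Walmart and never saw Bestbuy.
val walmartOnly = perId.collect { case (id, (true, false)) => id }

walmartOnly.collect().foreach(println)
```

For the sample rows only id 3 ends up with the flags (true, false), so it is the only id that survives the last step.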