scala-spark: back to back clicks
I am learning Scala and Spark, and am new to both technologies.
Suppose I have a file like this:
"1421453179.157" P0105451998 "SCREEN"
"1421453179.157" P0106586529 "PRESENTATION"
"1421453179.157" P0108481590 NULL
"1421453179.157" P0108481590 "SCREEN"
"1421453179.157" P0112397365 "FULL_SCREEN"
"1421453179.157" P0113994553 "FULL_SCREEN"
"1421453179.158" P0112360870 "DATA_INFO" dataId:5913974361807341112
"1421453179.159" P0112360870 "DATA_INFO" dataId:7658923479992321112
"1421453179.160" P0108137271 "SCREEN"
"1421453179.161" P0103681986 "SCREEN"
"1421453179.162" P0104229251 "PRESENTATION"
The first column is the time, the second is the user_id, and depending on the meaning of the third column there may be data in the fourth column.
Here is what I want to accomplish:
I want to find back-to-back DATA_INFO records and generate the following:
P0112360870, 5913974361807341112|7658923479992321112
In plain language, this row says that user P0112360870 clicked 5913974361807341112|7658923479992321112, with the first click at the start; here 5913974361807341112 is the first click.
I started with the following:
val data=sc.textFile("hdfs://*").map(line=> {val tks=line.split("\t",3); (tks(1),(tks(0),tks(2))) } )
val data2=data.groupBy( a=> a._1).take(1000)
But I don't know how to proceed from here.
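For reference, a minimal sketch of one way to continue from that groupBy idea (assuming the input really is tab-separated and that only the DATA_INFO rows matter; the names here are made up for illustration, and the answers that follow take their own routes):

// Sketch: continue from the asker's `data` RDD of (user_id, (time, rest)).
// Assumes `rest` for click rows looks like: "DATA_INFO" dataId:5913974361807341112
val clicksPerUser = data
  .filter { case (_, (_, rest)) => rest.contains("DATA_INFO") }
  .groupByKey()
  .mapValues { events =>
    events.toSeq
      .sortBy(_._1)                   // earliest click first (same-length timestamps sort lexicographically)
      .map(_._2.split("dataId:")(1))  // keep only the id after "dataId:"
      .mkString("|")
  }
clicksPerUser.collect().foreach { case (user, ids) => println(s"$user, $ids") }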
val data = sc.textFile("hdfs://*").map(line => line.split("\t").toList)
// you probably want only those rows that actually carry some data.
val filteredData = data.filter(l => l.length > 3)
// group the *filtered* data by the user id in the second column.
val groupedData = filteredData.groupBy(l => l(1))
val iWantedThis = groupedData.map { case (pxxx, iterOfList) =>
  // every pxxx group will have at least one entry with data.
  val firstData = iterOfList.head(3).stripPrefix("dataId:")
  // now concatenate all the other dataIds onto the first one.
  val datas = iterOfList.tail.foldLeft(firstData)((fd, l) => fd + "|" + l(3).stripPrefix("dataId:"))
  // return the string with \t as separator.
  List(pxxx, datas).mkString("\t")
}
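A quick way to sanity-check the snippet above in the spark-shell (a sketch only; the tab-separated sample rows and names like testData stand in for the real HDFS file):

// Hypothetical smoke test with two of the sample rows, tab-separated.
val sample = Seq(
  "1421453179.158\tP0112360870\t\"DATA_INFO\"\tdataId:5913974361807341112",
  "1421453179.159\tP0112360870\t\"DATA_INFO\"\tdataId:7658923479992321112"
)
val testData = sc.parallelize(sample).map(_.split("\t").toList)
val result = testData
  .filter(_.length > 3)
  .groupBy(_(1))
  .map { case (user, rows) =>
    val ids = rows.map(_(3).stripPrefix("dataId:"))
    user + "\t" + ids.mkString("|")
  }
result.collect().foreach(println)
// prints (with a tab separator): P0112360870<TAB>5913974361807341112|7658923479992321112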
I think your approach goes wrong right at the start. If you know your key, set it up as a proper key-value tuple with the following:
sc.textFile("hdfs://*")
  .map(_.split("\t", 3))                   // split on tabs
  .map(tks => (tks(1), (tks(0), tks(2))))  // create a (key, Tuple2) pairing
  .reduceByKey((x, y) =>
    // the event text lives in the *second* element of the value tuple
    if (x._2 contains "DATA_INFO") (s"${x._2}|${y._2}".replace("dataId:", ""), "")
    else x                                 // ignore duplicate non-DATA_INFO elements by dropping?????
  )
The biggest thing to note is that you need to handle the else case in whatever way is appropriate for your data.
Per the request for clarification:
(s"${x._2}|${y._2}".replace("dataId:",""), "") //Using string interpolation
is the same as
val concatenatedString = x._2 +"|"+y._2
val concatStringWithoutMetaData = concatenatedString.replace("dataId:","")
(concatStringWithoutMetaData, "") //Return the new string with an empty final column
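The else branch exists only because non-DATA_INFO rows reach the reduce at all. A sketch of one way to sidestep it entirely (assuming only the dataId rows matter for the output) is to filter before keying:

sc.textFile("hdfs://*")
  .map(_.split("\t", 4))                  // keep the dataId column as its own field
  .filter(tks => tks.length > 3 && tks(2).contains("DATA_INFO"))
  .map(tks => (tks(1), tks(3).replace("dataId:", "")))
  .reduceByKey(_ + "|" + _)               // no else case left to worry about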
Using the spark-shell (Spark's REPL) to test your ideas is often very useful, especially when you are new to it. Run the spark shell (it is in bin/spark-shell) and then create a test dataset:
val input = """
"1421453179.157" P0105451998 "SCREEN"
"1421453179.157" P0106586529 "PRESENTATION"
"1421453179.157" P0108481590 NULL
"1421453179.157" P0108481590 "SCREEN"
"1421453179.157" P0112397365 "FULL_SCREEN"
"1421453179.157" P0113994553 "FULL_SCREEN"
"1421453179.158" P0112360870 "DATA_INFO" dataId:5913974361807341112
"1421453179.159" P0112360870 "DATA_INFO" dataId:7658923479992321112
"1421453179.160" P0108137271 "SCREEN"
"1421453179.161" P0103681986 "SCREEN"
"1421453179.162" P0104229251 "PRESENTATION""""
sc.parallelize(input.split("\n").map(_.trim)).map(_.split("\\s+")).
filter(_.length > 3). // take only > 3 (so containing dataId)
map(a => a(1) -> a(3).split(":")(1) ). // create a pair for each row your user -> click
reduceByKey(_ + "|" + _). // reduce clicks per user
collect // get it to the driver
When you run it, you should see more or less the following:
res0: Array[(String, String)] = Array((P0112360870,5913974361807341112|7658923479992321112))
I guess that is what you are looking for.
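One caveat, since the question asks for the first click to come first: reduceByKey gives no ordering guarantee when partial results are merged across partitions. If click order matters, a sketch of a safer variant (same parsing as above) is to sort each user's clicks by timestamp explicitly:

// Sketch: group per user and sort by timestamp so the earliest click leads.
sc.parallelize(input.split("\n").map(_.trim))
  .map(_.split("\\s+"))
  .filter(_.length > 3)
  .map(a => (a(1), (a(0), a(3).split(":")(1))))  // user -> (time, click)
  .groupByKey()
  .mapValues(_.toSeq.sortBy(_._1).map(_._2).mkString("|"))
  .collect()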