scala-spark: back to back clicks
I am learning Scala and Spark, and am new to both technologies.
Suppose I have a file like this:
"1421453179.157" P0105451998 "SCREEN"
"1421453179.157" P0106586529 "PRESENTATION"
"1421453179.157" P0108481590 NULL
"1421453179.157" P0108481590 "SCREEN"
"1421453179.157" P0112397365 "FULL_SCREEN"
"1421453179.157" P0113994553 "FULL_SCREEN"
"1421453179.158" P0112360870 "DATA_INFO" dataId:5913974361807341112
"1421453179.159" P0112360870 "DATA_INFO" dataId:7658923479992321112
"1421453179.160" P0108137271 "SCREEN"
"1421453179.161" P0103681986 "SCREEN"
"1421453179.162" P0104229251 "PRESENTATION"
The first column is the time, the second is the user_id, and depending on the meaning of the third column there may be data in the fourth column.
Here is what I want to accomplish:
I want to find back-to-back DATA_INFO records and generate the following:
P0112360870, 5913974361807341112|7658923479992321112
In plain language, this row says that user P0112360870 clicked 5913974361807341112|7658923479992321112, with the first click at the start; here 5913974361807341112 is the first click.
I started with the following:
val data=sc.textFile("hdfs://*").map(line=> {val tks=line.split("\t",3); (tks(1),(tks(0),tks(2))) } )
val data2=data.groupBy( a=> a._1).take(1000)
But I don't know how to proceed from here.
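For reference, a minimal sketch of one way to continue from that groupBy idea (assuming the input really is tab-separated and that only the DATA_INFO rows matter; the names here are made up for illustration, and the answers that follow take their own routes):

// Sketch: continue from the asker's `data` RDD of (user_id, (time, rest)).
// Assumes `rest` for click rows looks like: "DATA_INFO" dataId:5913974361807341112
val clicksPerUser = data
  .filter { case (_, (_, rest)) => rest.contains("DATA_INFO") }
  .groupByKey()
  .mapValues { events =>
    events.toSeq
      .sortBy(_._1)                   // earliest click first (same-length timestamps sort lexicographically)
      .map(_._2.split("dataId:")(1))  // keep only the id after "dataId:"
      .mkString("|")
  }
clicksPerUser.collect().foreach { case (user, ids) => println(s"$user, $ids") }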
val data = sc.textFile("hdfs://*").map(line => line.split("\t").toList)
// you probably want only those rows that actually carry some data.
val filteredData = data.filter(l => l.length > 3)
// group the *filtered* data by the user id in the second column.
val groupedData = filteredData.groupBy(l => l(1))
val iWantedThis = groupedData.map { case (pxxx, iterOfList) =>
  // every pxxx group will have at least one entry with data.
  val firstData = iterOfList.head(3).stripPrefix("dataId:")
  // now concatenate all the other dataIds onto the first one.
  val datas = iterOfList.tail.foldLeft(firstData)((fd, l) => fd + "|" + l(3).stripPrefix("dataId:"))
  // return the string with \t as separator.
  List(pxxx, datas).mkString("\t")
}
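A quick way to sanity-check the snippet above in the spark-shell (a sketch only; the tab-separated sample rows and names like testData stand in for the real HDFS file):

// Hypothetical smoke test with two of the sample rows, tab-separated.
val sample = Seq(
  "1421453179.158\tP0112360870\t\"DATA_INFO\"\tdataId:5913974361807341112",
  "1421453179.159\tP0112360870\t\"DATA_INFO\"\tdataId:7658923479992321112"
)
val testData = sc.parallelize(sample).map(_.split("\t").toList)
val result = testData
  .filter(_.length > 3)
  .groupBy(_(1))
  .map { case (user, rows) =>
    val ids = rows.map(_(3).stripPrefix("dataId:"))
    user + "\t" + ids.mkString("|")
  }
result.collect().foreach(println)
// prints (with a tab separator): P0112360870<TAB>5913974361807341112|7658923479992321112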
I think your approach goes wrong right at the start. If you know your key, set it up as a proper key-value tuple with the following:
sc.textFile("hdfs://*")
  .map(_.split("\t", 3))                   // split on tabs
  .map(tks => (tks(1), (tks(0), tks(2))))  // create a (key, Tuple2) pairing
  .reduceByKey((x, y) =>
    // the event text lives in the *second* element of the value tuple
    if (x._2 contains "DATA_INFO") (s"${x._2}|${y._2}".replace("dataId:", ""), "")
    else x                                 // ignore duplicate non-DATA_INFO elements by dropping?????
  )
The biggest thing to note is that you need to handle the else case in whatever way is appropriate for your data.
Per the request for clarification:
(s"${x._2}|${y._2}".replace("dataId:",""), "") //Using string interpolation
is the same as
val concatenatedString = x._2 +"|"+y._2
val concatStringWithoutMetaData = concatenatedString.replace("dataId:","")
(concatStringWithoutMetaData, "") //Return the new string with an empty final column
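The else branch exists only because non-DATA_INFO rows reach the reduce at all. A sketch of one way to sidestep it entirely (assuming only the dataId rows matter for the output) is to filter before keying:

sc.textFile("hdfs://*")
  .map(_.split("\t", 4))                  // keep the dataId column as its own field
  .filter(tks => tks.length > 3 && tks(2).contains("DATA_INFO"))
  .map(tks => (tks(1), tks(3).replace("dataId:", "")))
  .reduceByKey(_ + "|" + _)               // no else case left to worry about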
Using the spark-shell (Spark's REPL) to test your ideas is often very useful, especially when you are new to it. Run the spark shell (it is in bin/spark-shell) and then create a test dataset:
val input = """
"1421453179.157" P0105451998 "SCREEN"
"1421453179.157" P0106586529 "PRESENTATION"
"1421453179.157" P0108481590 NULL
"1421453179.157" P0108481590 "SCREEN"
"1421453179.157" P0112397365 "FULL_SCREEN"
"1421453179.157" P0113994553 "FULL_SCREEN"
"1421453179.158" P0112360870 "DATA_INFO" dataId:5913974361807341112
"1421453179.159" P0112360870 "DATA_INFO" dataId:7658923479992321112
"1421453179.160" P0108137271 "SCREEN"
"1421453179.161" P0103681986 "SCREEN"
"1421453179.162" P0104229251 "PRESENTATION""""
sc.parallelize(input.split("\n").map(_.trim)).map(_.split("\\s+")).
filter(_.length > 3). // take only > 3 (so containing dataId)
map(a => a(1) -> a(3).split(":")(1) ). // create a pair for each row your user -> click
reduceByKey(_ + "|" + _). // reduce clicks per user
collect // get it to the driver
When you run it, you should see more or less the following:
res0: Array[(String, String)] = Array((P0112360870,5913974361807341112|7658923479992321112))
I guess that is what you are looking for.
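One caveat, since the question asks for the first click to come first: reduceByKey gives no ordering guarantee when partial results are merged across partitions. If click order matters, a sketch of a safer variant (same parsing as above) is to sort each user's clicks by timestamp explicitly:

// Sketch: group per user and sort by timestamp so the earliest click leads.
sc.parallelize(input.split("\n").map(_.trim))
  .map(_.split("\\s+"))
  .filter(_.length > 3)
  .map(a => (a(1), (a(0), a(3).split(":")(1))))  // user -> (time, click)
  .groupByKey()
  .mapValues(_.toSeq.sortBy(_._1).map(_._2).mkString("|"))
  .collect()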