scala-spark: back to back clicks

I am learning Scala and Spark, and I am new to both technologies:

Suppose I have a file like this:

"1421453179.157"        P0105451998  "SCREEN"   
"1421453179.157"        P0106586529  "PRESENTATION"     
"1421453179.157"        P0108481590   NULL    
"1421453179.157"        P0108481590  "SCREEN"        
"1421453179.157"        P0112397365  "FULL_SCREEN"   
"1421453179.157"        P0113994553  "FULL_SCREEN"   
"1421453179.158"        P0112360870  "DATA_INFO"    dataId:5913974361807341112
"1421453179.159"        P0112360870  "DATA_INFO"    dataId:7658923479992321112   
"1421453179.160"        P0108137271  "SCREEN"   
"1421453179.161"        P0103681986  "SCREEN"   
"1421453179.162"        P0104229251  "PRESENTATION"  

The first column is a time, the second is a user_id, and the meaning of the third column depends on the data in the fourth column.

Here is what I want to accomplish:

I want to find back-to-back DATA_INFO records and produce the following:

P0112360870, 5913974361807341112|7658923479992321112

In plain English, this row says that user P0112360870 clicked on 5913974361807341112|7658923479992321112; the first click should appear at the beginning, and here 5913974361807341112 is the first click.

I started with the following:

val data = sc.textFile("hdfs://*")
  .map { line => val tks = line.split("\t", 3); (tks(1), (tks(0), tks(2))) } // (user_id, (time, rest))
val data2 = data.groupBy(a => a._1).take(1000)

but I do not know how to proceed from here.
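A minimal sketch of one possible continuation, building on the data pairing above (this assumes only the DATA_INFO rows carry a dataId: payload, and that the quoted timestamps all have the same length, so a plain string sort preserves time order):

// Keep only DATA_INFO rows, group the clicks per user, and join the
// dataIds in timestamp order so the first click comes first.
val backToBack = data
  .filter { case (_, (_, rest)) => rest.contains("DATA_INFO") }
  .groupByKey()
  .mapValues { clicks =>
    clicks.toList
      .sortBy { case (time, _) => time }                       // first click first
      .map { case (_, rest) => rest.split("dataId:")(1).trim } // extract the id
      .mkString("|")
  }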

val data = sc.textFile("hdfs://*").map(line => line.split("\t").toList)

// You probably want only those Pxxx rows that carry some data.
val filteredData = data.filter(l => l.length > 3)

val groupedData = filteredData.groupBy(l => l(1))

val iWantedThis = groupedData.map { case (pxxx, iterOfList) =>
  // Every Pxxx group will have at least one entry with data.
  val firstData = iterOfList.head(3) // data column (index 3) of the first entry
  // Now concatenate all the other datas onto the first one.
  val datas = iterOfList.tail.foldLeft(firstData)((fd, l) => fd + "|" + l(3))
  // Return the string with \t as separator.
  List(pxxx, datas).mkString("\t")
}
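To inspect the result in the shell, something like iWantedThis.collect().foreach(println) should print one tab-separated line per user (note that this version keeps the dataId: prefixes on the values).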

I think your approach goes wrong right at the start. If you know your key, then set it up as a proper key-value tuple with the following:

sc.textFile("hdfs://*")
  .map(_.split("\t", 3))                  // split on tabs
  .map(tks => (tks(1), (tks(0), tks(2)))) // create a (key, Tuple2) pairing
  .reduceByKey(
    (x, y) =>
      if (x._2 contains "DATA_INFO") (s"${x._2}|${y._2}".replace("dataId:", ""), "")
      else x // ignore duplicate non-DATA_INFO elements by dropping them?
  )

The biggest thing to note is that you need to handle the else case as is appropriate for your data.
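One way to sidestep the else case entirely (a sketch, not part of the answer above) is to drop the non-DATA_INFO rows before reducing, so that every value reaching reduceByKey is a click:

sc.textFile("hdfs://*")
  .map(_.split("\t", 3))
  .filter(tks => tks.length > 2 && tks(2).contains("DATA_INFO")) // keep only click rows
  .map(tks => (tks(1), tks(2).split("dataId:")(1).trim))         // (user_id, click id)
  .reduceByKey(_ + "|" + _)                                      // concatenate clicks per user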

Per the request for clarification:

(s"${x._2}|${y._2}".replace("dataId:",""), "") //Using string interpolation

is the same as:

val concatenatedString = x._2 +"|"+y._2
val concatStringWithoutMetaData = concatenatedString.replace("dataId:","")
(concatStringWithoutMetaData, "") //Return the new string with an empty final column

It is often very useful to use the spark-shell (Spark's REPL) to test out your ideas, especially when you are new to it.

Run the Spark shell (found at bin/spark-shell), then create a test dataset:

val input = """
"1421453179.157"        P0105451998  "SCREEN"   
"1421453179.157"        P0106586529  "PRESENTATION"     
"1421453179.157"        P0108481590   NULL    
"1421453179.157"        P0108481590  "SCREEN"        
"1421453179.157"        P0112397365  "FULL_SCREEN"   
"1421453179.157"        P0113994553  "FULL_SCREEN"   
"1421453179.158"        P0112360870  "DATA_INFO"    dataId:5913974361807341112
"1421453179.159"        P0112360870  "DATA_INFO"    dataId:7658923479992321112   
"1421453179.160"        P0108137271  "SCREEN"   
"1421453179.161"        P0103681986  "SCREEN"   
"1421453179.162"        P0104229251  "PRESENTATION""""


sc.parallelize(input.split("\n").map(_.trim)).map(_.split("\\s+")).
  filter(_.length > 3). // keep only rows with more than 3 fields (i.e. those containing a dataId)
  map(a => a(1) -> a(3).split(":")(1)). // create a pair per row: user -> click
  reduceByKey(_ + "|" + _). // concatenate the clicks per user
  collect // bring the result to the driver

When you run it, you should see more or less the following:

res0: Array[(String, String)] = Array((P0112360870,5913974361807341112|7658923479992321112))

I guess this is what you are looking for.
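One caveat: reduceByKey makes no guarantee about the order in which values from different partitions are combined, so if the "first click must come first" requirement is strict, it is safer to sort each user's clicks by the timestamp column first (as in the groupByKey sketch after the question).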
