How to update values in a cell of a dataframe Spark Scala
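I am trying to implement the PageRank algorithm on the Reddit May2015 dataset, but I cannot extract the subreddits referenced in the comments. One column contains the name of the subreddit, and the other contains a comment posted in that subreddit that references another subreddit.
subreddit|body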
videos|"Tagged you as ""...
Quebec|Ok, c'est quoi le...
pokemon|Sorry to hear abo...
videos|Not sure what the...
ClashOfClans|Your submission, ...
realtech|Original /r/techn...
guns|Welp, those basta...
IAmA|If you are very i...
WTF|If you go on /r/w...
Fitness|Your submission h...
gifs|Hi! Take a look a...
Coachella|Yeah. If you go /...
Here is what I did:
val df = spark.read
  .format("csv")
  .option("header", "true")
  .load("path\\May2015.csv")
// Keep only the comments that mention another subreddit.
val df1 = df.filter(df("body").contains("/r/")).select("subreddit", "body")
val lines = df1.rdd
// Pair each subreddit with the first "/r/..." token found in the comment.
val links = lines.map { s =>
  val x = s(1).toString.split(" ")
  val b = x.filter(_.startsWith("/r/")).toList
  val t = b(0)
  (s(0), t)
}.distinct().groupByKey().cache()
var ranks = links.mapValues(v => 0.25)
for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
The problem is that the output is always:
(subreddit, CompactBuffer())
whereas what I want is:
(subreddit, anothersubreddit)
I managed to solve that problem, but now I get another error:
> type mismatch; found : org.apache.spark.rdd.RDD[(String, Double)]
> required: org.apache.spark.rdd.RDD[(Any, Double)] Note: (String,
> Double) <: (Any, Double), but class RDD is invariant in type T. You
> may wish to define T as +T instead. (SLS 4.5)
> ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
The problem is probably here:
val links = lines.map { s =>
  val x = s(1).toString.split(" ")
  val b = x.filter(_.startsWith("/r/")).toList
  val t = b(0)
  (s(0), t)
  ...
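You need to avoid the first element of the tuple being typed as Any here. If you expect s(0) to hold a String, you can use an explicit cast such as s(0).asInstanceOf[String], go through the method s.getAs[String](0), or simply call s.getString(0).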
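To illustrate why the Any key matters: ranks is inferred from links, so its keys are Any, while contribs is keyed by the String urls, and an invariant container of one type cannot be assigned to a variable of the other. The following is a minimal sketch in plain Scala, without Spark. Box and row are hypothetical stand-ins for RDD and Row (they are not Spark APIs), chosen only because Box is invariant in T and row(0) returns Any.

```scala
object InvarianceSketch {
  // Hypothetical stand-in for RDD: invariant in T (no +T annotation).
  final class Box[T](val items: List[T])

  // Hypothetical stand-in for a Spark Row: indexing returns Any.
  val row: Vector[Any] = Vector("videos", "Original /r/technology post")

  // The key is typed as Any because row(0) is Any.
  val anyKeyed: Box[(Any, Double)] = new Box(List((row(0), 0.25)))

  // The key is typed as String after an explicit cast.
  val strKeyed: Box[(String, Double)] =
    new Box(List((row(0).asInstanceOf[String], 0.25)))

  // The next two lines mirror the failing assignment and do not compile,
  // because Box[(String, Double)] is not a subtype of Box[(Any, Double)]:
  //   var ranks = anyKeyed
  //   ranks = strKeyed
}
```

Once the key is a String from the start, both sides of the assignment have the same element type and the mismatch goes away.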
So a version that resolves the compile error might look like this:
val links = lines.map { s =>
  val x = s.getString(1).split(" ")
  val b = x.filter(_.startsWith("/r/")).toList
  val t = b(0)
  (s.getString(0), t)
  ...
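Separately from the compile error, b(0) throws when no whitespace-separated token starts with "/r/", for example when the reference is wrapped in parentheses. Below is a hedged sketch of a more forgiving extraction in plain Scala. The object SubredditRefs and the function firstRef are hypothetical names, not part of the original code, and stripping the "/r/" prefix (so that the result is comparable with the subreddit column) is an assumption about what the asker wants.

```scala
object SubredditRefs {
  // Matches "/r/" followed by one or more letters, digits, or underscores.
  private val Ref = """/r/(\w+)""".r

  // Returns the first referenced subreddit name without the "/r/" prefix,
  // or None when the body is null or contains no reference.
  def firstRef(body: String): Option[String] =
    Option(body).flatMap(b => Ref.findFirstMatchIn(b)).map(_.group(1))
}
```

In the Spark code this could be used as lines.flatMap(s => SubredditRefs.firstRef(s.getString(1)).map(t => (s.getString(0), t))), which silently drops rows without a usable reference instead of failing on them.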