
How to update values in a cell of a dataframe Spark Scala

I am trying to implement a PageRank algorithm on the Reddit May2015 dataset, but I can't manage to extract the subreddits referenced in the comments. One column contains the name of a subreddit, and the other contains a comment posted in that subreddit that references another subreddit.

   subreddit                body

      videos|"Tagged you as ""...
      Quebec|Ok, c'est quoi le...
     pokemon|Sorry to hear abo...
      videos|Not sure what the...
ClashOfClans|Your submission, ...
    realtech|Original /r/techn...
        guns|Welp, those basta...
        IAmA|If you are very i...
         WTF|If you go on /r/w...
     Fitness|Your submission h...
        gifs|Hi! Take a look a...
   Coachella|Yeah. If you go /...

What I did is this:

val df = spark.read
      .format("csv")
      .option("header", "true")
      .load("path\\May2015.csv")

val df1 = df.filter(df("body").contains("/r/")).select("subreddit", "body")

val lines = df1.rdd

val links = lines.map { s =>
  val x = s(1).toString.split(" ")
  val b = x.filter(_.startsWith("/r/")).toList
  val t = b(0)
  (s(0), t)
}.distinct().groupByKey().cache()

var ranks = links.mapValues(v => 0.25)

for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
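The per-comment extraction inside that `map` is plain Scala, so it can be checked without a Spark cluster. A minimal sketch of the same logic (`firstSubredditLink` is my name for it; returning an `Option` instead of calling `b(0)` is an adjustment not in the original, guarding against rows where `contains("/r/")` matched mid-word and no whitespace-separated token actually starts with `/r/`):

```scala
// Extract the first whitespace-separated token starting with "/r/".
// Mirrors: s(1).toString.split(" ").filter(_.startsWith("/r/")).toList; b(0)
// but yields None instead of throwing when the list is empty.
def firstSubredditLink(body: String): Option[String] =
  body.split(" ").find(_.startsWith("/r/"))

println(firstSubredditLink("Original /r/technology post"))  // Some(/r/technology)
println(firstSubredditLink("no link here"))                 // None
```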

The problem is that the output is always:

(subreddit, CompactBuffer())

While what I want is:

(subreddit, anothersubreddit)

I managed to solve this, but now I am getting another error:

> type mismatch;
>  found   : org.apache.spark.rdd.RDD[(String, Double)]
>  required: org.apache.spark.rdd.RDD[(Any, Double)]
> Note: (String, Double) <: (Any, Double), but class RDD is invariant in type T.
> You may wish to define T as +T instead. (SLS 4.5)
>       ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
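The error is about variance: `RDD[T]` is declared invariant in `T`, so an `RDD[(String, Double)]` is not a subtype of `RDD[(Any, Double)]` even though `(String, Double) <: (Any, Double)`. A pure-Scala sketch of the same rule, using `ArrayBuffer` (invariant, like `RDD`) versus `List` (covariant):

```scala
import scala.collection.mutable.ArrayBuffer

val strings: ArrayBuffer[(String, Double)] = ArrayBuffer(("a", 1.0))
// The next line does NOT compile: ArrayBuffer[A] is invariant, like RDD[T].
// val anys: ArrayBuffer[(Any, Double)] = strings

// List is declared List[+A] (covariant), so the same assignment is allowed:
val ok: List[(Any, Double)] = List(("a", 1.0))
println(ok.head)  // (a,1.0)
```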

The problem probably lies here:

val links = lines.map{ s =>
  val x = s(1).toString.split(" ")
  val b = x.filter(_.startsWith("/r/")).toList
  val t = b(0)
  (s(0), t)
...

You need to avoid having the first element of the tuple typed as `Any` here. If you expect `s(0)` to be a `String`, you can use an explicit cast like `s(0).asInstanceOf[String]`, the typed accessor `s.getAs[String](0)`, or even `s.getString(0)`.
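The reason a cast or typed accessor is needed: `Row.apply(i)` is declared to return `Any`, so indexing with `s(0)` erases the static type. A pure-Scala analogue using `Seq[Any]` (the values are hypothetical, just to show the typing):

```scala
// Indexing a Seq[Any] (like Row.apply) yields Any; a cast recovers String.
val row: Seq[Any] = Seq("videos", "Tagged you as ...")

val untyped = row(0)                             // static type: Any
val typed: String = row(0).asInstanceOf[String]  // explicit cast, as in the answer
println(typed)  // videos
```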

So, the version that solves the compile error may be as follows:

val links = lines.map{ s =>
  val x = s.getString(1).split(" ")
  val b = x.filter(_.startsWith("/r/")).toList
  val t = b(0)
  (s.getString(0), t)
...
