How can you view the result of RDD.join() in Scala?

I am trying to compute the result of a PageRank algorithm, where the scoring function is the number of outgoing links on a page.

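import scala.collection.JavaConverters._   // provides .asScala
import org.jsoup.Jsoup

// Build (URL, list of outgoing hrefs) pairs from the WARC records
val links = warcs.map{ wr => wr._2.getRecord() }.
                  map{ wb => {
                      val url = wb.getHeader().getUrl()
                      val d = Jsoup.parse(wb.getHttpStringBody())
                      val anchors = d.select("a").asScala
                      // One (source URL, href) pair per anchor tag
                      anchors.map(l => (url, l.attr("href"))).toIterator
                    }
                  }.
                  flatMap(identity).map(t => (t._1, List(t._2))).reduceByKey(_ ::: _)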
                



var ranks = warcs.map{ wr => wr._2.getRecord()}.
                  map{ wb => (wb.getHeader().getUrl(), Jsoup.parse(wb.getHttpStringBody()).select("a[href]").size())}.
                  filter{ l => l._2 > 0}

The links RDD has the form (URL, list of outgoing URLs), and ranks has the form (URL, number of outgoing URLs).

This is what the PageRank loop looks like:

for(i <- 1 to 10){
    val contribs = links.join(ranks).flatMap { case (url, (links, rank)) => links.map(dest => (dest, rank/links.size)) }

    ranks = contribs.reduceByKey((x,y) => x+y).mapValues(sum => (0.15 + 0.85*sum).toInt)
}
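To make the contribution step easier to inspect, here is a minimal sketch of one iteration on plain Scala collections instead of RDDs. The three-page graph, the `PageRankSketch` object, and the `step` function are hypothetical names made up for illustration, and the ranks are `Double` rather than `Int` as in the loop above. That last point is an assumption worth noting: with `Int` ranks, `rank/links.size` is integer division, and `.toInt` truncates the damped sum.

```scala
// Toy graph with plain Scala collections standing in for the RDDs (hypothetical data).
object PageRankSketch {
  val links: Map[String, List[String]] = Map(
    "a" -> List("b", "c"),
    "b" -> List("c"),
    "c" -> List("a")
  )

  // One iteration: each page splits its rank evenly among its outgoing links.
  def step(ranks: Map[String, Double]): Map[String, Double] = {
    val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (url, outs) =>
      // Inner-join semantics: a page without a rank contributes nothing.
      ranks.get(url).toList.flatMap(rank => outs.map(dest => (dest, rank / outs.size)))
    }
    // Sum the contributions per destination and apply the damping factor.
    contribs
      .groupBy(_._1)
      .map { case (dest, cs) => dest -> (0.15 + 0.85 * cs.map(_._2).sum) }
  }
}
```

Calling `PageRankSketch.step(Map("a" -> 1.0, "b" -> 1.0, "c" -> 1.0))` runs a single iteration. Because everything is in memory, the intermediate values can be printed directly, which is not the case for a lazily evaluated RDD.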

That said, when I try to check the results of the ranking algorithm, I get an IndexOutOfBoundsException. I tried checking whether the resulting RDD is empty by printing ranks.isEmpty(), and I get the same exception.

Out of curiosity, I also tried to view the result of links.join(ranks), but the same exception occurs.

What is going wrong with the join() operation, and how can I proceed?

It turns out the problem was in how I was loading the WARC files I was using:

val warcs = sc.newAPIHadoopFile(
              warcfile,
              classOf[WarcGzInputFormat],             // InputFormat
              classOf[NullWritable],                  // Key
              classOf[WarcWritable]                   // Value
            ).cache()

Removing .cache() stops the exceptions. I don't know why, though, so an explanation would still be welcome.
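One plausible explanation comes from the note in Spark's API documentation for newAPIHadoopFile: Hadoop's RecordReader reuses the same Writable object for each record, so caching the returned RDD directly creates many references to one object, and the documentation advises copying the records with a map before caching. Whether this is exactly what triggered the IndexOutOfBoundsException here is an assumption. The sketch below imitates the effect without Spark or Hadoop; `ReusedRecord`, `WritableReuseSketch`, and `reader` are hypothetical stand-ins, not real library classes.

```scala
// Hypothetical stand-in for a Hadoop Writable that a RecordReader reuses.
final class ReusedRecord(var payload: String)

object WritableReuseSketch {
  // Mimics a RecordReader: every element is the same object, mutated in place.
  def reader(payloads: Seq[String]): Iterator[ReusedRecord] = {
    val shared = new ReusedRecord("")
    payloads.iterator.map { p => shared.payload = p; shared }
  }

  val input = Seq("page-1", "page-2", "page-3")

  // Materializing the raw objects first keeps three references to one object,
  // so only the last record's content is visible afterwards.
  val cachedRaw: List[String] = reader(input).toList.map(_.payload)

  // Copying the needed data out of each record before materializing
  // keeps the three distinct values.
  val cachedCopies: List[String] = reader(input).map(_.payload).toList
}
```

In the Spark job, the analogous change would be to map each WarcWritable to an immutable value (for example the URL and the HTTP body as strings) and call .cache() on that mapped RDD rather than on the raw result of newAPIHadoopFile.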
