
Spark - Scala calculation

I want to use Spark and Scala to calculate the h-index ( https://en.wikipedia.org/wiki/H-index ) for a researcher from a CSV file with data in the following format:

R1:B, R1:A, R1:B, R2:C, R2:B, R2:A, R1:D, R1:B, R1:D, R2:B, R1:A, R1:B

The h-index is an academic indicator for a researcher. It is computed by creating, for each researcher, a single list of their publications sorted by count in descending order, e.g. R1: {A:10, B:5, C:1}, and then finding the last position where the value is greater than or equal to its index (here that is position 2, because at position 3 the value is 1 < 3).
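The rule described above can be sketched in plain Scala, independent of Spark. This is only a minimal sketch: the function name `hIndex` is hypothetical, positions are counted from 1, and it assumes the usual definition from the linked Wikipedia page (the largest h such that h publications each have at least h citations).

```scala
// Hypothetical helper: h-index from per-publication counts.
def hIndex(counts: Iterable[Int]): Int =
  counts.toSeq
    .sorted(Ordering[Int].reverse)             // highest count first
    .zipWithIndex                              // attach 0-based position
    .takeWhile { case (c, i) => c >= i + 1 }   // keep while count >= 1-based position
    .length

// Example from the question: R1 : { A:10, B:5, C:1 }
val r1 = hIndex(Seq(10, 5, 1))   // 2
```

Because the list is sorted in descending order while the position only grows, the condition can never become true again once it fails, so counting the leading run is enough.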

I cannot find a solution for Spark using Scala. Can anyone help?

In case you have a file like this:

R1:B, R1:A, R1:B, R2:C, R2:B, R2:A, R1:D, R1:B, R1:D, R2:B, R1:A, R1:B
R1:B, R1:A, R1:B, R2:C, R2:B, R2:A, R1:D, R1:B, R1:D, R2:B, R1:A, R1:B
R1:B, R1:A, R1:B, R2:C, R2:B, R2:A, R1:D, R1:B, R1:D, R2:B, R1:A, R1:B

Here are some thoughts:

// `input` is assumed to be an RDD[String], e.g. the result of sc.textFile on the file above
// add a count field to each researcher:paper pair
input.flatMap(line => line.split(", ").map(_ -> 1)).
  // count with researcher:paper as the key
  reduceByKey(_ + _).
  map { case (ra, count) =>
    // split researcher:paper
    val Array(author, article) = ra.split(":")
    // re-key so that the researcher becomes the new key
    author -> (article, count)
  }.
  // group the (paper, count) pairs by researcher
  groupByKey.collect

// res15: Array[(String, Iterable[(String, Int)])] = Array((R2,CompactBuffer((B,6), (A,3), (C,3))), (R1,CompactBuffer((A,6), (B,12), (D,6))))
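To go from a grouped result like the one above to one h-index per researcher, one option is to drop the paper name, keep only the counts, and apply the sorted-position rule per key. The following is a sketch under stated assumptions, not a definitive implementation: the object name `HIndexJob`, the helper `hIndex`, the default path `input.csv` and the `local[*]` master are all hypothetical choices for illustration, and each occurrence of a researcher:paper pair is treated as one citation, as in the counting code above.

```scala
import org.apache.spark.sql.SparkSession

object HIndexJob {
  // Hypothetical helper: number of leading positions (1-based) in the
  // descending list whose count is at least the position.
  def hIndex(counts: Iterable[Int]): Int =
    counts.toSeq
      .sorted(Ordering[Int].reverse)
      .zipWithIndex
      .takeWhile { case (c, i) => c >= i + 1 }
      .length

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("h-index")
      .master("local[*]")               // assumption: local run for testing
      .getOrCreate()

    // Assumption: the input path is passed as the first argument.
    val input = spark.sparkContext.textFile(args.headOption.getOrElse("input.csv"))

    val hIndexByAuthor = input
      .flatMap(_.split(",").map(_.trim).filter(_.nonEmpty)) // "R1:B", "R1:A", ...
      .map(_ -> 1)
      .reduceByKey(_ + _)                                   // count per researcher:paper
      .map { case (pair, count) => pair.split(":", 2)(0) -> count } // re-key by researcher
      .groupByKey()
      .mapValues(hIndex)

    hIndexByAuthor.collect().foreach { case (author, h) => println(s"$author -> $h") }
    spark.stop()
  }
}
```

With the counts shown in the result above, R1 has (12, 6, 6) and R2 has (6, 3, 3), so both would get an h-index of 3. For very prolific researchers `groupByKey` pulls all counts for one key into memory; that is acceptable for a sketch but may need revisiting on large data.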
