How to write map reduce in R?

I am new to R. I know how to write map reduce in Java and want to try the same in R. Can anyone provide some sample code, and is there a fixed format for MapReduce in R?

Please send any link other than this one: https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial

Any sample code would be most helpful.

When you want to implement a map reduce job (with Hadoop) in a language other than Java, you use a feature called streaming. The data is fed to the mapper via STDIN (readLines()), handed back to Hadoop via STDOUT (cat()), then passed to the reducer again through STDIN (readLines()), and finally emitted via STDOUT (cat()).
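To make that flow concrete, here is a minimal sketch that simulates the three stages (map, sort, reduce) inside a single R session, using a plain word count. The function names map_line and reduce_sorted are hypothetical helpers invented for this illustration, and the sort() call only stands in for the key-based sorting Hadoop does between the two phases. A real streaming reducer would read its input line by line from STDIN rather than receive a whole vector.

```r
# Hypothetical helpers for illustration only; not part of Hadoop or any package.

# Mapper: one input line in, zero or more "key<TAB>value" lines out.
map_line <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))
  paste(words, 1L, sep = "\t")
}

# Reducer: receives lines sorted by key and sums the values per key.
reduce_sorted <- function(lines) {
  if (length(lines) == 0) return(character(0))
  parts <- strsplit(lines, "\t", fixed = TRUE)
  keys <- vapply(parts, `[`, character(1), 1)
  vals <- as.integer(vapply(parts, `[`, character(1), 2))
  totals <- tapply(vals, keys, sum)
  paste(names(totals), totals, sep = "\t")
}

input <- c("the quick brown fox", "the lazy dog", "The fox")

mapped   <- unlist(lapply(input, map_line))  # what the mapper would write to STDOUT
shuffled <- sort(mapped)                     # stand-in for Hadoop's sort by key
result   <- reduce_sorted(shuffled)          # what the reducer would write to STDOUT

cat(result, sep = "\n")
```

The only contract between the stages is text: tab-separated key/value lines, sorted by key before they reach the reducer.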

The following code is taken from an article I wrote on writing a map reduce job with R for Hadoop. The code is meant to count 2-grams, but I'd say it is simple enough to see what is going on MapReduce-wise.

# map.R

library(stringdist, quietly=TRUE)

input <- file("stdin", "r")

while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
   # skip empty lines instead of aborting the rest of the input
   # more sophisticated defensive code makes sense here
   if(nchar(line) == 0) next

   fields <- unlist(strsplit(line, "\t"))

   # extract 2-grams from the text field (column 4), ignoring case
   d <- qgrams(tolower(fields[4]), q=2)

   # seq_len() avoids looping over 1:0 when no 2-gram was found
   for(i in seq_len(ncol(d))) {
     # language / 2-gram / count
     # sep="" keeps cat() from padding the tabs with spaces
     cat(fields[2], "\t", colnames(d)[i], "\t", d[1,i], "\n", sep="")
   }
}

close(input)
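The mapper above treats d as a matrix whose column names are the 2-grams and whose first row holds their counts. To see what such counts look like without installing stringdist, here is a rough base R sketch. count_2grams is a hypothetical helper written for this illustration; it is not the stringdist implementation and returns a named integer vector rather than a matrix.

```r
# Hypothetical helper in base R only; not the stringdist implementation.
count_2grams <- function(s) {
  s <- tolower(s)
  n <- nchar(s)
  # strings shorter than two characters contain no 2-gram
  if (n < 2) return(integer(0))
  # slide a window of width 2 over the string
  grams <- substring(s, 1:(n - 1), 2:n)
  counts <- table(grams)
  setNames(as.integer(counts), names(counts))
}

counts <- count_2grams("Banana")
print(counts)
```

For "Banana" the windows are ba, an, na, an, na, so "an" and "na" each occur twice and "ba" once. Each such 2-gram and its count becomes one output line of the mapper.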

---

# reduce.R

input <- file("stdin", "r")

# initialize variables that keep
# track of the state

is_first_line <- TRUE

while(length(line <- readLines(input, n=1, warn=FALSE)) > 0) {
   line <- unlist(strsplit(line, "\t"))
   # current line belongs to previous
   # line's key pair
   if(!is_first_line &&
      prev_lang == line[1] &&
      prev_2gram == line[2]) {
        total <- total + as.integer(line[3])
   }
   # current line belongs either to a
   # new key pair or is first line
   else {
     # new key pair - so output the last
     # key pair's result
     if(!is_first_line) {
       # language / 2-gram / count
       # sep="" keeps cat() from padding the tabs with spaces
       cat(prev_lang, "\t", prev_2gram, "\t", total, "\n", sep="")
     }
     # initialize state trackers
     # (named total rather than sum to avoid masking base::sum)
     prev_lang <- line[1]
     prev_2gram <- line[2]
     total <- as.integer(line[3])
     is_first_line <- FALSE
   }
}

# the final record (only if there was any input at all)
if(!is_first_line) {
  cat(prev_lang, "\t", prev_2gram, "\t", total, "\n", sep="")
}

close(input)

http://www.joyofdata.de/blog/mapreduce-r-hadoop-amazon-emr/
