
Installing Hadoop and writing a MapReduce program

For one of my subjects, I have been given this assignment: build a Hadoop cluster and write MapReduce programs.

I have a laptop with 4GB of RAM and an i3 processor. I downloaded the VMware image from the Cloudera website, but the preconfigured virtual machine itself needs 4GB of RAM.

The assignment text says:

"install the Hadoop Distribution of Cloudera ( http://www.cloudera.com/hadoop/ ) in Pseudo-Distributed Mode, or use the VMware image provided by Cloudera, to familiarize yourself with Hadoop, especially with the distributed file system HDFS and the implementation of MapReduce programs in Java."


I tried reducing the virtual machine's memory from 4GB to 1GB, but that did not work; I mean I could no longer run the Cloudera virtual machine at all.

The assignment asks me to write a lot of MapReduce and Java programs, and I am not able to understand any of them. For example:

  • doing a "grep" on multiple machines
  • counting word frequency on files spread across multiple machines in a Hadoop cluster, etc.

I want to know how to set up Hadoop on a Windows 8.1 machine so that I can run these programs.

The Cloudera VM requires 6-8GB of RAM to run correctly.

When I took a Hadoop course at university, we were required to buy extra RAM for every computer with less than 8GB, and even on our i5 machines the VM was still really slow.

Even just installing Hadoop and running the services alone, outside of a VM, requires a minimum of 4GB by default. That does not include your OS and other services (your browser and OS are probably already taking about 1GB each by themselves).


As for actually installing Hadoop on Windows, I wouldn't recommend it, but the rough steps are:

  1. Install Java. Add JAVA_HOME as an environment variable.
  2. Install and run an SSH server on your Windows machine. Make sure you can connect to localhost:22 (using PuTTY, for example).
  3. Download and configure Hadoop using the Apache site, not random tutorials elsewhere that could be out of date. Start with the Single Node setup, then configure Pseudo-Distributed mode. As soon as you extract the Hadoop download, add HADOOP_PREFIX and HADOOP_CONF_DIR=%HADOOP_PREFIX%/conf as two environment variables.
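For step 3, the pseudo-distributed setup boils down to two small XML edits. The values below follow the Apache single-node guide (port 9000 and a replication factor of 1 are its defaults); note that paths and property names are from Hadoop 2+, while older releases used a conf/ directory and fs.default.name, so check the guide for your version:

```xml
<!-- etc/hadoop/core-site.xml: point the default filesystem at a local HDFS -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml: single node, so keep one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```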

  • doing a "grep" on multiple machines
  • counting word frequency on files

Both of these are examples given in the documentation; I'm not sure you are required to actually write that code yourself.
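For orientation, here is a minimal, Hadoop-free sketch of the word-count logic that the documentation example implements. The class and method names are illustrative, not part of the Hadoop API; it only shows the map/shuffle/reduce phases in plain Java:

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {

    // "Map" phase: split each input line into (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // "Shuffle" + "reduce" phases: group the pairs by word and sum the 1s.
    static Map<String, Integer> countWords(List<String> lines) {
        return lines.stream()
                    .flatMap(WordCountSketch::map)
                    .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        System.out.println(countWords(List.of("hello world", "hello hadoop")));
    }
}
```

The real Hadoop version spreads exactly these two steps across a Mapper and a Reducer class, with the framework doing the shuffle between them.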


FWIW, you don't actually need a running Hadoop cluster to run MapReduce. The default Hadoop configuration reads from your single, local filesystem. Besides, your VM is a single machine anyway, so the requirement of "running on multiple machines" doesn't make much sense.
