
Use MongoDB as I/O for a Hadoop map-reduce job

I have been trying desperately to run the EnronMail example of the mongo-hadoop connector ( https://github.com/mongodb/mongo-hadoop/wiki/Enron-Emails-Example ), without success. I get this error:

15/11/18 11:56:23 INFO util.MongoTool: Created a conf: 'Configuration: core-default.xml, core-site.xml, mongo_enron.xml, mapred-default.xml, mapred-site.xml, hdfs-default.xml, hdfs-site.xml' on {class com.mongodb.hadoop.examples.enron.EnronMail} as job named 'EnronMail'
15/11/18 11:56:23 INFO util.MongoTool: Setting up and running MapReduce job in foreground, will wait for results.  {Verbose? true}
15/11/18 11:56:23 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/11/18 11:56:23 INFO mapred.JobClient: Cleaning up the staging area hdfs://MASTER1:8020/tmp/hadoop-mapred/mapred/staging/user/.staging/job_201511020757_0042
15/11/18 11:56:23 ERROR security.UserGroupInformation: PriviledgedActionException as:user (auth:SIMPLE) cause:java.io.IOException: No FileSystem for scheme: mongodb
15/11/18 11:56:23 ERROR util.MongoTool: Exception while executing job...
java.io.IOException: No FileSystem for scheme: mongodb
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2296)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2303)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:87)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2342)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2324)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:351)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:210)
        at com.mongodb.hadoop.BSONFileInputFormat.getSplits(BSONFileInputFormat.java:79)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1079)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1096)
        at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:177)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:995)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:948)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:948)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:566)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596)
        at com.mongodb.hadoop.util.MongoTool.runMapReduceJob(MongoTool.java:230)
        at com.mongodb.hadoop.util.MongoTool.run(MongoTool.java:100)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at com.mongodb.hadoop.examples.enron.EnronMail.main(EnronMail.java:197)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

after running the following command in the Hadoop shell:

 hadoop jar /home/user/Pruebas/jars/bigdata-0.0.3-SNAPSHOT.jar com.mongodb.hadoop.examples.enron.EnronMail -Dmongo.input.split_size=8 -Dmongo.job.verbose=true -Dmongo.input.uri=mongodb://192.168.1.187:27017/mongoHadoopConnector.messages -Dmongo.output.uri=mongodb://192.168.1.187:27017/mongoHadoopConnector.message_pairs
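For reference, the values passed to `-Dmongo.input.uri` and `-Dmongo.output.uri` both have the shape `mongodb://host:port/database.collection`. The snippet below is only a sketch of how such a value splits into its parts, using nothing but `java.net.URI`. The class `MongoUriParts` and its `parse` method are hypothetical helpers, not part of mongo-hadoop or the MongoDB Java driver, and the sketch assumes a single host with no credentials or query options.

```java
import java.net.URI;

// Hypothetical helper (not part of mongo-hadoop or the MongoDB driver):
// splits a simple mongodb://host:port/database.collection URI into its parts.
class MongoUriParts {
    final String host;
    final int port;
    final String database;
    final String collection;

    private MongoUriParts(String host, int port, String database, String collection) {
        this.host = host;
        this.port = port;
        this.database = database;
        this.collection = collection;
    }

    static MongoUriParts parse(String uri) {
        URI parsed = URI.create(uri);
        if (!"mongodb".equals(parsed.getScheme())) {
            throw new IllegalArgumentException("Not a mongodb URI: " + uri);
        }
        // The path looks like "/database.collection"; drop the leading slash.
        String path = parsed.getPath();
        String namespace = (path == null) ? "" : path.replaceFirst("^/", "");
        // Split on the first dot: database name before it, collection name after it.
        int dot = namespace.indexOf('.');
        if (dot <= 0 || dot == namespace.length() - 1) {
            throw new IllegalArgumentException("Expected database.collection in: " + uri);
        }
        // Fall back to MongoDB's default port when none is given.
        int port = (parsed.getPort() == -1) ? 27017 : parsed.getPort();
        return new MongoUriParts(parsed.getHost(), port,
                namespace.substring(0, dot), namespace.substring(dot + 1));
    }
}
```

Calling `MongoUriParts.parse` on the input URI from the command above yields the host `192.168.1.187`, port `27017`, database `mongoHadoopConnector`, and collection `messages`.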

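Note: I started the mongo server process on my machine (192.168.1.187), and the other machines on the LAN can reach it. The collection contains data. I have tried several versions of the dependencies. My versions: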

  • hadoop: Hadoop 2.0.0-cdh4.5.0

  • mongo: 3.0.7

This is the POM of my Maven project:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.company.test</groupId>
    <artifactId>bigdata-light</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>bigdata-light</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>


        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.0.0-cdh4.5.0</version>
        </dependency>

        <dependency>
            <groupId>org.mongodb.mongo-hadoop</groupId>
            <artifactId>mongo-hadoop-core</artifactId>
            <version>1.4.2</version>
        </dependency>

        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongo-java-driver</artifactId>
            <version>3.0.3</version>
        </dependency>
    </dependencies>
    <build>
        <finalName>bigdata-0.0.3-SNAPSHOT</finalName>
        <plugins>
            <plugin>
                <artifactId>maven-antrun-plugin</artifactId>
                <version>1.7</version>
                <dependencies>
                    <dependency>
                        <groupId>org.apache.ant</groupId>
                        <artifactId>ant-jsch</artifactId>
                        <version>1.9.2</version>
                    </dependency>
                </dependencies>
                <executions>
                    <execution>
                        <phase>install</phase>
                        <configuration>
                            <target>
                                <ant antfile="${basedir}\build.xml">
                                    <target name="upload" />
                                </ant>
                            </target>
                        </configuration>
                        <goals>
                            <goal>run</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>

Please, any help would be much appreciated =). I have been stuck on this for days... :$

I found a solution that may help others who run into the same problem.
To read from a MongoDB collection, use

MapredMongoConfigUtil.setInputFormat(getConf(), com.mongodb.hadoop.mapred.MongoInputFormat.class);

instead of

MapredMongoConfigUtil.setInputFormat(getConf(), com.mongodb.hadoop.mapred.BSONFileInputFormat.class);

(which reads directly from the .bson files produced by mongodump rather than from a live collection) in the map-reduce configuration class ( EnronMail.java ).
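The stack trace is consistent with this: BSONFileInputFormat.getSplits calls FileInputFormat.listStatus, which treats the input as a file path and asks Hadoop for a FileSystem matching the URI scheme, so a mongodb:// URI ends up in a file system lookup. The snippet below is only a simplified, stdlib-only model of that kind of scheme lookup, not Hadoop's actual code. The class `SchemeLookupSketch`, the method `resolveFileSystem`, the set of registered schemes, and the HDFS path are all hypothetical stand-ins.

```java
import java.io.IOException;
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Simplified model of a scheme-based file system lookup (NOT Hadoop's real code).
class SchemeLookupSketch {

    // Assumption: only these schemes have a file system implementation registered.
    private static final Set<String> REGISTERED_SCHEMES =
            new HashSet<String>(Arrays.asList("hdfs", "file"));

    // Returns the scheme when a file system is registered for it; otherwise
    // fails with the same message that appears in the trace.
    static String resolveFileSystem(String location) throws IOException {
        String scheme = URI.create(location).getScheme();
        if (scheme == null || !REGISTERED_SCHEMES.contains(scheme)) {
            throw new IOException("No FileSystem for scheme: " + scheme);
        }
        return scheme;
    }

    public static void main(String[] args) {
        String[] inputs = {
            // Hypothetical location of a mongodump output file on HDFS.
            "hdfs://MASTER1:8020/tmp/dump/messages.bson",
            // The input URI from the question.
            "mongodb://192.168.1.187:27017/mongoHadoopConnector.messages"
        };
        for (String input : inputs) {
            try {
                System.out.println(input + " -> " + resolveFileSystem(input));
            } catch (IOException e) {
                System.out.println(input + " -> " + e.getMessage());
            }
        }
    }
}
```

In this model a file-based input format can only work with locations such as hdfs:// or file://, which is why a job that reads from a live collection needs MongoInputFormat, while BSONFileInputFormat fits a .bson dump stored on a file system.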
