
Use MongoDB as I/O for a Hadoop MapReduce job

I have been desperately trying to run the EnronMail example from the mongo-hadoop connector ( https://github.com/mongodb/mongo-hadoop/wiki/Enron-Emails-Example ) without success. I get this error:

15/11/18 11:56:23 INFO util.MongoTool: Created a conf: 'Configuration: core-default.xml, core-site.xml, mongo_enron.xml, mapred-default.xml, mapred-site.xml, hdfs-default.xml, hdfs-site.xml' on {class com.mongodb.hadoop.examples.enron.EnronMail} as job named 'EnronMail'
15/11/18 11:56:23 INFO util.MongoTool: Setting up and running MapReduce job in foreground, will wait for results.  {Verbose? true}
15/11/18 11:56:23 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
15/11/18 11:56:23 INFO mapred.JobClient: Cleaning up the staging area hdfs://MASTER1:8020/tmp/hadoop-mapred/mapred/staging/user/.staging/job_201511020757_0042
15/11/18 11:56:23 ERROR security.UserGroupInformation: PriviledgedActionException as:user (auth:SIMPLE) cause:java.io.IOException: No FileSystem for scheme: mongodb
15/11/18 11:56:23 ERROR util.MongoTool: Exception while executing job...
java.io.IOException: No FileSystem for scheme: mongodb
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2296)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2303)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:87)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2342)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2324)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:351)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:210)
        at com.mongodb.hadoop.BSONFileInputFormat.getSplits(BSONFileInputFormat.java:79)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1079)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1096)
        at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:177)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:995)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:948)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:948)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:566)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596)
        at com.mongodb.hadoop.util.MongoTool.runMapReduceJob(MongoTool.java:230)
        at com.mongodb.hadoop.util.MongoTool.run(MongoTool.java:100)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at com.mongodb.hadoop.examples.enron.EnronMail.main(EnronMail.java:197)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

after running this command in the Hadoop shell:

 hadoop jar /home/user/Pruebas/jars/bigdata-0.0.3-SNAPSHOT.jar com.mongodb.hadoop.examples.enron.EnronMail -Dmongo.input.split_size=8 -Dmongo.job.verbose=true -Dmongo.input.uri=mongodb://192.168.1.187:27017/mongoHadoopConnector.messages -Dmongo.output.uri=mongodb://192.168.1.187:27017/mongoHadoopConnector.message_pairs

Notes: the MongoDB server process is running on my machine (192.168.1.187) and is accessible from other machines on the LAN. There is data in the collection. I have tried several versions of the dependencies. My versions:

  • hadoop: Hadoop 2.0.0-cdh4.5.0

  • mongo: 3.0.7

Here is the POM of my maven project:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.company.test</groupId>
    <artifactId>bigdata-light</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>bigdata-light</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>


        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.0.0-cdh4.5.0</version>
        </dependency>

        <dependency>
            <groupId>org.mongodb.mongo-hadoop</groupId>
            <artifactId>mongo-hadoop-core</artifactId>
            <version>1.4.2</version>
        </dependency>

        <dependency>
            <groupId>org.mongodb</groupId>
            <artifactId>mongo-java-driver</artifactId>
            <version>3.0.3</version>
        </dependency>
    </dependencies>
    <build>
        <finalName>bigdata-0.0.3-SNAPSHOT</finalName>
        <plugins>
            <plugin>
                <artifactId>maven-antrun-plugin</artifactId>
                <version>1.7</version>
                <dependencies>
                    <dependency>
                        <groupId>org.apache.ant</groupId>
                        <artifactId>ant-jsch</artifactId>
                        <version>1.9.2</version>
                    </dependency>
                </dependencies>
                <executions>
                    <execution>
                        <phase>install</phase>
                        <configuration>
                            <target>
                                <ant antfile="${basedir}\build.xml">
                                    <target name="upload" />
                                </ant>
                            </target>
                        </configuration>
                        <goals>
                            <goal>run</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>

Any help would be really appreciated =). I have been stuck for several days... :$

I've found a solution; I'm posting it here to help people who might run into the same problem.
To read from a MongoDB collection, use

MapredMongoConfigUtil.setInputFormat(getConf(), com.mongodb.hadoop.mapred.MongoInputFormat.class);

instead of

MapredMongoConfigUtil.setInputFormat(getConf(), com.mongodb.hadoop.mapred.BSONFileInputFormat.class);

(the alternative used to read directly from the .bson files produced by mongodump) in the MapReduce configuration class ( EnronMail.java ). BSONFileInputFormat interprets the input URI as a path to dump files on a Hadoop filesystem, so handing it a mongodb:// URI is what produces the "No FileSystem for scheme: mongodb" error.
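For reference, here is a minimal sketch of where this change goes. Only the MapredMongoConfigUtil.setInputFormat call is taken from the fix above; the surrounding class structure is a stripped-down assumption based on the EnronMail example (which extends com.mongodb.hadoop.util.MongoTool), so adapt it to your actual job class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

import com.mongodb.hadoop.mapred.MongoInputFormat;
import com.mongodb.hadoop.util.MapredMongoConfigUtil;
import com.mongodb.hadoop.util.MongoTool;

public class EnronMail extends MongoTool {
    public EnronMail() {
        Configuration conf = new Configuration();
        setConf(conf);

        // Read from the live MongoDB collection given by -Dmongo.input.uri.
        // Using com.mongodb.hadoop.mapred.BSONFileInputFormat here instead
        // would make Hadoop look for .bson dump files at that URI and fail
        // with "No FileSystem for scheme: mongodb".
        MapredMongoConfigUtil.setInputFormat(conf, MongoInputFormat.class);

        // ... set the mapper, reducer, output format and key/value classes
        // exactly as in the original EnronMail example ...
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new EnronMail(), args));
    }
}

With this in place, the same hadoop jar command from the question (with -Dmongo.input.uri and -Dmongo.output.uri pointing at the MongoDB collections) submits and runs the job normally.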

