
Use external Native Libraries (.so) and external jar with Hadoop MapReduce

I'm new to the Hadoop/Java world, so please be gentle and feel free to correct egregious errors. I'm trying to use native libraries compiled on my Ubuntu machine, which runs Hadoop locally (standalone mode). I'm also trying to use an external .jar in addition to the .jar I compiled myself. I tried making a fat jar, unsuccessfully, and decided instead to pass the external jar and native libraries to Hadoop via the command line. The libraries are used in a custom record reader I created. I am able to run MapReduce jobs without external libraries via the hadoop command, and I can also run this program in Eclipse when I set the LD_LIBRARY_PATH environment variable. I'm unsure which variables need to be set to run this job successfully in Hadoop, so please tell me if any are necessary; I have already tried setting $HADOOP_CLASSPATH.

i.e.:

./bin/hadoop jar ~/myjar/cdf-11-16.jar CdfInputDriver -libjars cdfjava.jar -files libcdf.so,libcdfNativeLibrary.so input output
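For context, the custom record reader goes through the JNI wrapper jar (cdfjava.jar), which in turn loads the .so files through the JVM's native-library mechanism. A rough sketch of that general pattern (hypothetical class name, not the actual CDF wrapper code) is:

// Hypothetical illustration of how a JNI wrapper jar typically links to its .so:
// the class only loads if the shared object can be found on java.library.path,
// which is what LD_LIBRARY_PATH (or -Djava.library.path) feeds into.
public class NativeWrapperSketch {
    static {
        // Throws UnsatisfiedLinkError if libcdfNativeLibrary.so is not visible
        // to the JVM that loads this class.
        System.loadLibrary("cdfNativeLibrary");
    }

    public static void main(String[] args) {
        System.out.println("java.library.path = " + System.getProperty("java.library.path"));
    }
}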

I've tried accessing the jar and .so files from my local filesystem, and also copying them to HDFS.

I get the following error from the job:

Exception in thread "main" java.lang.NoClassDefFoundError: gsfc/nssdc/cdf/CDFConstants
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:274)
    at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:1844)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1809)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1903)
    at org.apache.hadoop.mapreduce.task.JobContextImpl.getInputFormatClass(JobContextImpl.java:174)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:490)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:510)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
    at CdfInputDriver.run(CdfInputDriver.java:45)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at CdfInputDriver.main(CdfInputDriver.java:50)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.ClassNotFoundException: gsfc.nssdc.cdf.CDFConstants
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 36 more

I've tried checking whether the files were loaded into the distributed cache with the following code, and "cache files:" prints as null:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CdfInputDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

        Job job = Job.getInstance(getConf());

        // Print what -files registered on the distributed cache; null means nothing was added.
        System.out.println("cache files:" + getConf().get("mapreduce.job.cache.files"));
        Path[] uris = job.getLocalCacheFiles();
        if (uris != null) {
            for (Path uri : uris) {
                System.out.println(uri.toString());
                System.out.println(uri.getName());
            }
        }

        job.setJarByClass(getClass());

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        job.setInputFormatClass(CdfInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(CdfMapper.class);
        //job.setReducerClass(WordCount.IntSumReducer.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new CdfInputDriver(), args);
        System.exit(exitCode);
    }
}
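As an aside, the same dependencies that -libjars and -files pass on the command line can also be attached programmatically in the driver. A minimal sketch, assuming the jar and .so files have already been copied to hypothetical HDFS paths under /libs (which is not something the command above does):

import java.net.URI;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetupSketch {
    // Would be called from the driver's run() before job submission.
    static void attachDependencies(Job job) throws Exception {
        // Put the third-party jar on the task classpath (roughly what -libjars does).
        job.addFileToClassPath(new Path("/libs/cdfjava.jar"));
        // Ship the native libraries to each task and symlink them into its
        // working directory under the name after '#' (roughly what -files does).
        job.addCacheFile(new URI("/libs/libcdf.so#libcdf.so"));
        job.addCacheFile(new URI("/libs/libcdfNativeLibrary.so#libcdfNativeLibrary.so"));
    }
}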

Also, I am only testing this locally; the job will eventually run on Amazon EMR. Would storing the .so and .jar files on S3 and using a similar method theoretically work?

Appreciate any help!

I figured this out, for anyone else having issues with this. For my scenario, it was a combination of issues.

./bin/hadoop jar ~/myjar/cdf-11-16.jar CdfInputDriver -libjars cdfjava.jar -files libcdf.so,libcdfNativeLibrary.so input output

A couple of things had thrown me for a loop. Here are some things I checked. If anyone has factual information on why these made it work, it would be appreciated.

(For the Linux newbie) If you are running hadoop using sudo, make sure to include -E so your environment variables are passed through.

Make sure the third-party .jar library is located on your master node. (This seemed to be necessary, but I haven't confirmed it with documentation... maybe my environment variables were incorrect otherwise.)

I was able to run this using Amazon EMR. I uploaded the .so files and .jars to S3, ssh'd into the master node of the cluster, installed s3cmd via http://blog.adaptovate.com/2013/06/installing-s3cmd-on-ec2-so-that-yum.html, copied cdf-11-16.jar (the MapReduce jar) and cdfjava.jar (the third-party jar) to the master node with s3cmd get, and ran the job. I was able to reference the .so files on S3.
