
Hadoop Hive UDF with external library

I'm trying to write a UDF for Hadoop Hive that parses user agents. The following code works fine on my local machine, but on Hadoop I get:

org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String MyUDF.evaluate(java.lang.String) throws org.apache.hadoop.hive.ql.metadata.HiveException on object MyUDF@64ca8bfb of class MyUDF with arguments {All Occupations:java.lang.String} of size 1

Code:

import org.apache.hadoop.hive.ql.exec.UDF;

import com.decibel.uasparser.OnlineUpdater;
import com.decibel.uasparser.UASparser;
import com.decibel.uasparser.UserAgentInfo;

public class MyUDF extends UDF {

    public String evaluate(String i) {
        UASparser parser = new UASparser();
        String key = "";
        OnlineUpdater update = new OnlineUpdater(parser, key);
        UserAgentInfo info = parser.parse(i);
        return info.getDeviceType();
    }
}

Some facts I should mention:

  • I'm compiling in Eclipse with "Export runnable JAR file" and the "Extract required libraries into generated JAR" option

  • I'm uploading this "fat jar" with Hue

  • The minimal working example I managed to run:

    public String evaluate(String i) { return "hello" + i; }

  • I guess the problem lies somewhere in that library I'm using (downloaded from https://udger.com ), but I have no idea where.

Any suggestions?

Thanks, Michal

It could be a few things. The best thing is to check the logs, but here's a list of quick things you can check in a minute.

  1. The jar does not contain all dependencies. I'm not sure how Eclipse builds a runnable jar, but it may not include all dependencies. You can run

    jar tf your-udf-jar.jar

to see what was included. You should see classes from com.decibel.uasparser . If not, you have to build the jar with the appropriate dependencies (usually you do that with Maven).

  2. A different JVM version. If you compile with JDK 8 and the cluster runs JDK 7, it will also fail.

  3. Hive version. Sometimes the Hive APIs change slightly, enough to be incompatible. That's probably not the case here, but make sure to compile the UDF against the same versions of Hadoop and Hive that you have on the cluster.

  4. You should always check whether info is null after the call to parse() .

  5. It looks like the library uses a key, meaning it actually gets data from an online service (udger.com), so it may not work without a real key. Even more important, the library updates itself online, contacting the online service for each record . Looking at the code, this means it will create one update thread per record . You should change the code to do that only once, in the constructor.

Here's how to change it:

public class MyUDF extends UDF {
    UASparser parser = new UASparser();

    public MyUDF() {
        super();
        String key = "PUT YOUR KEY HERE";
        // update only once, when the UDF is instantiated
        OnlineUpdater update = new OnlineUpdater(parser, key);
    }

    public String evaluate(String i) {
        UserAgentInfo info = parser.parse(i);
        // return null if the string is unparseable;
        // otherwise one bad record will stop your processing
        // with an exception
        if (info != null) return info.getDeviceType();
        else return null;
    }
}
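The difference matters because Hive constructs the UDF object once per task but calls evaluate() once per record. Here is a self-contained sketch of the same init-once pattern, with hypothetical class names (this is not the udger API, just an illustration of the design choice):

```java
// Hypothetical sketch: expensive setup runs once in the constructor,
// while per-record calls reuse the cached resource.
class InitOncePattern {
    static int initCount = 0;      // counts how many times setup ran
    private final String resource;

    InitOncePattern() {
        initCount++;               // stands in for OnlineUpdater's network call
        resource = "parser-ready"; // cached result of the expensive setup
    }

    String evaluate(String input) {
        // per-record work: no re-initialization, just reuse of the resource
        return resource + ":" + input;
    }
}
```

However many records evaluate() processes, the expensive setup runs exactly once per instance, which is what moving OnlineUpdater into the constructor buys you.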

But to know for sure, you have to look at the logs: the YARN logs, but also the Hive logs on the machine you're submitting the job from (probably in /var/log/hive , but it depends on your installation).

Such a problem can probably be solved with the following steps:

  1. Override the method UDF.getRequiredJars() , making it return an HDFS file path list whose values are determined by where you put the xxx_lib folder (from the next step) in HDFS. Note that the list must contain each jar's full HDFS path string, such as hdfs://yourcluster/some_path/xxx_lib/some.jar

  2. Export your UDF code using the "Runnable JAR file" exporting wizard (choose "Copy required libraries into a sub-folder next to the generated JAR"). This step results in an xxx.jar and a lib folder xxx_lib next to xxx.jar

  3. Put xxx.jar and the folder xxx_lib into your HDFS filesystem according to your code in step 1.

  4. Create the UDF using: add jar ${the-xxx.jar-hdfs-path}; create function your-function as '${qualified name of udf class}';

Try it. I tested this and it works.
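The full-path requirement in step 1 can be sketched with a small helper that builds the kind of list an overridden getRequiredJars() is expected to return. The directory and jar names here are made-up placeholders for your cluster's layout:

```java
// Hypothetical helper: assembles full hdfs:// paths for dependency jars.
// Directory and jar names are placeholders, not real cluster paths.
class RequiredJarsSketch {
    static String[] requiredJars(String hdfsLibDir, String... jarNames) {
        String[] paths = new String[jarNames.length];
        for (int i = 0; i < jarNames.length; i++) {
            // each entry must be the jar's complete HDFS path, not a bare file name
            paths[i] = hdfsLibDir + "/" + jarNames[i];
        }
        return paths;
    }
}
```

An overridden getRequiredJars() would simply return such an array for every jar in the xxx_lib folder.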
