简体   繁体   中英

Hive UDF in Java fails when creating a table

What is the difference between those two queries:

SELECT my_fun(col_name) FROM my_table;

and

CREATE TABLE new_table AS SELECT my_fun(col_name) FROM my_table;

Where my_fun is a java UDF.

I'm asking, because when I create new table (second query) I receive a java error.

Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: Map operator initialization failed
...
Caused by: org.apache.hadoop.hive.ql.exec.UDFArgumentException: Unable to instantiate UDF implementation class com.company_name.examples.ExampleUDF: java.lang.NullPointerException

I found that the source of error is line in my java file:

encoded = Files.readAllBytes(Paths.get(configPath));

But the question is why it works when table is not created and fails if table is created?

The problem might be with the way you read the file. Try to pass the file path as the second argument in the UDF, then read as follows

private BufferedReader getReaderFor(String filePath) throws HiveException {
    try {
        Path fullFilePath = FileSystems.getDefault().getPath(filePath);
        Path fileName = fullFilePath.getFileName();
        if (Files.exists(fileName)) {
            return Files.newBufferedReader(fileName, Charset.defaultCharset());
        }
        else
        if (Files.exists(fullFilePath)) {
            return Files.newBufferedReader(fullFilePath, Charset.defaultCharset());
        }
        else {
            throw new HiveException("Could not find \"" + fileName + "\" or \"" + fullFilePath + "\" in inersect_file() UDF.");
        }
    }
    catch(IOException exception) {
        throw new HiveException(exception);
    }
}

private void loadFromFile(String filePath) throws HiveException {
    set = new HashSet<String>();

    try (BufferedReader reader = getReaderFor(filePath)) {
        String line;
        while((line = reader.readLine()) != null) {
            set.add(line);
        }
    } catch (IOException e) {
        throw new HiveException(e);
    }
}

The full code for different generic UDF that utilizes file reader can be found here

I think there are several points unclear, so this answer is based on assumptions.

First of all, it is important to understand that hive currently optimize several simple queries and depending on the size of your data, the query that is working for you SELECT my_fun(col_name) FROM my_table; is most likely running locally from the client where you are executing the job, that is why you UDF can access your config file locally available, this "execution mode" is because the size of your data. CTAS trigger a job independent on the input data, this job runs distributed in the cluster where each worker fail accessing your config file.

It looks like you are trying to read your configuration file from the local file system, not from the HDSFS Files.readAllBytes(Paths.get(configPath)) , this means that your configuration has to either be replicated in all the worker nodes or be added previously to the distributed cache (you can use add file from this, doc here . You can find another questions here about accessing files from the distributed cache from UDFs.

One additional problem is that you are passing the location of your config file through an environment variable which is not propagated to worker nodes as part of your hive job. You should pass this configuration as a hive config, there is an answer for accessing Hive Config from UDF here assuming that you are extending GenericUDF.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM