
Java UDF Date Regex Extractor for Pig?

I am trying to write a Java UDF for Pig that matches a regex pattern against a date string. The regex has been tested and works as expected, but I am having trouble with the following code:

package com.date.format;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class DATERANGE extends EvalFunc<String> {

    @Override
    public String exec(Tuple arg0) throws IOException {
        try
        {
            String pattern = "(Oct\\W(?:1[5-9]|2[0-3])\\W(?:(?:0?9|10):\\d{2}:\\d{2}|11:00:00))";
            Pattern pat = Pattern.compile(pattern);
            Matcher match = pat.matcher((String) arg0.get(0));
            if (match.find())
            {
                return match.group(0);
            }
            else return "none";
        }
        catch (Exception e)
        {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}

After compiling the above Java code, exporting it as a jar, and running it in Hadoop with the following Pig script:

register 'DATEFormat.jar';

ld = LOAD 'dates/date_data_three' AS (date:chararray);
loop = foreach ld generate com.date.format.DATERANGE(date) as d:chararray;
dump loop;

I get the following error:

ERROR 2078: Caught error from UDF: com.date.format.DATERANGE [Caught exception processing input row ]
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias loop
at org.apache.pig.PigServer.openIterator(PigServer.java:912)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:752)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.loadScript(GruntParser.java:566)
at org.apache.pig.tools.grunt.GruntParser.processScript(GruntParser.java:513)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.Script(PigScriptParser.java:1014)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:550)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:228)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:203)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
at org.apache.pig.Main.run(Main.java:542)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.pig.PigException: ERROR 1002: Unable to store alias loop
at org.apache.pig.PigServer.storeEx(PigServer.java:1015)
at org.apache.pig.PigServer.store(PigServer.java:974)
at org.apache.pig.PigServer.openIterator(PigServer.java:887)
... 16 more

The data file contains dates as shown below:

Wed Oct 15 09:26:09 BST 2014
Wed Oct 15 19:26:09 BST 2014
Wed Oct 18 08:26:09 BST 2014
Wed Oct 23 10:26:09 BST 2014
Sun Oct 05 09:26:09 BST 2014
Wed Nov 20 19:26:09 BST 2014

Does anybody know the correct way to implement a Java UDF for Pig that works with the regex I have provided?

Thanks

I recommend using the built-in REGEX_EXTRACT function; it is much easier than writing a UDF.

ld = LOAD 'input.txt' AS (date:chararray);
loop = foreach ld generate REGEX_EXTRACT(date,'(Oct\\W(?:1[5-9]|2[0-3])\\W(?:(?:0?9|10):\\d{2}:\\d{2}|11:00:00))',1) as d:chararray;
C = FILTER loop by d is not null;
D = FOREACH C GENERATE $0;
DUMP D;

Output:

(Oct 15 09:26:09)
(Oct 23 10:26:09)

Your regex UDF also works fine for me. I copied your input and Java code and ran them locally, and it works perfectly. Below is the output I got from your UDF code. You may need to check whether your classpath is set properly.

(Oct 15 09:26:09)
(none)
(none)
(Oct 23 10:26:09)
(none)
(none) 
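
To rule out the regex itself, the same pattern can be exercised outside Pig. The following is only a minimal sketch, not the original UDF: the class name DateRangeCheck and the method extract are hypothetical, and the null check reflects an assumption that a blank or null field in the input could be one way to trigger the exception, since creating a Matcher on a null string throws a NullPointerException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone harness; it does not depend on Pig.
class DateRangeCheck {

    // Same regex as in the question, compiled once instead of once per row.
    private static final Pattern DATE_RANGE = Pattern.compile(
            "(Oct\\W(?:1[5-9]|2[0-3])\\W(?:(?:0?9|10):\\d{2}:\\d{2}|11:00:00))");

    // Returns the matched text, "none" when nothing matches, or null for null input.
    static String extract(String input) {
        if (input == null) {
            return null; // assumption: pass nulls through instead of throwing
        }
        Matcher m = DATE_RANGE.matcher(input);
        return m.find() ? m.group(0) : "none";
    }

    public static void main(String[] args) {
        // Sample rows copied from the question's data file.
        String[] rows = {
            "Wed Oct 15 09:26:09 BST 2014",
            "Wed Oct 15 19:26:09 BST 2014",
            "Wed Oct 18 08:26:09 BST 2014",
            "Wed Oct 23 10:26:09 BST 2014",
            "Sun Oct 05 09:26:09 BST 2014",
            "Wed Nov 20 19:26:09 BST 2014"
        };
        for (String row : rows) {
            System.out.println("(" + extract(row) + ")");
        }
    }
}
```

If this prints the same six results as the UDF output above, the pattern is not the problem. Inside the UDF, the corresponding guard would be to return null when the tuple is null or empty, or when its first field is null, before building the Matcher.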

Even better, you could use ToDate:

Load your data into filtered_raw_financings_csvs with close_date as a chararray:

financings_csvs = FOREACH filtered_raw_financings_csvs
                  GENERATE name,
                           city,
                           state,
                           (close_date==''?NULL:ToDate(close_date, 'dd-MMM-yy')) AS close_date
;

Build your date format string as described here:

http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
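
As a rough illustration of how the pattern letters in 'dd-MMM-yy' read a value, here is a plain-Java sketch using SimpleDateFormat directly. The class name CloseDateFormatCheck, the method normalize, and the sample value "15-Oct-14" are hypothetical and not from the original post. It only demonstrates the pattern syntax; Pig's ToDate may differ in details such as time zone handling.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for trying out a date format string outside Pig.
class CloseDateFormatCheck {

    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Parses a value such as "15-Oct-14" and returns it as yyyy-MM-dd,
    // or null when the value does not fit the pattern.
    static String normalize(String raw) {
        // dd = day of month, MMM = abbreviated month name, yy = two-digit year
        SimpleDateFormat in = new SimpleDateFormat("dd-MMM-yy", Locale.ENGLISH);
        in.setTimeZone(UTC);
        in.setLenient(false); // reject out-of-range values such as day 32
        SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd", Locale.ENGLISH);
        out.setTimeZone(UTC);
        try {
            Date parsed = in.parse(raw);
            return out.format(parsed);
        } catch (ParseException e) {
            return null; // mirrors the NULL branch in the Pig snippet
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("15-Oct-14"));
        System.out.println(normalize("not-a-date"));
    }
}
```

Fixing the locale to English keeps month abbreviations such as "Oct" parseable regardless of the machine's default locale.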

This snippet is shown in context here:

http://nathan.vertile.com/blog/2015/04/17/handling-dates-in-hadoop-pig/
