
Pig script/command to filter a file on multiple strings

I am trying to write a Java program or Hadoop Pig script that takes a parameter of comma-separated strings (e.g. abc, def, xyz) and filters a file for the records that contain one or more of these strings.

For example:

Input file:

1    abctree
2    pqrwewe
3    rtrxyz45
4    abcxyz
5    234rt23

Input parameter: abc, def, xyz

Expected output:

1    abctree
3    rtrxyz45
4    abcxyz

I can write a script that filters the file on a single string using matches, but I don't know how to do this for multiple strings. Do I need to write a UDF for this?

I have added the Java tag to this question because, according to my initial findings, I would have to write the UDF in Java. If anyone knows a way to do this in Java, please post your solution.

I have figured it out:

B = filter A by (n matches '.*string1.*' or n matches '.*string2.*' or n matches '.*string3.*');

This does the trick.

However, my requirement is to accept comma-separated input from the command line, e.g. string1, string2, string3. So the next task is to split it into the individual strings and use them in the expression above. If anyone knows how to do that (especially without UDFs), please post.
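One possible way to avoid a UDF is to collapse the three matches clauses into a single alternation, such as '.*(string1|string2|string3).*', and to build that string outside Pig. The sketch below is only an illustration of that idea: the class name PigPatternBuilder and the method buildPattern are hypothetical, and it assumes the search terms are plain alphanumeric text with no regex metacharacters. The resulting string could then be handed to the script through Pig's parameter substitution (for example -param pattern=... on the command line and n matches '$pattern' in the script), assuming that fits your setup.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the class and method names are illustrative only.
class PigPatternBuilder {

    // Turns "abc, def, xyz" into ".*(abc|def|xyz).*".
    // Assumption: the terms contain no regex metacharacters.
    static String buildPattern(String commaSeparated) {
        List<String> terms = new ArrayList<>();
        for (String term : commaSeparated.split(",")) {
            String trimmed = term.trim();
            if (!trimmed.isEmpty()) { // skip empty entries such as "abc,,xyz"
                terms.add(trimmed);
            }
        }
        if (terms.isEmpty()) {
            throw new IllegalArgumentException("no search terms given");
        }
        // The leading and trailing ".*" are needed because the pattern
        // must match the whole field, not just a part of it.
        return ".*(" + String.join("|", terms) + ").*";
    }

    public static void main(String[] args) {
        String input = args.length > 0 ? args[0] : "abc, def, xyz";
        System.out.println(buildPattern(input));
    }
}
```

With the sample parameter abc, def, xyz this produces .*(abc|def|xyz).*, which matches abctree, rtrxyz45 and abcxyz from the input file but not pqrwewe or 234rt23.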

I don't know about Pig, but in Java you could use something like this:

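// input is the comma-separated parameter, e.g. "abc, def, xyz"
String[] words = input.split("[\\s,]+");

// file is assumed to be an open java.io.BufferedReader
String line;
while ((line = file.readLine()) != null) {
    for (String word : words) {
        if (line.contains(word)) {
            System.out.println(line); // readLine() strips the line terminator
            break;                    // print each matching line only once
        }
    }
}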

contains is enough to find the words. Alternatively, you could build a regex from the input string and match on that. The expression would look like foo|bar|baz, but you need to escape metacharacters so that they are treated literally during the match, which can be done with java.util.regex.Pattern.quote.
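The regex approach could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name MultiStringFilter and the methods compileTerms and filter are made up for the example, and the lines are held in an in-memory list rather than read from a file. Each term is passed through Pattern.quote before the terms are joined with |, and Matcher.find is used so that the term may appear anywhere in the line.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical example class; names are illustrative only.
class MultiStringFilter {

    // Builds a pattern like \Qabc\E|\Qdef\E|\Qxyz\E from "abc, def, xyz".
    static Pattern compileTerms(String commaSeparated) {
        String alternation = Arrays.stream(commaSeparated.split(","))
                .map(String::trim)
                .filter(term -> !term.isEmpty()) // ignore empty entries
                .map(Pattern::quote)             // treat each term literally
                .collect(Collectors.joining("|"));
        if (alternation.isEmpty()) {
            throw new IllegalArgumentException("no search terms given");
        }
        return Pattern.compile(alternation);
    }

    // Keeps the lines that contain at least one of the terms.
    static List<String> filter(List<String> lines, Pattern pattern) {
        return lines.stream()
                .filter(line -> pattern.matcher(line).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1    abctree", "2    pqrwewe", "3    rtrxyz45",
                "4    abcxyz", "5    234rt23");
        Pattern pattern = compileTerms("abc, def, xyz");
        filter(lines, pattern).forEach(System.out::println);
    }
}
```

Run against the sample data from the question, this prints records 1, 3 and 4. Because of the quoting, a term such as a.c would only match the literal text a.c and not abc.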


 