Pig script/command to filter a file on multiple strings

Question

I am trying to write a Java program or Hadoop Pig script that takes a parameter of comma-separated strings (e.g. abc, def, xyz) and filters a file for the records that contain one or more of these strings.

Example:

Input File:

1    abctree
2    pqrwewe
3    rtrxyz45
4    abcxyz
5    234rt23

Input parameter is: abc, def, xyz

Expected output:

1    abctree
3    rtrxyz45
4    abcxyz

I am able to write a script that filters the file on one string using matches, but I don't know how to do that for multiple strings. Do I need to write a UDF for this?

I have added the Java tag to this question because, from my initial findings, I will probably have to write a UDF, and that would be written in Java. So if anyone knows a way to do this in Java, please post your solution.

Answer 1

I have figured it out:

B = filter A by (n matches '.*string1.*' or n matches '.*string2.*' or n matches '.*string3.*');

This does the trick.

However, for my requirement, I will be accepting comma-separated input from the command line, e.g. string1, string2, string3. So the next task is to somehow separate the individual strings and use them in the above expression. If anyone knows how to do it (especially without UDFs), please post.
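One possible way to handle the splitting outside the script is sketched below in Java. It turns the comma-separated parameter into a single pattern of the form .*(string1|string2|string3).*, which is equivalent to the chain of or conditions above. The class and method names are hypothetical, the sketch assumes the search strings contain no regex metacharacters, and how the resulting pattern is handed to the Pig script (for example through Pig's parameter substitution) is an assumption that is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds one whole-string pattern from a comma-separated list.
class MatchPatternBuilder {

    // Turns "abc, def, xyz" into ".*(abc|def|xyz).*".
    // Assumption: the search strings contain no regex metacharacters.
    static String buildPattern(String commaSeparated) {
        List<String> words = new ArrayList<>();
        for (String word : commaSeparated.split(",")) {
            String trimmed = word.trim();
            if (!trimmed.isEmpty()) {
                words.add(trimmed);
            }
        }
        if (words.isEmpty()) {
            // An empty group would match every record, so reject it.
            throw new IllegalArgumentException("no search strings given");
        }
        return ".*(" + String.join("|", words) + ").*";
    }

    public static void main(String[] args) {
        String pattern = buildPattern("abc, def, xyz");
        System.out.println(pattern);

        String[] values = {"abctree", "pqrwewe", "rtrxyz45", "abcxyz", "234rt23"};
        for (String value : values) {
            // String.matches tests the whole string, hence the .* on both sides.
            if (value.matches(pattern)) {
                System.out.println(value);
            }
        }
    }
}
```

Run against the sample values from the question, this prints the pattern followed by abctree, rtrxyz45 and abcxyz.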

Answer 2

I don't know about Pig, but in Java you could use something like this:

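// Assumes input holds the parameter (e.g. "abc, def, xyz") and
// file is an open java.io.BufferedReader over the data file.
String[] words = input.trim().split("[\\s,]+");

String line;
while ((line = file.readLine()) != null) {
    for (String word : words) {
        if (line.contains(word)) {
            // readLine() strips the line terminator, so println restores it.
            System.out.println(line);
            break; // one hit is enough; avoids printing the line twice
        }
    }
}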

contains is enough to find the words. Alternatively, you could build a regex from the input string and match on that. The expression would look like foo|bar|baz, but you need to escape metacharacters so that they are treated literally during the match, which can be done with java.util.regex.Pattern.quote.
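The regex variant could look roughly like the following sketch. The class and method names are hypothetical, and the lines are held in an in-memory list only to keep the example self-contained; Pattern.quote, Pattern.compile and Matcher.find are standard java.util.regex calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: filters lines with one alternation of quoted words.
class QuotedAlternation {

    // Builds a pattern such as \Qabc\E|\Qdef\E|\Qxyz\E from "abc, def, xyz".
    static Pattern compile(String input) {
        List<String> quoted = new ArrayList<>();
        for (String word : input.trim().split("[\\s,]+")) {
            if (!word.isEmpty()) {
                // Pattern.quote makes metacharacters such as . or * literal.
                quoted.add(Pattern.quote(word));
            }
        }
        if (quoted.isEmpty()) {
            throw new IllegalArgumentException("no search strings given");
        }
        return Pattern.compile(String.join("|", quoted));
    }

    // Keeps the lines in which at least one of the words occurs.
    static List<String> filter(Pattern pattern, List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // find() looks for the pattern anywhere in the line.
            if (pattern.matcher(line).find()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1    abctree", "2    pqrwewe", "3    rtrxyz45",
                "4    abcxyz", "5    234rt23");
        for (String line : filter(compile("abc, def, xyz"), lines)) {
            System.out.println(line);
        }
    }
}
```

Because every word is quoted, a search string such as a.c only matches the literal text a.c and not abc.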

solution1
1 ACCEPTED 2012-03-25 21:19:54

solution2
-2 2012-03-24 05:31:46
