I am trying to write a Java program or a Hadoop Pig script that takes a parameter of comma-separated strings (e.g. abc, def, xyz) and filters a file for the records that contain one or more of these strings.
For example:
Input File:
1 abctree
2 pqrwewe
3 rtrxyz45
4 abcxyz
5 234rt23
Input parameter is: abc, def, xyz
Expected output:
1 abctree
3 rtrxyz45
4 abcxyz
I am able to write a script that filters the file on one string using matches, but I don't know how to do that for multiple strings. Do I need to write a UDF for this?
I have added the Java tag to this question because, as per my initial findings, I will have to write a UDF, and that would be written in Java. So if anyone knows a way to do this in Java, please post your solution.
I have figured it out:
B = filter A by (n matches '.*string1.*' or n matches '.*string2.*' or n matches '.*string3.*');
This does the trick.
However, for my requirement, I will be accepting a comma-separated input from the command line, e.g. string1, string2, string3. So the next task is to somehow separate the individual strings and use them in the above expression. If anyone knows how to do it (especially without UDFs), please post.
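One possible way to avoid a UDF is to collapse the chained "or" conditions into a single alternation regex and build that string outside Pig. The sketch below only shows the string-building step in Java. The class and method names (PigRegexBuilder, toMatchesRegex) are made up for illustration, it assumes the keywords are plain letters and digits so no escaping is needed, and how the result is handed to the Pig script (for example through parameter substitution) is left open and not verified here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: builds one regex for use with a full-string match.
class PigRegexBuilder {

    // Turns "string1, string2, string3" into ".*(string1|string2|string3).*".
    static String toMatchesRegex(String csv) {
        List<String> words = new ArrayList<>();
        for (String part : csv.split(",")) {
            String word = part.trim();
            if (word.isEmpty()) {
                continue; // skip empty entries such as a trailing comma
            }
            // Assumption: keywords contain only letters and digits, so no regex escaping is needed.
            if (!word.matches("[A-Za-z0-9]+")) {
                throw new IllegalArgumentException("unsupported keyword: " + word);
            }
            words.add(word);
        }
        if (words.isEmpty()) {
            throw new IllegalArgumentException("no keywords given");
        }
        return ".*(" + String.join("|", words) + ").*";
    }

    public static void main(String[] args) {
        // prints .*(string1|string2|string3).*
        System.out.println(toMatchesRegex("string1, string2, string3"));
    }
}
```

For plain keywords, a line matches this single expression exactly when it would match at least one of the separate '.*stringN.*' conditions.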
I don't know about Pig, but in Java you could use something like this:
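// Split the parameter on commas and/or whitespace: "abc, def, xyz" -> [abc, def, xyz]
// trim() avoids an empty first element, which would match every line
String[] words = input.trim().split("[\\s,]+");
String line;
// file is assumed to be a java.io.BufferedReader opened on the input file
while ((line = file.readLine()) != null) {
    for (String word : words) {
        if (line.contains(word)) {
            // readLine() strips the line terminator, so println restores it
            System.out.println(line);
            break; // print each matching line only once
        }
    }
}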
contains is enough to find the words. Alternatively, you could build a regex from the input string and match on that. The expression would look like foo|bar|baz, but you need to escape metacharacters so that they are treated literally during the match, which can be done with java.util.regex.Pattern.quote.
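A minimal sketch of the regex variant described above, using the sample data from the question. The class and method names (KeywordFilter, buildPattern, filter) are only illustrative, and the lines are passed in as a list rather than read from a file to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Illustrative names; this is a sketch, not a definitive implementation.
class KeywordFilter {

    // Builds an alternation of literal keywords from input such as "abc, def, xyz".
    static Pattern buildPattern(String input) {
        String alternation = Arrays.stream(input.trim().split("[\\s,]+"))
                .filter(w -> !w.isEmpty())   // drop empty pieces, e.g. from a leading comma
                .map(Pattern::quote)         // treat each keyword literally
                .collect(Collectors.joining("|"));
        if (alternation.isEmpty()) {
            throw new IllegalArgumentException("no keywords given");
        }
        return Pattern.compile(alternation);
    }

    // Keeps the lines that contain at least one keyword.
    static List<String> filter(List<String> lines, String input) {
        Pattern pattern = buildPattern(input);
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // find() looks for the pattern anywhere in the line, unlike matches()
            if (pattern.matcher(line).find()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1 abctree", "2 pqrwewe", "3 rtrxyz45", "4 abcxyz", "5 234rt23");
        // prints lines 1, 3 and 4 of the sample input
        filter(lines, "abc, def, xyz").forEach(System.out::println);
    }
}
```

Compiling the pattern once and reusing it for every line avoids looping over the keywords per line. Because of Pattern.quote, a keyword such as a.c only matches the literal text a.c and not abc.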