Pig script/command to filter a file on multiple strings
I am trying to write a Java program or Hadoop Pig script that takes a parameter of comma-separated strings (e.g. abc, def, xyz) and filters a file for the records that contain one or more of these strings.
For example:
Input file:
1 abctree
2 pqrwewe
3 rtrxyz45
4 abcxyz
5 234rt23
Input parameter: abc, def, xyz
Expected output:
1 abctree
3 rtrxyz45
4 abcxyz
I am able to write a script that filters the file on one string using matches, but I don't know how to do that for multiple strings. Do I need to write a UDF for this?
I have added the Java tag to this question because, according to my initial findings, I would have to write a UDF, which would be written in Java. So if anyone knows a way to write this in Java, please post your solution.
I have figured it out:
B = filter A by (n matches '.*string1.*' or n matches '.*string2.*' or n matches '.*string3.*');
This does the trick.
However, for my requirement I will be accepting comma-separated input from the command line, e.g. string1, string2, string3. So the next task is to somehow separate the individual strings and use them in the above expression. If anyone knows how to do it (especially without UDFs), please post.
I don't know about Pig, but in Java you could use something like this:
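// input is the comma-separated parameter, e.g. "abc, def, xyz";
// file is assumed to be a java.io.BufferedReader over the input file.
String[] words = input.split("[\\s,]+");
String line;
while ((line = file.readLine()) != null) {
    for (String word : words) {
        if (line.contains(word)) {
            // readLine() strips the line terminator, so use println
            System.out.println(line);
            break; // print each matching line only once
        }
    }
}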
contains is enough to find the words. You could also build a regex from the input string and match on that. The expression would look like foo|bar|baz, but you need to escape metacharacters so that they are treated literally during the match, which can be done with java.util.regex.Pattern.quote.
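The regex idea could be sketched roughly as follows. This is only an illustration under a few assumptions: the class name MultiStringFilter and the helper names buildPattern and filter are made up for this example, and the lines are taken from an in-memory list rather than from a real file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of Pig or of the question's code.
class MultiStringFilter {

    // Turns "abc, def, xyz" into a pattern equivalent to abc|def|xyz,
    // quoting each word so regex metacharacters are matched literally.
    static Pattern buildPattern(String input) {
        String alternation = Arrays.stream(input.trim().split("[\\s,]+"))
                .filter(word -> !word.isEmpty())
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        if (alternation.isEmpty()) {
            // An empty pattern would match every line, so reject it.
            throw new IllegalArgumentException("no search strings given");
        }
        return Pattern.compile(alternation);
    }

    // Keeps the lines that contain at least one of the words.
    static List<String> filter(List<String> lines, Pattern pattern) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // find() looks for a match anywhere in the line
            if (pattern.matcher(line).find()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1 abctree", "2 pqrwewe", "3 rtrxyz45", "4 abcxyz", "5 234rt23");
        Pattern pattern = buildPattern("abc, def, xyz");
        for (String line : filter(lines, pattern)) {
            System.out.println(line);
        }
    }
}
```

With the sample input from the question, this should print the lines numbered 1, 3 and 4. Note that find() is used rather than matches(), because matches() would require the whole line to match the pattern.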