Optimizing filter function in PIG
I have written a filter function for my Pig script, but the job is taking too long. It processes 15 GB of data on a 5-node cluster.
Can anybody suggest how to optimize my code?
package org.apache.pig.builtin;

import java.util.*;
import java.io.IOException;
import java.util.Map;
import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.DataType;
import org.apache.pig.impl.util.WrappedIOException;

public class filterIP extends FilterFunc {
    ArrayList<String> Ar1=new ArrayList<String>(){
        {
            add("151.193.220.28");
            ....
            //Around 2000 IP's to be filtered
            add("129.22.63.207");
        }
    };

    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return true;
        try {
            Object values = input.get(0);
            if (values instanceof DataBag)
                return ((DataBag)values).size() == 0;
            else if (values instanceof Map)
                return ((Map)values).size() == 0;
            else if (values instanceof String){
                for(String s:Ar1){
                    if(((String)values).matches(".*"+s+".*"))
                        return false;
                }
                return true;
                //return !((String)values).matches(".*"+Ar1.get(1)+".*");
            }
            else{
                return false;
                // throw new IOException("Cannot test a " + DataType.findTypeName(values) + " for required match.");
            }
        } catch (ExecException ee) {
            throw WrappedIOException.wrap("Caught exception processing input row ", ee);
        }
    }
}
Use a HashSet rather than an ArrayList: the for (String s:Ar1) loop is your bottleneck. For every row it builds and runs up to about 2000 regular expressions, and String.matches() compiles a new pattern on each call.
Are you trying to filter a field by a set of IP addresses? If so, can you replace the matches() call with set.contains(values)? That only works if the value contains just the IP address, with no prefix or suffix characters, but I'm sure you can write a regex to find the IP address string first, extract it, and then check it for membership in the hash set.
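Here is a minimal sketch of that idea, kept free of Pig classes so it can be tried on its own. The class name IpBlocklist, the method keep() and the IPv4 pattern are my own assumptions, not anything from Pig or from your code, and only the two addresses visible in the question are listed. Note that it compares whole IP-like tokens, so it is slightly stricter than the original substring match.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Pig class): the lookup logic that exec() could delegate to.
final class IpBlocklist {
    // Built once per JVM; only the two addresses shown in the question appear here.
    private static final Set<String> BLOCKED = new HashSet<>(Arrays.asList(
            "151.193.220.28",
            "129.22.63.207"));

    // Compiled once. Loosely matches dotted-quad candidates; it does not
    // validate that each octet is in the 0-255 range (an assumed simplification).
    private static final Pattern IPV4 =
            Pattern.compile("\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b");

    private IpBlocklist() {}

    // Returns false when any IP-like token in the value is in the blocked set,
    // i.e. the row should be filtered out; returns true otherwise.
    static boolean keep(String value) {
        Matcher m = IPV4.matcher(value);
        while (m.find()) {
            if (BLOCKED.contains(m.group())) {
                return false;
            }
        }
        return true;
    }
}
```

In the String branch of exec() you would then return IpBlocklist.keep((String) values) instead of looping over Ar1. Each row costs one regex scan plus a constant-time set lookup per address found, rather than roughly 2000 pattern compilations and matches.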