
Optimizing filter function in PIG

I have written a filter function for my Pig script, but the job is taking too long. It processes 15 GB of data on a 5-node cluster.

Can anybody suggest how to optimize my code?

    package org.apache.pig.builtin;

    import java.util.*;
    import java.io.IOException;
    import java.util.Map;
    import org.apache.pig.FilterFunc;
    import org.apache.pig.backend.executionengine.ExecException;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.DataType;
    import org.apache.pig.impl.util.WrappedIOException;

    public class filterIP extends FilterFunc {
        ArrayList<String> Ar1 = new ArrayList<String>() {
            {
                add("151.193.220.28");
                ....
                // Around 2000 IPs to be filtered
                add("129.22.63.207");
            }
        };

        public Boolean exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0)
                return true;
            try {
                Object values = input.get(0);
                if (values instanceof DataBag)
                    return ((DataBag) values).size() == 0;
                else if (values instanceof Map)
                    return ((Map) values).size() == 0;
                else if (values instanceof String) {
                    for (String s : Ar1) {
                        if (((String) values).matches(".*" + s + ".*"))
                            return false;
                    }
                    return true;
                    // return !((String) values).matches(".*" + Ar1.get(1) + ".*");
                } else {
                    return false;
                    // throw new IOException("Cannot test a " + DataType.findTypeName(values) + " for required match.");
                }
            } catch (ExecException ee) {
                throw WrappedIOException.wrap("Caught exception processing input row ", ee);
            }
        }
    }

Use a HashSet rather than an ArrayList; the for (String s : Ar1) loop is your bottleneck. Every record is tested against up to 2000 patterns, and String.matches() recompiles its regular expression on each call.
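
As a rough illustration of that suggestion, here is a minimal sketch of the lookup on its own, outside of Pig. The class name ExactIpFilter and the method keep() are hypothetical, only the two addresses shown in the question are included, and the sketch assumes the field holds exactly one IP address and nothing else.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Pig: shows the HashSet lookup in isolation.
class ExactIpFilter {
    // Two addresses from the question; the real list has around 2000 entries.
    private static final Set<String> BLOCKED = new HashSet<>(Arrays.asList(
            "151.193.220.28",
            "129.22.63.207"));

    // Returns true when the record should be kept, i.e. the value is not a blocked IP.
    // Assumption: the value is exactly one IP address, with no surrounding text.
    static boolean keep(String value) {
        return !BLOCKED.contains(value);
    }
}
```

The set is built once, and each record costs a single hash lookup instead of a pass over the whole list. Inside exec(), the String branch could then return the result of such a lookup.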

Are you trying to filter a field by a set of IP addresses? If so, can you replace the matches() call with set.contains(values)? This only works if the value contains just the IP address, with no prefix or suffix characters. Otherwise, I'm sure you can write a regex to find the IP address string first, extract it, and then check it for membership in the hash set.
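
The extract-then-look-up idea might be sketched as follows. This is an assumption-laden illustration rather than a drop-in replacement: EmbeddedIpFilter and keep() are hypothetical names, the pattern only finds dotted-quad candidates without checking that each octet is in the 0-255 range, and only the two addresses shown in the question are listed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: finds IPv4-looking tokens in a string
// and checks each one against a set of blocked addresses.
class EmbeddedIpFilter {
    // Two addresses from the question; the real list has around 2000 entries.
    private static final Set<String> BLOCKED = new HashSet<>(Arrays.asList(
            "151.193.220.28",
            "129.22.63.207"));

    // Compiled once and reused. Matches four dot-separated groups of 1-3 digits;
    // it does not validate that each group is a legal octet.
    private static final Pattern IPV4 =
            Pattern.compile("\\b\\d{1,3}(?:\\.\\d{1,3}){3}\\b");

    // Returns true when no blocked address appears in the value.
    static boolean keep(String value) {
        Matcher m = IPV4.matcher(value);
        while (m.find()) {
            if (BLOCKED.contains(m.group())) {
                return false;
            }
        }
        return true;
    }
}
```

Note that this is stricter than the original loop, which matched each address as a substring and treated the dots as regex wildcards, so the two versions can disagree on edge cases such as an address embedded in a longer number.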
