
Filter a list using regex.Pattern

I need to filter a list of URLs using regex.Pattern. For now I have this pattern for file types:

private final static Pattern FILTERS_TYPE = Pattern.compile(".*(\\.(css|js|bmp|ico|gif|jpe?g"
    + "|png|tiff?|mid|mp2|mp3|mp4"
    + "|wav|avi|mov|mpeg|ram|m4v|pdf"
    + "|rm|smil|wmv|swf|wma|zip|rar|gz|jsp))$");

So I'm trying to write a filter to exclude sites like "facebook", "twitter", etc.

private final static Pattern FILTERS_NAME = Pattern.compile(
    ".*facebook.*|.*quotidiani.*|.*meteo.*|.*twitter.*|.*hotel.*|.*mobile.*|"
    + ".*histats:*");

But this one doesn't work. What is the correct syntax for FILTERS_NAME?

private List<WebURL> trash = new ArrayList<>(); // not targets (filtered out)
private List<WebURL> urls = new ArrayList<>(); // targets



public synchronized void collectorUrls() {
    for (int i = 0; i < urls.size(); i++) {
        String indirizzo = urls.get(i).getURL().toLowerCase();
        if (FILTERS_TYPE.matcher(indirizzo).matches()) {
            trash.add(urls.get(i));
            urls.remove(i);
        }
        if (FILTERS_NAME.matcher(indirizzo).matches()) {
            trash.add(urls.get(i));
            urls.remove(i);
        }
        System.out.println(urls.get(i).getURL());
    }
}

Use this regex:

private final static Pattern FILTERS_NAME =  
         Pattern.compile("facebook|quotidiani|meteo|twitter|hotel|mobile|histats:"); 

Then replace the Matcher.matches() call with Matcher.find(): matches() requires the whole input to match the pattern, while find() looks for a match anywhere in it. So instead of:

if(FILTERS_NAME.matcher(indirizzo).matches()) {...}

Use:

if(FILTERS_NAME.matcher(indirizzo).find()) {...}
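
To make the difference concrete, here is a minimal sketch that runs the same pattern through both methods. The class name FindVsMatches, the two helper methods, and the sample URL are made up for illustration; plain strings stand in for WebURL objects.

```java
import java.util.regex.Pattern;

public class FindVsMatches {
    // Same alternation as in the answer above.
    static final Pattern FILTERS_NAME =
            Pattern.compile("facebook|quotidiani|meteo|twitter|hotel|mobile|histats:");

    /** True if any of the keywords occurs anywhere in the URL. */
    static boolean isFilteredByFind(String url) {
        return FILTERS_NAME.matcher(url.toLowerCase()).find();
    }

    /** True only if the whole URL is exactly one of the alternatives. */
    static boolean isFilteredByMatches(String url) {
        return FILTERS_NAME.matcher(url.toLowerCase()).matches();
    }

    public static void main(String[] args) {
        String url = "http://www.facebook.com/somepage"; // hypothetical sample URL
        System.out.println(isFilteredByFind(url));    // prints true
        System.out.println(isFilteredByMatches(url)); // prints false
    }
}
```

With find(), the surrounding .* wildcards are not needed, which is why the shorter pattern is enough.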

I think your problem is that you remove items from the list while iterating over it.

The value of urls.size() decreases each time you remove a URL from the urls list, and the elements after the removed one shift down by one index. As a result, the element that follows each removed URL is skipped, and the last URLs in your list are never checked.

Iterate over the urls list with an Iterator in a while loop instead, and remove elements through Iterator.remove().
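
A minimal sketch of that iterator approach follows. It assumes plain strings instead of WebURL objects and applies only the name filter; the class UrlCollector and the method collect are hypothetical names, not part of the question's code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

public class UrlCollector {
    static final Pattern FILTERS_NAME =
            Pattern.compile("facebook|quotidiani|meteo|twitter|hotel|mobile|histats:");

    /**
     * Removes every filtered URL from urls (in place) and returns the
     * removed entries as the "trash" list.
     */
    static List<String> collect(List<String> urls) {
        List<String> trash = new ArrayList<>();
        Iterator<String> it = urls.iterator();
        while (it.hasNext()) {
            String url = it.next();
            if (FILTERS_NAME.matcher(url.toLowerCase()).find()) {
                trash.add(url);
                it.remove(); // safe removal: the iterator keeps its position consistent
            }
        }
        return trash;
    }
}
```

Because the element is removed through the iterator rather than by index, no element is skipped and no index bookkeeping is needed.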

Explanation:

Suppose urls contains the string "http://facebook.com" and the string "meteo.com".

  1. iteration step: i == 0 and urls.size() == 2

    The string matches the URL pattern, so urls.remove(0) is called.

  2. iteration step: i == 1 and urls.size() == 1

    i is no longer smaller than urls.size(), so the for loop exits and the second string in urls is never checked.
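
The two steps above can be reproduced with a small sketch. It assumes plain strings instead of WebURL objects and a pattern reduced to the two names used in the example; IndexLoopSkip and filterWithIndexLoop are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class IndexLoopSkip {
    // Reduced pattern: only the two names from the walkthrough.
    static final Pattern FILTERS_NAME = Pattern.compile("facebook|meteo");

    /** Index-based removal as in the question; returns what is left in the list. */
    static List<String> filterWithIndexLoop(List<String> input) {
        List<String> urls = new ArrayList<>(input);
        for (int i = 0; i < urls.size(); i++) {
            if (FILTERS_NAME.matcher(urls.get(i)).find()) {
                // Elements after i shift down by one, but i is still
                // incremented, so the next element is never examined.
                urls.remove(i);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://facebook.com", "meteo.com");
        // "meteo.com" survives even though it matches the pattern.
        System.out.println(filterWithIndexLoop(urls)); // prints [meteo.com]
    }
}
```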
