
My Java regex doesn't work properly

I wrote the following regular expression to extract dates from a string:

(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\*){0,2}\s+\d{1,2}\s+(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}

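Before converting it to a Java regex, I tested it here: http://regexr.com?35vlm

The result looked fine and matched what I wanted.

The "el" object is an ArrayList of strings with the following contents: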

holiday: New Year's Day Wednesday 1 January 2014
holiday: Chinese New Year Friday 31 January 2014 Saturday 1 February 2014
holiday: Good Friday Friday 18 April 2014
holiday: Labour Day Thursday 1 May 2014
holiday: Vesak Day Tuesday 13 May 2014
holiday: Hari Raya Puasa Monday 28 July 2014
holiday: National Day  Saturday 9 August 2014
holiday: Hari Raya Haji  Sunday* 5 October 2014
holiday: Deepavali  Thursday** 23 October 2014
holiday: Christmas Day Thursday 25 December 2014

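The problem is that in Java some of the dates are missed while others match. I also tested the expression at http://java-regex-tester.appspot.com/ and got the same error.

Update:

The full version of my code: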

import java.io.IOException;
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;


public class Tester {

    /**
     * @param args
     * @throws IOException 
     */
    public static void main(String[] args) throws IOException {

        updateSingaporeHolidayCalendar();
    }

    public static void updateSingaporeHolidayCalendar() throws IOException {

        String url = "http://www.mom.gov.sg/employment-practices/leave-and-holidays/Pages/public-holidays-2014.aspx";
        Document document = Jsoup.connect(url).get();

        Elements holidays = document.select("#contentarea table tr");
        // System.out.println("12312312");
        //System.out.println("web page context: " + question);
        List<String> el = new ArrayList<String>();
        for(int i = 2; i < holidays.size() + 1; i++){
            if((i&1) == 1) continue;
            Elements threeGroup = holidays.get(i-2).getElementsByTag("td");

            int j = 2;
            for(Element e : threeGroup){
                if(j-- != 0) continue;
                j = 2;
                el.add(e.text());
            }
        }


        Pattern pattern = Pattern.compile("(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\\*){0,2}\\s+\\d{1,2}\\s+(January|February|March|April|May|June|July|August|September|October|November|December)\\s+\\d{4}");

        //out put
        for(int k = 0; k < el.size(); k++){

            Matcher matcher = pattern.matcher(el.get(k));
            // Check all occurrences
            while (matcher.find()) {
                //System.out.print("Start index: " + matcher.start());
                //System.out.print(" End index: " + matcher.end());
                System.out.println(" Found: " + matcher.group());
            }
            System.out.println("holiday: " + el.get(k));
        }

    }

}

External JAR: JSoup.jar

Output:

  Found: Wednesday 1 January 2014
holiday: New Year's Day Wednesday 1 January 2014
 Found: Saturday 1 February 2014
holiday: Chinese New Year Friday 31 January 2014 Saturday 1 February 2014
holiday: Good Friday Friday 18 April 2014
 Found: Thursday 1 May 2014
holiday: Labour Day Thursday 1 May 2014
holiday: Vesak Day Tuesday 13 May 2014
holiday: Hari Raya Puasa Monday 28 July 2014
holiday: National Day  Saturday 9 August 2014
 Found: Sunday* 5 October 2014
holiday: Hari Raya Haji  Sunday* 5 October 2014
holiday: Deepavali  Thursday** 23 October 2014
 Found: Thursday 25 December 2014
holiday: Christmas Day Thursday 25 December 2014
holiday:  
holiday:  

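Solved

As @Pshemo said: "The data you are getting from the site also contains a no-break space, which can be written in HTML as &nbsp; and which apparently is not in the \\s class. To solve this problem, replace each \\s with [\\s\\u00A0] to also include this character (written using its Unicode identifier)."

So changing the expression to: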

 Pattern pattern = Pattern
        .compile("(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\\*){0,2}[\\s\u00A0]+\\d{1,2}[\\s\u00A0]+(January|February|March|April|May|June|July|August|September|October|November|December)[\\s\u00A0]+\\d{4}");

solved the problem.
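
Hidden characters like this are easy to miss because a no-break space prints exactly like a regular space. One way to spot them is to list every non-ASCII character in the scraped text. The following is only a minimal sketch: the class name, the method name, and the sample string (with the no-break space placed after the second "Friday") are assumptions made for illustration, not data taken from the actual page.

```java
import java.util.ArrayList;
import java.util.List;

class HiddenCharFinder {

    // Returns one entry per non-ASCII character, giving its index and code point.
    static List<String> describeNonAscii(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 127) {
                found.add(String.format("index %d: U+%04X", i, (int) c));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a no-break space sits between "Friday" and "18".
        String sample = "Good Friday Friday\u00A018 April 2014";
        for (String entry : describeNonAscii(sample)) {
            System.out.println(entry); // prints "index 18: U+00A0"
        }
    }
}
```

Running something like this over each element of "el" should show which rows contain U+00A0 and where.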

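The data you are getting from the site also contains a no-break space, which can be written in HTML as &nbsp; (or &#160;) and which apparently is not in the \\s class. To solve this problem, replace each \\s with [\\s\\u00A0] to also include this character (written using its Unicode identifier).

So your regex can look like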

Pattern pattern = Pattern
        .compile("(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\\*){0,2}[\\s\u00A0]+\\d{1,2}[\\s\u00A0]+(January|February|March|April|May|June|July|August|September|October|November|December)[\\s\u00A0]+\\d{4}");
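
To see the difference in isolation, here is a small sketch that compares \\s with [\\s\u00A0] on a shortened pattern. The class name, the helper methods, and the two sample strings are hypothetical and exist only for this demonstration. It assumes the default regex flags, under which \\s covers only space, tab, newline, vertical tab, form feed, and carriage return.

```java
import java.util.regex.Pattern;

class NbspRegexDemo {

    // Shortened versions of the date pattern, kept small for readability.
    static final Pattern PLAIN = Pattern.compile("Friday\\s+\\d{1,2}");
    static final Pattern WITH_NBSP = Pattern.compile("Friday[\\s\u00A0]+\\d{1,2}");

    static boolean plainFinds(String text) {
        return PLAIN.matcher(text).find();
    }

    static boolean fixedFinds(String text) {
        return WITH_NBSP.matcher(text).find();
    }

    public static void main(String[] args) {
        String regularSpace = "Friday 18";
        String noBreakSpace = "Friday\u00A018"; // looks identical when printed

        System.out.println(plainFinds(regularSpace)); // true
        System.out.println(plainFinds(noBreakSpace)); // false
        System.out.println(fixedFinds(regularSpace)); // true
        System.out.println(fixedFinds(noBreakSpace)); // true
    }
}
```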

Well, I can say for sure that you will miss the last two, because your iteration stops two elements short, i.e. change:

k < el.size() - 2

to:

k < el.size()

Or better yet, use a foreach:

for (String s : el) {
    Matcher matcher = pattern.matcher(s);
    // ...
}
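
As a fuller illustration of the foreach approach, the sketch below collects every match from a list of strings. The class name, the findDates helper, and the two sample rows (typed here with regular spaces) are assumptions made for the example. The pattern is the one from the question with the no-break space added to each whitespace class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ForEachMatchDemo {

    static final Pattern DATE = Pattern.compile(
            "(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\\*){0,2}[\\s\u00A0]+\\d{1,2}"
            + "[\\s\u00A0]+(January|February|March|April|May|June|July|August|September|October|November|December)"
            + "[\\s\u00A0]+\\d{4}");

    // Visits every element of the list, so no index bound can cut the loop short.
    static List<String> findDates(List<String> lines) {
        List<String> dates = new ArrayList<>();
        for (String s : lines) {
            Matcher matcher = DATE.matcher(s);
            while (matcher.find()) {
                dates.add(matcher.group());
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        // Hypothetical input modelled on two rows from the question.
        List<String> el = Arrays.asList(
                "Chinese New Year Friday 31 January 2014 Saturday 1 February 2014",
                "Deepavali  Thursday** 23 October 2014");
        for (String date : findDates(el)) {
            System.out.println("Found: " + date);
        }
    }
}
```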

Your regex looks OK.

Your for loop, for(int k = 0; k < el.size() - 2; k++), is limited to el.size() - 2. Try removing the -2 so that it iterates over all elements in the list.
