My Java regex doesn't work properly
I wrote the regular expression below to extract dates from a string:
(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\*){0,2}\s+\d{1,2}\s+(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}
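Before converting it to a Java regex, I tested it here: http://regexr.com?35vlm
The result looked fine and matched what I wanted.
The "el" object is an ArrayList of strings with these contents: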
holiday: New Year's Day Wednesday 1 January 2014
holiday: Chinese New Year Friday 31 January 2014 Saturday 1 February 2014
holiday: Good Friday Friday 18 April 2014
holiday: Labour Day Thursday 1 May 2014
holiday: Vesak Day Tuesday 13 May 2014
holiday: Hari Raya Puasa Monday 28 July 2014
holiday: National Day Saturday 9 August 2014
holiday: Hari Raya Haji Sunday* 5 October 2014
holiday: Deepavali Thursday** 23 October 2014
holiday: Christmas Day Thursday 25 December 2014
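The problem is that in Java some of the dates are missed while others are matched. I also tested the expression at http://java-regex-tester.appspot.com/ and got the same error.
Update:
The full version of my code: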
import java.io.IOException;
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Tester {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        updateSingaporeHolidayCalendar();
    }

    public static void updateSingaporeHolidayCalendar() throws IOException {
        String url = "http://www.mom.gov.sg/employment-practices/leave-and-holidays/Pages/public-holidays-2014.aspx";
        Document document = Jsoup.connect(url).get();
        Elements holidays = document.select("#contentarea table tr");
        // System.out.println("12312312");
        // System.out.println("web page context: " + question);
        List<String> el = new ArrayList<String>();
        for (int i = 2; i < holidays.size() + 1; i++) {
            if ((i & 1) == 1) continue;
            Elements threeGroup = holidays.get(i - 2).getElementsByTag("td");
            int j = 2;
            for (Element e : threeGroup) {
                if (j-- != 0) continue;
                j = 2;
                el.add(e.text());
            }
        }
        Pattern pattern = Pattern.compile("(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\\*){0,2}\\s+\\d{1,2}\\s+(January|February|March|April|May|June|July|August|September|October|November|December)\\s+\\d{4}");
        // output
        for (int k = 0; k < el.size(); k++) {
            Matcher matcher = pattern.matcher(el.get(k));
            // Check all occurrences
            while (matcher.find()) {
                // System.out.print("Start index: " + matcher.start());
                // System.out.print(" End index: " + matcher.end());
                System.out.println(" Found: " + matcher.group());
            }
            System.out.println("holiday: " + el.get(k));
        }
    }
}
External jar: JSoup.jar
Output:
Found: Wednesday 1 January 2014
holiday: New Year's Day Wednesday 1 January 2014
Found: Saturday 1 February 2014
holiday: Chinese New Year Friday 31 January 2014 Saturday 1 February 2014
holiday: Good Friday Friday 18 April 2014
Found: Thursday 1 May 2014
holiday: Labour Day Thursday 1 May 2014
holiday: Vesak Day Tuesday 13 May 2014
holiday: Hari Raya Puasa Monday 28 July 2014
holiday: National Day Saturday 9 August 2014
Found: Sunday* 5 October 2014
holiday: Hari Raya Haji Sunday* 5 October 2014
holiday: Deepavali Thursday** 23 October 2014
Found: Thursday 25 December 2014
holiday: Christmas Day Thursday 25 December 2014
holiday:
holiday:
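Solved:
As @Pshemo said, "The data you get from the site also contains no-break spaces, which can be written in HTML as &nbsp; and which apparently are not in the \\s class. To solve this problem, replace each \\s with [\\s\u00A0] to include this character (written using its Unicode identifier)."
So I changed the expression to: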
Pattern pattern = Pattern
.compile("(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\\*){0,2}[\\s\u00A0]+\\d{1,2}[\\s\u00A0]+(January|February|March|April|May|June|July|August|September|October|November|December)[\\s\u00A0]+\\d{4}");
That solved the problem.
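To confirm this kind of diagnosis before changing the regex, it can help to list any non-ASCII characters hiding in the scraped text. The following is only a minimal sketch and not part of the original code: the class name HiddenCharDump, the method nonAscii, and the sample string (including where the no-break space sits) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class HiddenCharDump {

    /** Lists every non-ASCII code point in the text as "U+XXXX@index". */
    static List<String> nonAscii(String text) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 0x7F) {
                found.add(String.format("U+%04X@%d", cp, i));
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a no-break space between "Tuesday" and "13".
        String sample = "Tuesday\u00A013 May 2014";
        System.out.println(nonAscii(sample)); // prints [U+00A0@7]
    }
}
```

Running something like this over each entry of el would show whether the lines that fail to match contain U+00A0 where a regular space was expected.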
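The data you get from the site also contains no-break spaces, which can be written in HTML as &nbsp; and which apparently are not in the \\s class. To solve this problem, replace each \\s with [\\s\u00A0] to include this character (written using its Unicode identifier).
So your regex would look like this: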
Pattern pattern = Pattern
.compile("(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\\*){0,2}[\\s\u00A0]+\\d{1,2}[\\s\u00A0]+(January|February|March|April|May|June|July|August|September|October|November|December)[\\s\u00A0]+\\d{4}");
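The difference between the two separators can be checked in isolation. This is a small sketch rather than the asker's actual program: the names NbspDemo, datePattern and finds are hypothetical, and the sample string assumes the no-break space sits between the weekday and the day number.

```java
import java.util.regex.Pattern;

final class NbspDemo {
    static final String DAYS =
        "(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)";
    static final String MONTHS =
        "(January|February|March|April|May|June|July|August|September"
        + "|October|November|December)";

    /** Builds the date pattern using the given regex fragment as separator. */
    static Pattern datePattern(String sep) {
        return Pattern.compile(DAYS + "(\\*){0,2}" + sep + "+\\d{1,2}"
            + sep + "+" + MONTHS + sep + "+\\d{4}");
    }

    static boolean finds(Pattern p, String text) {
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        // Assumed input: U+00A0 instead of a regular space after "Tuesday".
        String text = "Vesak Day Tuesday\u00A013 May 2014";

        // By default \s covers only space, tab, newline, vertical tab,
        // form feed and carriage return, so the no-break space breaks the match.
        System.out.println(finds(datePattern("\\s"), text));          // prints false

        // Adding U+00A0 to the character class restores the match.
        System.out.println(finds(datePattern("[\\s\\u00A0]"), text)); // prints true
    }
}
```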
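Well, I can say for sure that you will miss the last two entries, because your loop stops two elements short. Change:
k < el.size() - 2
to
k < el.size()
or, better yet, use a foreach loop: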
for (String s : el) {
Matcher matcher = pattern.matcher(s);
// ...
}
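A fuller version of the foreach approach might look like the sketch below. It is an illustration only: the class name ForEachDemo and the method findDates are hypothetical, the sample lines are taken from the output shown in the question, and the pattern uses the [\\s\\u00A0] separator from the other answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ForEachDemo {
    static final Pattern DATE = Pattern.compile(
        "(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)(\\*){0,2}"
        + "[\\s\\u00A0]+\\d{1,2}[\\s\\u00A0]+"
        + "(January|February|March|April|May|June|July|August|September"
        + "|October|November|December)[\\s\\u00A0]+\\d{4}");

    /** Collects every date found in every line, visiting the whole list. */
    static List<String> findDates(List<String> lines) {
        List<String> dates = new ArrayList<>();
        for (String s : lines) {          // no index, so no off-by-two bound
            Matcher matcher = DATE.matcher(s);
            while (matcher.find()) {      // a line may contain several dates
                dates.add(matcher.group());
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        List<String> el = Arrays.asList(
            "Chinese New Year Friday 31 January 2014 Saturday 1 February 2014",
            "Hari Raya Haji Sunday* 5 October 2014",
            "");
        for (String date : findDates(el)) {
            System.out.println("Found: " + date);
        }
    }
}
```

Because the loop never touches an index, there is no bound to get wrong, and the empty trailing entries simply produce no matches.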
Your regex looks OK.
Your for loop, for(int k = 0; k < el.size() - 2; k++),
only runs up to el.size() - 2.
Try removing the -2 so that it iterates over every element in the list.