Java regex, capturing malformed html
I need to parse the date that follows the hyperlink latest-all.json.bz2, which is 29-Oct-2019 15:36, on this site: https://dumps.wikimedia.org/wikidatawiki/entities/
If you look at the page's HTML source:
<a href="latest-all.json.bz2">latest-all.json.bz2</a> 29-Oct-2019 15:36 42621256074
<a href="latest-all.json.gz">latest-all.json.gz</a> 29-Oct-2019 11:51 63776436005
<a href="latest-all.nt.bz2">latest-all.nt.bz2</a> 30-Oct-2019 22:46 84032013058
<a href="latest-all.nt.gz">latest-all.nt.gz</a> 30-Oct-2019 13:12 108976436346
<a href="latest-all.ttl.bz2">latest-all.ttl.bz2</a> 30-Oct-2019 15:43 52462636586
you will see that the date has no tag of its own, so I can't capture it with Jsoup. Instead, I tried this regex:
String html = this.doc.html();
String patternString = "(latest-all.json.gz<\/a>)(.*)";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(html);
System.out.println(matcher.group(0));
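But it doesn't capture the date. Can someone suggest a regular expression that matches the date I need?
Edit: I also tried (latest[-]all[.]json[.]bz2</a>)[ ]*(.*) but it didn't work either.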
Looking at your current regex:
String patternString = "(latest-all\\.ttl\\.gz<\\/a>)(.*)";
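This matches strings of the form latest-all.ttl.gz</a> followed by anything, which I don't believe is what you want.
First, the HTML source you shared contains no occurrence of "latest-all.ttl.gz" (I believe you meant to look for "latest-all.json.bz2"). Second, forward slashes do not need escaping in a Java regular expression.
With that in mind, a regex that should solve the problem is: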
String patternString = "(latest-all\\.json\\.bz2</a>)[\\n]*(.*)";
(I added the [\\n]* part to skip over any newlines between the <a> element and your date.)
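One thing worth checking in addition to the pattern itself: Matcher.group() is only valid after a successful find() or matches() call, and it throws IllegalStateException otherwise. Here is a minimal sketch that applies the pattern above to two lines of the listing from the question. The class name LatestDumpDate and the helper textAfterLink are made up for illustration, and the HTML is hard-coded rather than fetched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LatestDumpDate {
    // Two lines of the listing copied from the question; a real run would use the fetched page source.
    public static final String HTML =
        "<a href=\"latest-all.json.bz2\">latest-all.json.bz2</a> 29-Oct-2019 15:36 42621256074\n"
      + "<a href=\"latest-all.json.gz\">latest-all.json.gz</a> 29-Oct-2019 11:51 63776436005\n";

    // Returns the rest of the line after the latest-all.json.bz2 link, or null if the link is absent.
    public static String textAfterLink(String html) {
        Pattern pattern = Pattern.compile("(latest-all\\.json\\.bz2</a>)[\\n]*(.*)");
        Matcher matcher = pattern.matcher(html);
        // find() must succeed before group() may be called
        return matcher.find() ? matcher.group(2).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(textAfterLink(HTML));
    }
}
```

Group 2 holds everything up to the end of the line, so it contains the file size as well as the date and time. Splitting on whitespace, or tightening the second group to a date pattern, would isolate the date.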
Package Java Doc documentation, JAR file. Note: Wikipedia was offline when I tried to retrieve the HTML, so the file pasted into the SO post is used instead.
package so.y2019.Nov.q003;

import Torello.HTML.*;
import Torello.HTML.NodeSearch.*;
import Torello.Java.FileRW;
import java.util.*;
import java.io.IOException;
import java.net.URL;
import java.util.regex.*;

public class Wiki
{
    // Standard Date-Extraction Regular Expression
    private static final Pattern DATE_REG_EX = Pattern.compile("\\s+(\\d\\d-[A-Za-z]{3,10}-\\d\\d\\d\\d)\\s+");

    public static void main(String[] argv) throws IOException
    {
        // The Wikipedia page is unavailable. The data was saved to a flat file named "data.html"
        String html = FileRW.loadFileToString("so/y2019/Nov/q003/data.html");

        // Convert the S.O. Posted Question Sample HTML into an HTML Vector - has: TagNode, TextNode, CommentNode
        Vector<HTMLNode> page = HTMLPage.getPageTokens(html, false);

        // Iterate over the anchor elements in the HTML. Could specify with a certain classname, or ID.
        // Don't need to specify for this example, since there are only 5 anchor elements.
        HNLIInclusive iter = TagNodeInclusiveIterator.iter(page, "a");

        while (iter.hasNext())
        {
            // Next <A HREF=...> ... </A> set found by the iterator in the HTML Vector
            DotPair dp = iter.nextDotPair();

            // Look for a TextNode that matches the Date-Expression RegEx. Start looking at the ending
            // position of the <A HREF=...> ... </A> anchor "Dot Pair Set". '-1' means continue the search
            // until the end of the HTML Vector.
            TextNode txn = TextNodeGet.first(page, dp.end, -1, DATE_REG_EX);

            // Get the TextNode, and match the regular expression to get the "Date String"
            Matcher m = DATE_REG_EX.matcher(txn.str);
            String dateStr = m.find() ? m.group(1) : "Not Found";
            Date d = new Date(dateStr); // Deprecated, but still used.

            // Get the opening HTML anchor element ("<A HREF=...>") to get its href / file-download information.
            TagNode anchor = (TagNode) page.elementAt(dp.start);
            String hrefFileName = anchor.AV("href");

            System.out.println("For Download: [" + hrefFileName + "],\t\t" + "The Date Is: " + d.toString());
        }
    }
}
The code above prints this text to a UNIX terminal:
@cloudshell:~$ java so.y2019.Nov.q003.Wiki
For Anchor Class: latest-all.json.bz2, The Date Is:Tue Oct 29 00:00:00 CDT 2019
For Anchor Class: latest-all.json.gz, The Date Is:Tue Oct 29 00:00:00 CDT 2019
For Anchor Class: latest-all.nt.bz2, The Date Is:Wed Oct 30 00:00:00 CDT 2019
For Anchor Class: latest-all.nt.gz, The Date Is:Wed Oct 30 00:00:00 CDT 2019
For Anchor Class: latest-all.ttl.bz2, The Date Is:Wed Oct 30 00:00:00 CDT 2019
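The date regex above captures only the day, month and year, which is why every time in the output reads 00:00:00. If the time of day matters, one possible alternative using only the JDK is to capture the full timestamp and parse it with java.time instead of the deprecated Date(String) constructor. This is a rough sketch and not part of the library used above: the class name DumpListing, the method parse and the ENTRY pattern are made up for illustration, and it assumes every row looks like the sample in the question.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DumpListing {
    // Group 1: the href value. Group 2: the "dd-MMM-yyyy HH:mm" timestamp after the closing </a>.
    private static final Pattern ENTRY = Pattern.compile(
        "<a href=\"([^\"]+)\">[^<]*</a>\\s+(\\d{2}-[A-Za-z]{3}-\\d{4} \\d{2}:\\d{2})");

    // Locale.ENGLISH so that "Oct" parses regardless of the JVM's default locale
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("dd-MMM-yyyy HH:mm", Locale.ENGLISH);

    // Maps each linked file name to its timestamp, in the order the rows appear.
    public static Map<String, LocalDateTime> parse(String html) {
        Map<String, LocalDateTime> result = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(html);
        while (m.find()) {
            result.put(m.group(1), LocalDateTime.parse(m.group(2), FORMAT));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample rows copied from the question
        String html =
            "<a href=\"latest-all.json.bz2\">latest-all.json.bz2</a> 29-Oct-2019 15:36 42621256074\n"
          + "<a href=\"latest-all.nt.gz\">latest-all.nt.gz</a> 30-Oct-2019 13:12 108976436346\n";
        parse(html).forEach((file, date) -> System.out.println(file + " -> " + date));
    }
}
```

The listing does not state a time zone, so LocalDateTime is used here rather than a zoned type. Rows that do not fit the assumed layout are simply skipped by the matcher.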