Java regex to capture malformed HTML

Question

I need to parse the date that follows the hyperlink latest-all.json.bz2, which is 29-Oct-2019 15:36. It comes from this site: https://dumps.wikimedia.org/wikidatawiki/entities/

If you look at the site's HTML source:

<a href="latest-all.json.bz2">latest-all.json.bz2</a>                                29-Oct-2019 15:36         42621256074
<a href="latest-all.json.gz">latest-all.json.gz</a>                                 29-Oct-2019 11:51         63776436005
<a href="latest-all.nt.bz2">latest-all.nt.bz2</a>                                  30-Oct-2019 22:46         84032013058
<a href="latest-all.nt.gz">latest-all.nt.gz</a>                                   30-Oct-2019 13:12        108976436346
<a href="latest-all.ttl.bz2">latest-all.ttl.bz2</a>                                 30-Oct-2019 15:43         52462636586

you will see that no tag is associated with the date, so I can't capture it with Jsoup. Instead, I tried this regex:

String html = this.doc.html();
String patternString = "(latest-all.json.gz<\/a>)(.*)";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(html);
System.out.println(matcher.group(0));

But it does not capture the date. Can anyone suggest a regex that matches the required date?

Edit: I also tried (latest[-]all[.]json[.]bz2</a>)[ ]*(.*) but it doesn't work.

Answer 1

Looking at your current regex:

String patternString = "(latest-all\\.ttl\\.gz<\\/a>)(.*)";

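This matches strings of the form latest-all.ttl.gz<\/a> followed by anything, which I don't believe is what you want.

First, "latest-all.ttl.gz" does not appear anywhere in the source HTML you shared (I believe you meant to look for "latest-all.json.bz2"). Second, forward slashes do not need escaping in a regular expression.

With that in mind, a regex that should solve the problem is: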

String patternString = "(latest-all\\.json\\.bz2</a>)[\\n]*(.*)";

(I added the [\\n]* part to account for any newlines between the <a> tag and your date.)
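To make the idea concrete, here is a minimal sketch of how a pattern along these lines could be applied. It is not the exact regex above: the trailing (.*) is swapped for a narrower group that assumes the date always has the dd-MMM-yyyy HH:mm layout seen in the sample listing, so the file size is not captured along with it. The class name ListingDateFinder and the method findDate are hypothetical. Note that the snippet in the question calls group(0) without first calling find(), which throws an IllegalStateException, and that "<\/a>" is not a legal escape inside a Java string literal.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ListingDateFinder {
    // Group 1: the date and time that follow the closing </a> of the bz2 link.
    // The dd-MMM-yyyy HH:mm layout is an assumption taken from the sample listing.
    private static final Pattern BZ2_DATE = Pattern.compile(
            "latest-all\\.json\\.bz2</a>\\s*(\\d{2}-[A-Za-z]{3}-\\d{4} \\d{2}:\\d{2})");

    static Optional<String> findDate(String html) {
        Matcher matcher = BZ2_DATE.matcher(html);
        // find() must succeed before group() is called;
        // calling group() on an unmatched Matcher throws IllegalStateException.
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<a href=\"latest-all.json.bz2\">latest-all.json.bz2</a>"
                + "                                29-Oct-2019 15:36         42621256074\n";
        System.out.println(findDate(html).orElse("not found")); // prints 29-Oct-2019 15:36
    }
}
```

Returning an Optional keeps the "no match" case explicit instead of relying on an exception.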

Answer 2

You can use a regex like this:

\S+ \d{2}:\d{2}

Working demo
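Applied to the whole page, this pattern matches the first date it finds, which is not necessarily the one for the file you want. A rough sketch of one way around that is to pick out the line for the file first and then run the pattern on that line only. The class LineDateScanner, the method dateFor, and the fileName parameter are hypothetical names, and the sketch assumes each listing entry sits on its own line, as in the sample HTML.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class LineDateScanner {
    // The pattern from this answer: a run of non-space characters, one space, then HH:mm.
    private static final Pattern DATE_TIME = Pattern.compile("\\S+ \\d{2}:\\d{2}");

    // Returns the date and time on the listing line whose link text is fileName.
    static Optional<String> dateFor(String html, String fileName) {
        for (String line : html.split("\\R")) {          // \R matches any line break
            if (line.contains(fileName + "</a>")) {
                Matcher matcher = DATE_TIME.matcher(line);
                if (matcher.find()) {
                    return Optional.of(matcher.group());
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String html =
                "<a href=\"latest-all.json.bz2\">latest-all.json.bz2</a>     29-Oct-2019 15:36   42621256074\n"
              + "<a href=\"latest-all.json.gz\">latest-all.json.gz</a>      29-Oct-2019 11:51   63776436005\n";
        System.out.println(dateFor(html, "latest-all.json.gz").orElse("not found")); // prints 29-Oct-2019 11:51
    }
}
```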

Answer 3

Package Javadoc documentation, JAR file. Note: the Wikipedia page below was offline when I tried to retrieve the HTML, so the file pasted into the SO post is used instead.

package so.y2019.Nov.q003;

import Torello.HTML.*;
import Torello.HTML.NodeSearch.*;
import Torello.Java.FileRW;

import java.util.*;
import java.io.IOException;
import java.net.URL;
import java.util.regex.*;

public class Wiki
{
    // Standard Date-Extraction Regular Expression
    private static final Pattern DATE_REG_EX = Pattern.compile("\\s+(\\d\\d-[A-Za-z]{3,10}-\\d\\d\\d\\d)\\s+");

    public static void main(String[] argv) throws IOException
    {
        // The Wikipedia page is unavailable.  The data was saved to a flat file named "data.html"
        String html = FileRW.loadFileToString("so/y2019/Nov/q003/data.html");
        // Convert the S.O. Posted Question Sample HTML into an HTML Vector - has: TagNode, TextNode, CommentNode
        Vector<HTMLNode> page = HTMLPage.getPageTokens(html, false);

        // Iterate over the anchor elements in the HTML.  Could specify with a certain classname, or ID.
        // Don't need to specify for this example, since there are only 5 anchor elements.
        HNLIInclusive iter = TagNodeInclusiveIterator.iter(page, "a");
        while (iter.hasNext())
        {
            // Next <A HREF=...> ... </A>  set found by the iterator in the HTML Vector
            DotPair dp = iter.nextDotPair();

            // Look for a TextNode that matches the Date-Expression RegEx.  Start looking at the ending
            // position of the <A HREF=...> ... </A> anchor "Dot Pair Set"   '-1' means continue the search
            // until the end of the HTML Vector.
            TextNode txn = TextNodeGet.first(page, dp.end, -1, DATE_REG_EX);

            // Get the TextNode, and match the regular expression to get the "Date String"
            Matcher m = DATE_REG_EX.matcher(txn.str);
            String dateStr = m.find() ? m.group(1) : "Not Found";
            Date d = new Date(dateStr);   // Deprecated, but still used.

            // Get the opening HTML anchor element ("<A HREF=...>") to get its href / file-download information.
            TagNode anchor = (TagNode) page.elementAt(dp.start);
            String hrefFileName = anchor.AV("href");

            System.out.println("For Download: [" + hrefFileName + "],\t\t" + "The Date Is: " + d.toString());
        }

    }
}

The code above prints this text to a UNIX terminal:

@cloudshell:~$ java so.y2019.Nov.q003.Wiki
For Anchor Class: latest-all.json.bz2,  The Date Is:Tue Oct 29 00:00:00 CDT 2019
For Anchor Class: latest-all.json.gz,   The Date Is:Tue Oct 29 00:00:00 CDT 2019
For Anchor Class: latest-all.nt.bz2,    The Date Is:Wed Oct 30 00:00:00 CDT 2019
For Anchor Class: latest-all.nt.gz,     The Date Is:Wed Oct 30 00:00:00 CDT 2019
For Anchor Class: latest-all.ttl.bz2,   The Date Is:Wed Oct 30 00:00:00 CDT 2019
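The times in that output are all 00:00:00 because the regex captures only the date part, and the Date(String) constructor is deprecated. As a possible alternative, here is a small sketch that parses the full "date time" text with java.time instead. The class name ListingTimestampParser is hypothetical, and the dd-MMM-yyyy HH:mm layout is an assumption based on the sample listing.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, for illustration only.
class ListingTimestampParser {
    // Layout assumed from the sample listing, e.g. "29-Oct-2019 15:36".
    // Locale.ENGLISH makes the month abbreviation parse the same on any machine.
    private static final DateTimeFormatter LISTING_FORMAT =
            DateTimeFormatter.ofPattern("dd-MMM-yyyy HH:mm", Locale.ENGLISH);

    // Throws DateTimeParseException if the text does not follow the assumed layout.
    static LocalDateTime parse(String text) {
        return LocalDateTime.parse(text, LISTING_FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(parse("29-Oct-2019 15:36")); // prints 2019-10-29T15:36
    }
}
```

The listing does not state a time zone, so LocalDateTime is used here rather than a zoned type.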

Answer 1
0 Accepted 2019-11-04 19:23:01

Answer 2
0 2019-11-04 19:47:57

Answer 3
0 2019-11-04 23:14:50
