How to filter HTML file content with Java regular expressions?

Question

With Java I can download a web page into a txt file. Now I want to read values from this txt file using a regular expression. Below is a small part of the real HTML.

<div>
   <input id="_NAME" value="/John/" />
   <input id="_LASTNAME" value="/BOND/"/>
   <input id="_Class"   value="5" />
</div>

I just want to read the values by id (_NAME and _LASTNAME). Thanks in advance.

Answer 1

As long as the HTML file is usable when browsing, it should be parsable by Jsoup. Since you are only querying attributes of input elements, you don't really have to worry about the structure of the resulting DOM.

Sample code, using your example HTML with a bunch of bad HTML tags in front:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;


public class SO27938637 {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div><span><div><b>sdf</span>dsf<i>sdfdsfsdfds<span></div><div>\n    <input id=\"_NAME\" value=\"/John/\" />\n   <input id=\"_LASTNAME\" value=\"/BOND/\"/>\n   <input id=\"_Class\"   value=\"5\" /></div>");
        Elements inputElement = doc.select("input");

        for (Element e: inputElement) {
            System.out.println(e.attr("id") + ": " + e.attr("value"));
        }
    }
}

Output:

_NAME: /John/
_LASTNAME: /BOND/
_Class: 5
