
jsoup tag extraction problem


test: example
test1:example1
  Elements size = doc.select("div:contains(test:)"); 

How can I extract the values example and example1 from this HTML tag using jsoup?

Since this HTML is not semantic enough for your final purpose (a <br> cannot have children and : is not HTML), you can't do much with an HTML parser like Jsoup. An HTML parser isn't intended to do the job of specific text extraction or tokenizing.

The best you can do is get the HTML content of the <div> using Jsoup and then process it further with the usual java.lang.String or perhaps java.util.Scanner methods.

Here's a kickoff example:

String html = "<div style=\"height:240px;\"><br>test: example<br>test1:example1</div>";
Document document = Jsoup.parse(html);
Element div = document.select("div[style=height:240px;]").first();

// Jsoup may serialize <br> as <br> or <br />, depending on version and output settings, so match both.
String[] parts = div.html().split("<br\\s*/?>");
for (String part : parts) {
    int colon = part.indexOf(':');
    if (colon > -1) {
        System.out.println(part.substring(colon + 1).trim());
    }
}

This results in:

example
example1
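
The answer also mentions java.util.Scanner as an option. As a minimal sketch of that route, assuming the inner HTML of the <div> has already been obtained (for instance from div.html()), the same values could be pulled out with a Scanner that treats each <br> as a delimiter. The class name BrValueExtractor and the helper extractValues are hypothetical names chosen for illustration; they are not part of jsoup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class BrValueExtractor {

    // Hypothetical helper (not part of jsoup): takes the inner HTML of the <div>
    // and returns the trimmed text that follows the first colon in each segment.
    public static List<String> extractValues(String innerHtml) {
        List<String> values = new ArrayList<>();
        try (Scanner scanner = new Scanner(innerHtml)) {
            // Treat <br>, <br/> and <br /> (any letter case) as token delimiters.
            scanner.useDelimiter("(?i)<br\\s*/?>");
            while (scanner.hasNext()) {
                String part = scanner.next();
                int colon = part.indexOf(':');
                if (colon > -1) {
                    values.add(part.substring(colon + 1).trim());
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Assumed input: roughly what div.html() might return for the markup in the question.
        String innerHtml = "<br>test: example<br>test1:example1";
        for (String value : extractValues(innerHtml)) {
            System.out.println(value);
        }
    }
}
```

Segments without a colon are skipped, so stray whitespace or line breaks that the serializer may add around the <br> tags should not produce extra values.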

If I were the HTML author, I would have used a definition list for this, e.g.:

<dl id="mydl">
     <dt>test:</dt><dd>example</dd>
     <dt>test1:</dt><dd>example1</dd>
</dl>

This is more semantic and thus easier to parse:

String html = "<dl id=\"mydl\"><dt>test:</dt><dd>example</dd><dt>test1:</dt><dd>example1</dd></dl>";
Document document = Jsoup.parse(html);
// Select the <dd> elements, which hold the values.
Elements dds = document.select("#mydl dd");
for (Element dd : dds) {
    System.out.println(dd.text());
}

