Java Jsoup - Element isn't removed from Elements
I'll start from the beginning. I have HTML with the following pattern:
<div id="post_message_(some numeric id)">
<div style="some style things">
<div class="smallfont" style="some style">useless text</div>
<table cellpading="6" cellspaceing=.......> a lot of text inside i dont need</table>
</div>
Text i need
</div>
The styled divs and the table are optional; sometimes it is just:
<div id="post">
Text i need
</div>
I want to parse the text into a String. This is the code I'm using:
Elements divsInside = element.getElementById("post_message_" + id).getElementsByTag("div");
for (Element div : divsInside) {
    if (div != null && div.attr("style").equals("margin:20px; margin-top:5px; ")) {
        System.out.println(div.html());
        div.remove();
        System.out.println("div removed");
    }
}
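I added those print lines to check whether the divs are found, and yes, the correct ones are found. But later, when I parse the result into a String:
String message = Jsoup.parse(divsInside.html().replaceAll("(?i)<br[^>]*>", "br2n")).text()
        .replaceAll("br2n", "\n");
for some reason the string contains all the removed content again.
I tried removing them through an iterator, and with a plain for loop that removes the elements by index, but the result was the same.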
So you want to get "Text i need". Use Element's ownText() method, which "Gets the text owned by this element only; does not get the combined text of all children."
import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Entities.EscapeMode;

private static void test(String htmlFile) {
    try {
        // Parse the file as ASCII with an empty base URI.
        File input = new File(htmlFile);
        Document doc = Jsoup.parse(input, "ASCII", "");
        doc.outputSettings().charset("ASCII");
        doc.outputSettings().escapeMode(EscapeMode.base);

        // Look up the post by id ("post_message_1" is the example id).
        Element specificIdDiv = doc.getElementById("post_message_1");
        if (specificIdDiv != null) {
            // ownText() returns only this element's direct text, not its children's.
            System.out.println("content: " + specificIdDiv.ownText());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
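As for why the removed content showed up again in the question's code: remove() detaches an element from the document tree, but the Elements list returned earlier by getElementsByTag still holds a reference to it, so calling html() on that list still renders the detached markup. A minimal sketch of this, assuming the sample markup from the question with a made-up id suffix of 1 (the class name RemoveThenRead and the helper removeStyledDivs are hypothetical names for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class RemoveThenRead {
    // Sample markup modeled on the question; the id suffix "1" is an assumption.
    static final String HTML =
        "<div id=\"post_message_1\">"
      + "<div style=\"margin:20px; margin-top:5px; \">"
      + "<div class=\"smallfont\">useless text</div>"
      + "<table><tr><td>a lot of text</td></tr></table>"
      + "</div>"
      + "Text i need"
      + "</div>";

    // Removes the styled wrapper divs from the post and returns the list
    // that was iterated, so the caller can compare it with the live element.
    static Elements removeStyledDivs(Element post) {
        Elements divsInside = post.getElementsByTag("div");
        for (Element div : divsInside) {
            if (div.attr("style").equals("margin:20px; margin-top:5px; ")) {
                // Detaches the div from the document; divsInside still holds it.
                div.remove();
            }
        }
        return divsInside;
    }

    public static void main(String[] args) {
        Element post = Jsoup.parse(HTML).getElementById("post_message_1");
        Elements divsInside = removeStyledDivs(post);

        // The stale list still renders the detached div and its children.
        System.out.println("list still has useless text: "
                + divsInside.html().contains("useless text"));

        // Reading from the live element reflects the removal.
        System.out.println("post text: " + post.text());
    }
}
```

If you prefer to keep the removal approach (for example, to keep the `<br>` replacement from the question), reading post.html() from the element itself instead of divsInside.html() should give the cleaned markup. When only the element's direct text is needed, ownText() avoids the removal step entirely.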