Java Jsoup - Element isn't removed from Elements
I'll start from the beginning. There's HTML with a pattern like this:
<div id="post_message_(some numeric id)">
<div style="some style things">
<div class="smallfont" style="some style">useless text</div>
<table cellpading="6" cellspaceing=.......> a lot of text inside i dont need</table>
</div>
Text i need
</div>
The styled divs and the table are optional; sometimes there's just:
<div id="post">
Text i need
</div>
I want to parse that text into a String. Here's the code I'm using:
Elements divsInside = element.getElementById("post_message_" + id).getElementsByTag("div");
for (Element div : divsInside) {
    if (div != null && div.attr("style").equals("margin:20px; margin-top:5px; ")) {
        System.out.println(div.html());
        div.remove();
        System.out.println("div removed");
    }
}
I added those print lines to check whether it finds them, and yes, it does find the correct ones. But later, when I parse it to a String:
String message = Jsoup.parse(divsInside.html().replaceAll("(?i)<br[^>]*>", "br2n")).text()
.replaceAll("br2n", "\n");
the String contains all the removed content again for some reason.
I tried removing the elements with iterators, and with a plain for loop that removes them by index, but the result is the same.
So you want to get "Text i need". Use Element's ownText() method, which "gets the text owned by this element only; does not get the combined text of all children".
import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Entities.EscapeMode;

private static void test(String htmlFile) {
    try {
        // Parse the file as ASCII with an empty base URI
        File input = new File(htmlFile);
        Document doc = Jsoup.parse(input, "ASCII", "");
        doc.outputSettings().charset("ASCII");
        doc.outputSettings().escapeMode(EscapeMode.base);
        // Look up the post container by its id
        Element specificIdDiv = doc.getElementById("post_message_1");
        if (specificIdDiv != null) {
            // ownText() returns only the text directly inside this div
            System.out.println("content: " + specificIdDiv.ownText());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
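The likely reason the question's approach kept the removed markup is that div.remove() detaches the element from the document but leaves it in the divsInside list, and Elements.html() combines the inner HTML of every element still in that list. Below is a minimal sketch of two ways around that, assuming an inline sample string instead of a file. The names OwnTextDemo, SAMPLE, ownTextOf and textAfterRemoval are made up for illustration and are not from the original post.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class OwnTextDemo {
    // Hypothetical sample that mirrors the structure described in the question
    static final String SAMPLE =
        "<div id=\"post_message_1\">"
        + "<div style=\"margin:20px; margin-top:5px; \">"
        + "<div class=\"smallfont\">useless text</div>"
        + "<table><tr><td>a lot of text</td></tr></table>"
        + "</div>"
        + "Text i need"
        + "</div>";

    // Option 1: read only the text nodes that are direct children of the div
    static String ownTextOf(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element post = doc.getElementById(id);
        return post == null ? "" : post.ownText();
    }

    // Option 2: remove the styled child divs, then read from the parent
    // element (not from a previously collected Elements list)
    static String textAfterRemoval(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element post = doc.getElementById(id);
        if (post == null) {
            return "";
        }
        // children() returns a separate list, so removing while iterating is safe
        for (Element child : post.children()) {
            if (child.hasAttr("style")) {
                child.remove();
            }
        }
        return post.text();
    }

    public static void main(String[] args) {
        System.out.println(ownTextOf(SAMPLE, "post_message_1"));
        System.out.println(textAfterRemoval(SAMPLE, "post_message_1"));
    }
}
```

Both calls in main print "Text i need" for this sample. Option 2 is only needed when the wanted text may also sit inside nested elements that ownText() would skip.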