Extract innerHtml out of body tag using jsoup
I am parsing HTML using jsoup and want to extract the inner HTML of the body tag.
So far I have tried document.body().children().outerHtml(), but it returns only the HTML elements and skips floating text (text not wrapped in any HTML tag) inside the body.
private String getBodyTag(final Document document) {
    return document.body().children().outerHtml();
}
Input:
<!DOCTYPE html>
<html lang="de">
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="stylesheet" type="text/css" href="assets/style.css">
</head>
<body>
<div>questions to improve formatting and clarity.</div>
<h3>Guided Mode</h3>
some sample raw/floating text
</body>
</html>
Expected:
<div>questions to improve formatting and clarity.</div>
<h3>Guided Mode</h3>
some sample raw/floating text
Actual:
<div>questions to improve formatting and clarity.</div>
<h3>Guided Mode</h3>
Please use this:
private String getBodyTag(final Document document) {
    return document.body().html();
}
You could try returning document.body().html() instead (jsoup's equivalent of innerHTML), so that it returns everything inside the body tag, including the text outside any tag.
As far as I know, the way you are trying to accomplish it does not work because children() returns only child elements, and the "raw text" is a text node, which is not considered a child element.
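The difference between the two calls can be checked with a small, self-contained sketch. The class name BodyInnerHtml and the helper looseText are made up for illustration and are not part of the original question; the jsoup calls themselves (children(), html(), childNodes(), TextNode) are standard API, and the sketch assumes jsoup is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical demo class, not from the original post.
public class BodyInnerHtml {

    // The approach from the question: children() yields only child
    // elements, so text nodes directly under <body> are dropped.
    public static String elementsOnly(Document document) {
        return document.body().children().outerHtml();
    }

    // The approach from the answer: html() serializes every child node
    // of <body>, including text that is not wrapped in a tag.
    public static String innerHtml(Document document) {
        return document.body().html();
    }

    // Illustrative helper: collects only the non-blank text nodes that
    // sit directly under <body>.
    public static List<String> looseText(Document document) {
        List<String> result = new ArrayList<>();
        for (Node node : document.body().childNodes()) {
            if (node instanceof TextNode && !((TextNode) node).isBlank()) {
                result.add(((TextNode) node).text().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><body>\n"
                + "<div>questions to improve formatting and clarity.</div>\n"
                + "<h3>Guided Mode</h3>\n"
                + "some sample raw/floating text\n"
                + "</body></html>";
        Document document = Jsoup.parse(html);

        System.out.println("children().outerHtml():");
        System.out.println(elementsOnly(document));
        System.out.println("body().html():");
        System.out.println(innerHtml(document));
        System.out.println("loose text nodes: " + looseText(document));
    }
}
```

Note that jsoup pretty-prints its output by default, so the whitespace of the returned string may differ from the input even though the content is the same.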