How to extract the text without HTML tags from a webpage using HtmlUnit?
I'm just getting started with HtmlUnit, and what I'd like to do is take a webpage and extract the raw text from it, minus all the HTML markup.
Can HtmlUnit do that? If so, how? Or is there another library I should be looking at?
For example, if the page contains
<body><p>para1 test info</p><div><p>more stuff here</p></div>
I'd like it to output
para1 test info more stuff here
Thanks.
The HtmlUnit getting-started guide (http://htmlunit.sourceforge.net/gettingStarted.html) indicates that this is indeed possible:
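    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertTrue;
    import org.junit.Test;
    // HtmlUnit 2.x package names; HtmlUnit 3.x uses org.htmlunit instead.
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class HomePageTest {
        @Test
        public void homePage() throws Exception {
            final WebClient webClient = new WebClient();
            final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
            assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());
            final String pageAsXml = page.asXml();
            assertTrue(pageAsXml.contains("<body class=\"composite\">"));
            // asText() returns the rendered text with all markup stripped.
            final String pageAsText = page.asText();
            assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));
        }
    }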
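Note: page.asText() returns the page's text without the markup, which seems to be exactly what you are after. (Newer HtmlUnit releases deprecate asText() in favor of asNormalizedText(), so check which one your version provides.)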
See the Javadoc for asText (inherited by HtmlPage from DomNode).
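On the "another library" part of the question: if you only need to strip tags from static HTML and don't need HtmlUnit's JavaScript support, the HTML parser that ships with the JDK (javax.swing.text.html) may be enough. The following is only a rough sketch under that assumption; the class name PlainTextExtractor and the method extractText are made up for illustration, and joining the text chunks with a single space is my own choice rather than anything HtmlUnit does.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of HtmlUnit: collects the text chunks that the
// JDK's built-in HTML parser reports and joins them with single spaces.
class PlainTextExtractor {

    static String extractText(String html) throws IOException {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text found between tags.
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(data);
            }
        };
        // The third argument tells the parser to ignore any charset declaration,
        // which is fine here because the input is already a Java String.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // The sample markup from the question.
        String html = "<body><p>para1 test info</p><div><p>more stuff here</p></div>";
        System.out.println(extractText(html));
    }
}
```

Running main on the sample markup from the question should print the text of the two paragraphs with the tags removed. Unlike HtmlUnit, this approach does not fetch pages or execute JavaScript, so for content that is built dynamically page.asText() remains the better fit.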