How can I use HtmlUnit to extract the text of a web page without the HTML markup?

Question

I'm just getting started with HtmlUnit, and what I'd like to do is take a web page and extract the raw text from it, minus all the HTML markup.

Can HtmlUnit accomplish that? If so, how? Or is there another library I should be looking at?

For example, if the page contains

<body><p>para1 test info</p><div><p>more stuff here</p></div>

I'd like it to output

para1 test info more stuff here

Thanks

Answer 1

The getting-started guide at http://htmlunit.sourceforge.net/gettingStarted.html indicates that this is indeed possible.

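import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HomePageTest {
    @Test
    public void homePage() throws Exception {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
        assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());

        // The page serialized back to XML, markup included.
        final String pageAsXml = page.asXml();
        assertTrue(pageAsXml.contains("<body class=\"composite\">"));

        // The page as plain text, markup stripped.
        final String pageAsText = page.asText();
        assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));
    }
}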

NB: the page.asText() method seems to offer exactly what you are after.

Javadoc for asText (inherited by HtmlPage from DomNode)
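
On the "another library" part of the question: for a small fragment like the one shown, a rough alternative is the HTML parser that ships with the JDK (javax.swing.text.html), which needs no extra dependency. The sketch below is only an illustration, not part of HtmlUnit; the class name PlainTextSketch and the method stripMarkup are made up for this example, and joining the text runs with a single space is an assumption chosen to match the one-line output asked for above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the names here are illustrative only.
class PlainTextSketch {

    // Collects every text run the parser reports and joins the runs with single spaces.
    static String stripMarkup(String html) throws IOException {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                String chunk = new String(data).trim();
                if (chunk.isEmpty()) {
                    return; // skip whitespace-only runs
                }
                if (text.length() > 0) {
                    text.append(' '); // assumption: one space between text runs
                }
                text.append(chunk);
            }
        };
        // true = ignore any charset declared in the markup (the input is already a String)
        new ParserDelegator().parse(new StringReader(html), collector, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<body><p>para1 test info</p><div><p>more stuff here</p></div>";
        System.out.println(stripMarkup(html));
    }
}
```

This only parses static markup from a string. It does not fetch pages or run JavaScript, so for real sites the HtmlUnit approach above is probably still the better fit.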

Solution 1
5 Accepted 2010-07-07 05:15:10