
How can I extract text content only from the root element? - java, com.gargoylesoftware.htmlunit.html

I can't find a way to extract the text content of only the root element using com.gargoylesoftware.htmlunit.html. Here is an example:

<td>
  W 03:10 PM-04:25 PM
  <strong>
     <br>
     Hybrid (50%+ in-person)
  </strong>
</td>

I want to extract the text content of the root element ("td" in this case), but the call below also extracts the text content of the child element, which is the part I don't want:

private void extractTextContent(HtmlElement htmlElement) {
    String content = htmlElement.getTextContent();
    System.out.println(content);
}

Output:

W 03:10 PM-04:25 PMHybrid (50%+ in-person)

Desired output:

W 03:10 PM-04:25 PM

I've also tried another method, asText(), but it doesn't give me the desired output either. I couldn't find anyone with the same question about com.gargoylesoftware.htmlunit.html. Is there a method that extracts text content only from the root element?

EDIT: Thank you for the answer. I used the same idea of removing the child node to get my desired output. Here is the Java version:

private String extractTextContent(HtmlElement htmlElement) {
    // Detach the last child element (<strong> in the example) so that its
    // text is no longer included. Note: this modifies the page's DOM.
    DomNode child = htmlElement.getLastElementChild();
    if (child != null) {
        child.remove();
    }
    // Only the root element's remaining text is returned.
    return htmlElement.getTextContent();
}

You can try removing the child nodes before fetching textContent.

private String extractTextContent(HtmlElement htmlElement) {
    // Detach the last child element (<strong> in the example) so that its
    // text is no longer included. Note: this modifies the page's DOM.
    DomNode child = htmlElement.getLastElementChild();
    if (child != null) {
        child.remove();
    }
    // Only the root element's remaining text is returned.
    return htmlElement.getTextContent();
}

I have edited my answer with the Java syntax provided by @XYZ.
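
If you would rather not modify the page, another option is to walk the root element's direct children and keep only the text nodes. The sketch below is not from the original answer. It uses only the standard org.w3c.dom interfaces from the JDK, which HtmlUnit's DomNode implements, so an HtmlElement should be usable as the argument. The names OwnTextDemo, ownText and parseRoot are made up for illustration, and the demo parses an XML version of the snippet (with a self-closed <br/>) using the JDK parser instead of loading a real page with HtmlUnit.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class OwnTextDemo {

    /** Concatenates only the text nodes that are direct children of the given node. */
    static String ownText(Node element) {
        StringBuilder sb = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Skip child elements such as <strong>; keep the root's own text only.
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        // Trim the surrounding whitespace and line breaks from the markup.
        return sb.toString().trim();
    }

    /** Hypothetical demo helper: parses a well-formed XML snippet with the JDK parser. */
    static Node parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse snippet", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<td>\n  W 03:10 PM-04:25 PM\n"
                + "  <strong><br/>Hybrid (50%+ in-person)</strong>\n</td>";
        System.out.println(ownText(parseRoot(xml)));
    }
}
```

Run as-is, this prints "W 03:10 PM-04:25 PM". Unlike the remove-the-child approach, it leaves the DOM untouched and also works when the root has several child elements, not just a last one.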


