Remove HTML entities with jsoup in Android

I use jsoup to scrape HTML. I am having problems extracting information from HTML tags of the following kind:

<span class="some">&#8237;&#8237;78&#8236;&#8236;</span>

It should just be:

<span class="some">78</span>

How can I remove the HTML entities from the string?

I'm not familiar with jsoup, but if it is a "normal" HTML DOM parser that returns a "standard" HTML DOM, then what you want is not really possible. The problem is that once the DOM has been built, it can no longer distinguish between characters that were encoded normally and ones expressed as entities.

For example, <span>A</span> and <span>&#65;</span> are considered completely identical and can't be distinguished once they are in the DOM: both are span elements containing a text node with the text A.
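To make that concrete, here is a minimal sketch of what a parser does with decimal numeric character references. It uses only the Java standard library; `NumericEntityDemo` and `decode` are hypothetical names for illustration and are not part of jsoup, and a real HTML parser also handles hex and named entities.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a jsoup API): decodes decimal numeric
// character references such as &#65; into the characters they denote.
final class NumericEntityDemo {
    // Matches "&#", one to seven decimal digits, then ";".
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    static String decode(String html) {
        Matcher m = DECIMAL_REF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Leave the reference untouched if it is not a valid code point.
            String decoded = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // After decoding, the entity form and the raw form are the same string.
        System.out.println(decode("<span>&#65;</span>").equals("<span>A</span>")); // true
        // &#8237; and &#8236; become the invisible characters U+202D and U+202C.
        System.out.println(decode("&#8237;78&#8236;").length()); // 4
    }
}
```

Once decoding has happened, nothing in the resulting string records whether a character was originally written as an entity, which is why the cleanup has to target the characters themselves.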

So what you can do is loop over all text nodes and search for and replace these characters (not the entities):

import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Recursively strips U+202C and U+202D from every text node below the given element.
void removeInvalidChars(Element element) {
  for (Node child : element.childNodes()) {
    if (child instanceof TextNode) {
      TextNode textNode = (TextNode) child;
      // 202C and 202D are the hex codes for the decimal values 8236 and 8237
      textNode.text(textNode.text().replace("\u202C", "").replace("\u202D", ""));
    } else if (child instanceof Element) {
      removeInvalidChars((Element) child);
    }
  }
}
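If the pages may contain other invisible directional characters as well, the two chained replacements could be swapped for a single regex. The following is only a sketch under that assumption: the class name `BidiControlStripper` is hypothetical, and the wider character set (U+200E, U+200F, U+202A to U+202E, U+2066 to U+2069) goes beyond the two characters in the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes Unicode bidirectional control characters
// from a string. The character set is an assumption that is wider than
// the two characters (U+202C, U+202D) from the question.
final class BidiControlStripper {
    // U+200E/U+200F: directional marks
    // U+202A..U+202E: embeddings, overrides, and pop directional formatting
    // U+2066..U+2069: directional isolates
    private static final Pattern BIDI_CONTROLS =
            Pattern.compile("[\\u200E\\u200F\\u202A-\\u202E\\u2066-\\u2069]");

    static String strip(String text) {
        return BIDI_CONTROLS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // The span content from the question, with the entities already decoded.
        String raw = "\u202D\u202D78\u202C\u202C";
        System.out.println(strip(raw)); // 78
    }
}
```

Inside removeInvalidChars, the text node update would then become textNode.text(BidiControlStripper.strip(textNode.text())).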

If you need to distinguish between raw characters and entities, then you'll need to use a different, non-DOM (e.g. event-based) HTML parser.

http://jsoup.org/apidocs/org/jsoup/select/Elements.html

Press Ctrl+F and search for "remove".
