
Parsing HTML to plain text, but retaining tag information per character

I'm looking for a way to parse HTML (or Markdown, which I can convert to HTML) into plain text, and then identify which tags apply to each character in turn.

So, for instance, if I had the following HTML:

<p>Hello <em>world</em>!</p>

I would get the plain text:

Hello world!

And be able to query individual characters to find out which tags apply:

Character 0 -> H -> p
Character 1 -> e -> p
Character 2 -> l -> p
...
Character 6 -> w -> p, em
Character 7 -> o -> p, em
...
Character 11 -> ! -> p

Can anyone suggest a method of doing this? It sounds like it shouldn't be too difficult, so I suspect I'm just searching with the wrong terminology.

Ideally this would use JSoup or something similar, but I'm happy to take other approaches and libraries if they work!

UPDATE: Also, I need to be able to distinguish adjacent identical tags. So for the HTML:

<p>Hello</p><p>World</p>

I would be able to identify p#1 and p#2.

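import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.junit.Test; // assuming JUnit 4

@Test
public void testCharMapping() {
    charMapping("<p>Hello <em>world</em>!</p>");
    charMapping("<p>Hello</p><p>World</p>");
}

// Prints every character of each element's own text with that element's CSS selector.
// ownText() covers only an element's direct text nodes, so characters are grouped
// by element and are not emitted in document order.
private void charMapping(String html) {
    System.out.println("char mapping for : " + html);
    for (Element e : Jsoup.parse(html).select("*")) {
        if (!e.ownText().isEmpty())
            for (char c : e.ownText().toCharArray())
                System.out.println(c + " -> " + e.cssSelector());
    }
    System.out.println("====================");
}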

Output:

char mapping for : <p>Hello <em>world</em>!</p>
H -> html > body > p
e -> html > body > p
l -> html > body > p
l -> html > body > p
o -> html > body > p
  -> html > body > p
! -> html > body > p
w -> html > body > p > em
o -> html > body > p > em
r -> html > body > p > em
l -> html > body > p > em
d -> html > body > p > em
====================
char mapping for : <p>Hello</p><p>World</p>
H -> html > body > p:nth-child(1)
e -> html > body > p:nth-child(1)
l -> html > body > p:nth-child(1)
l -> html > body > p:nth-child(1)
o -> html > body > p:nth-child(1)
W -> html > body > p:nth-child(2)
o -> html > body > p:nth-child(2)
r -> html > body > p:nth-child(2)
l -> html > body > p:nth-child(2)
d -> html > body > p:nth-child(2)
====================

You can do a stack-based parse of the HTML tags and the text between them, storing positions as you pop elements off the stack.
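As a rough illustration of the stack-based idea, here is a minimal sketch in plain Java with no JSoup. It is not a real HTML parser: it assumes well-formed input with no void or self-closing elements, no comments, and no entities. The class name `CharTagMapper`, its `Result` type, and the `name#n` labelling are all made up for this example. It also takes a small liberty with the suggestion above: instead of storing positions on pop, it snapshots the stack of open tags for every text character, which gives the per-character lookup the question asks for directly.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps each plain-text character to the tags enclosing it. */
final class CharTagMapper {

    /** Plain text plus, for each character index, its enclosing tags (outermost first). */
    static final class Result {
        final StringBuilder text = new StringBuilder();
        final List<List<String>> tags = new ArrayList<>();
    }

    static Result map(String html) {
        Result result = new Result();
        Deque<String> open = new ArrayDeque<>();      // stack of currently open tags
        Map<String, Integer> seen = new HashMap<>();  // occurrences so far, per tag name
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                int end = html.indexOf('>', i);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated tag at index " + i);
                }
                String tag = html.substring(i + 1, end).trim();
                if (tag.startsWith("/")) {
                    open.removeLast();                // closing tag: pop the innermost element
                } else {
                    String name = tag.split("\\s+")[0];          // drop any attributes
                    int n = seen.merge(name, 1, Integer::sum);   // number repeated tags: p#1, p#2
                    open.addLast(name + "#" + n);     // opening tag: push
                }
                i = end + 1;
            } else {
                result.text.append(c);
                result.tags.add(new ArrayList<>(open)); // snapshot of the stack for this character
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = map("<p>Hello</p><p>World</p>");
        for (int i = 0; i < r.text.length(); i++) {
            System.out.println("Character " + i + " -> " + r.text.charAt(i)
                    + " -> " + String.join(", ", r.tags.get(i)));
        }
    }
}
```

For `<p>Hello</p><p>World</p>` this reports `p#1` for characters 0 to 4 and `p#2` for characters 5 to 9, so adjacent identical tags stay distinguishable. For real-world HTML, the same push-on-open, pop-on-close bookkeeping would be better driven by a proper parser's traversal than by this hand-rolled scanner.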
