
Parsing HTML to plain text, but retaining tag information per character

I'm looking for a method to parse HTML (or Markdown, but I can convert that into HTML) into plain text, but then identify which tags apply for each character in turn.

So, for instance, if I had the following HTML:

<p>Hello <em>world</em>!</p>

I would get the plain text:

Hello world!

And be able to query different characters to find out which tags apply:

Character 0 -> H -> p
Character 1 -> e -> p
Character 2 -> l -> p
...
Character 6 -> w -> p, em
Character 7 -> o -> p, em
...
Character 11 -> ! -> p

Can anyone suggest a method of doing this? It sounds like it shouldn't be too difficult, so I suspect I'm just searching for the wrong terminology to find something appropriate.

Ideally this would be using JSoup or something similar, but happy to take other approaches and libraries if they work!

UPDATE: Also, I need to be able to separate adjacent identical tags. So for the HTML:

<p>Hello</p><p>World</p>

I would be able to identify p#1 and p#2.

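import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.junit.Test;

public class CharMappingTest {
    @Test
    public void testCharMapping() {
        charMapping("<p>Hello <em>world</em>!</p>");
        charMapping("<p>Hello</p><p>World</p>");
    }

    // Prints each character of every element's own text next to that element's CSS selector.
    // Characters are grouped per element, so the output is not in document order.
    private void charMapping(String html) {
        System.out.println("char mapping for : " + html);
        for (Element e : Jsoup.parse(html).select("*")) {
            // ownText() covers only the element's direct text, not its children's text
            if (!e.ownText().isEmpty())
                for (char c : e.ownText().toCharArray())
                    System.out.println(c + " -> " + e.cssSelector());
        }
        System.out.println("====================");
    }
}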

Output:

char mapping for : <p>Hello <em>world</em>!</p>
H -> html > body > p
e -> html > body > p
l -> html > body > p
l -> html > body > p
o -> html > body > p
  -> html > body > p
! -> html > body > p
w -> html > body > p > em
o -> html > body > p > em
r -> html > body > p > em
l -> html > body > p > em
d -> html > body > p > em
====================
char mapping for : <p>Hello</p><p>World</p>
H -> html > body > p:nth-child(1)
e -> html > body > p:nth-child(1)
l -> html > body > p:nth-child(1)
l -> html > body > p:nth-child(1)
o -> html > body > p:nth-child(1)
W -> html > body > p:nth-child(2)
o -> html > body > p:nth-child(2)
r -> html > body > p:nth-child(2)
l -> html > body > p:nth-child(2)
d -> html > body > p:nth-child(2)
====================

You can do a stack-based parse of the HTML tags and the text between them, recording positions as elements are popped off the stack.
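
A minimal sketch of that stack-based idea in plain Java, with no library. The class name TagStackMapper, the Result holder, and the tag#n labels are all made up for illustration. As a variation on the suggestion above, it does not record spans at pop time; it snapshots the stack of open tags for every text character, which gives the per-character lookup directly. It assumes well-nested markup and does not handle void tags such as <br>, comments, attribute values containing '>', or entity decoding, so a real parser such as JSoup is still the safer choice for arbitrary HTML.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps each plain-text character to the tags that enclose it.
class TagStackMapper {

    // Plain text plus, for each character index, the enclosing tag labels (outermost first).
    static final class Result {
        final String text;
        final List<List<String>> tagsPerChar;

        Result(String text, List<List<String>> tagsPerChar) {
            this.text = text;
            this.tagsPerChar = tagsPerChar;
        }
    }

    static Result map(String html) {
        StringBuilder text = new StringBuilder();
        List<List<String>> tagsPerChar = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();      // open tags, outermost first
        Map<String, Integer> seen = new HashMap<>();  // how often each tag name has been opened
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                int end = html.indexOf('>', i);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated tag at index " + i);
                }
                String body = html.substring(i + 1, end).trim();
                if (body.startsWith("/")) {
                    // Closing tag: pop the innermost open tag (assumes well-nested input).
                    open.pollLast();
                } else if (!body.endsWith("/")) {
                    // Opening tag: number it so adjacent identical tags stay distinct.
                    String name = body.split("\\s+")[0].toLowerCase();
                    int n = seen.merge(name, 1, Integer::sum);
                    open.addLast(name + "#" + n);
                }
                // Self-closing tags (ending in "/") contribute no text and are skipped.
                i = end + 1;
            } else {
                // Text character: append it and snapshot the tags currently open.
                text.append(c);
                tagsPerChar.add(new ArrayList<>(open));
                i++;
            }
        }
        return new Result(text.toString(), tagsPerChar);
    }
}
```

With this sketch, TagStackMapper.map("<p>Hello <em>world</em>!</p>") yields the text "Hello world!", and tagsPerChar.get(6) is [p#1, em#1]. For "<p>Hello</p><p>World</p>", index 0 maps to [p#1] and index 5 to [p#2]. Unlike the ownText() loop above, the characters stay in document order, so the list index is the character index in the plain text.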
