Extracting the HTML part from a string using Java (the string contains both plain text and HTML)

Question

I have an InputStream from an email that can be converted to a String like this:

String content = "Hello world!\n"+
                 "Thank you!\n"+
                 "\n"+
                 "<html>\n" +
                 "<head>\n" +
                 "\t<meta id=\"leadId\" name=\"leadId\" content=\"6778130\"/>\n" +
                 "\t<title>testing</title>\n" +
                 "</head>\n" +
                 "<body>\n" +
                 "\t<span>testing - 20200727</span>\n" +
                 "</body>\n" +
                 "</html>"+
                 "\n" + 
                 "Have a good day!";

I want to extract the HTML part from this string. The result I expect is:

<html>
<head>
    <meta id="leadId" name="leadId" content="6778130"/>
    <title>testing</title>
</head>
<body>
    <span>testing - 20200727</span>
</body>
</html>

I tried Jsoup before, but it didn't work for me. Does anyone know of other solutions? Can I use javax.mail for this (on the InputStream itself)? If so, how can I do that? Could you provide an example?

Answer 1

My approach: use a regex to extract the bit of text you're interested in.

https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

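 // Needs java.util.regex.Pattern and java.util.regex.Matcher.
 // DOTALL lets '.' match line breaks, so the pattern can span the multi-line HTML block.
 Pattern p = Pattern.compile("<html>.*</html>", Pattern.DOTALL);
 Matcher m = p.matcher(inputString);
 // group() throws IllegalStateException unless find() has succeeded first.
 String html = m.find() ? m.group() : null;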

You could then use Jsoup.parse(html); to parse the HTML and navigate the elements (or use HtmlUnit if you want to navigate the document using XPath).
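For reference, the regex approach from this answer can be put together as a small self-contained program. This is only a sketch under a few assumptions: the class name HtmlExtractor and the method extractHtml are made up for illustration, the input contains at most one HTML document, and the pattern is slightly looser than the one above (case-insensitive, and tolerant of attributes on the opening <html> tag). It uses only the standard java.util.regex classes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class HtmlExtractor {

    // DOTALL: '.' also matches line terminators, so the match can span lines.
    // CASE_INSENSITIVE: also accepts <HTML> ... </HTML>.
    // "<html\\b[^>]*>" tolerates attributes on the opening tag (an assumption
    // about what real emails may contain; the sample input has none).
    private static final Pattern HTML_BLOCK = Pattern.compile(
            "<html\\b[^>]*>.*</html\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the first <html>...</html> block, or empty if there is none.
    // The greedy ".*" extends the match to the LAST closing </html> tag.
    static Optional<String> extractHtml(String content) {
        Matcher m = HTML_BLOCK.matcher(content);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String content = "Hello world!\nThank you!\n\n"
                + "<html>\n<body>\n\t<span>testing - 20200727</span>\n</body>\n</html>"
                + "\nHave a good day!";
        // Prints only the HTML block, without the surrounding plain text.
        System.out.println(extractHtml(content).orElse("(no HTML found)"));
    }
}
```

Returning an Optional instead of null makes the "no HTML in this message" case explicit for the caller. Note that a regex like this only locates the block by its outer tags; it does not validate the HTML, which is why handing the result to a real parser afterwards is still a good idea.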

Answer 1
0 2020-07-31 11:12:02
