Extract the HTML part from a string (containing both plain text and HTML) using Java
I have an InputStream from an email, which can be converted to a string like this:
String content = "Hello world!\n"+
"Thank you!\n"+
"\n"+
"<html>\n" +
"<head>\n" +
"\t<meta id=\"leadId\" name=\"leadId\" content=\"6778130\"/>\n" +
"\t<title>testing</title>\n" +
"</head>\n" +
"<body>\n" +
"\t<span>testing - 20200727</span>\n" +
"</body>\n" +
"</html>"+
"\n" +
"Have a good day!";
I want to extract the HTML part from this string. The result I expect is:
<html>
<head>
<meta id="leadId" name="leadId" content="6778130"/>
<title>testing</title>
</head>
<body>
<span>testing - 20200727</span>
</body>
</html>
I tried Jsoup before, but it didn't work for me. Does anyone know of another solution? Can I use javax.mail for this (on the input stream itself)? If so, how? Could you provide an example?
My approach: use a regex to extract the bit of text you're interested in.
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
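import java.util.regex.Matcher;
import java.util.regex.Pattern;
// DOTALL lets '.' match line breaks; find() must succeed before group() is called
Pattern p = Pattern.compile("<html>.*</html>", Pattern.DOTALL);
Matcher m = p.matcher(inputString);
String html = m.find() ? m.group() : null; // null when no <html>...</html> block is present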
You could then use
Jsoup.parse(html);
to parse the HTML and navigate the elements. (Or use HtmlUnit if you want to navigate the document using XPath.)
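For reference, here is a minimal, self-contained sketch of the regex approach. The class name HtmlExtractor and the method extractHtml are hypothetical names chosen for illustration. The pattern also tolerates attributes on the opening tag and upper-case tags, which is an assumption beyond the original question. Two details matter: without Pattern.DOTALL the '.' does not match line breaks, so a multi-line document would never match, and Matcher.group() throws IllegalStateException unless find() has succeeded first.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of any library.
class HtmlExtractor {

    // DOTALL: '.' also matches line breaks. CASE_INSENSITIVE: also accepts <HTML>.
    // "<html\\b[^>]*>" tolerates attributes such as <html lang="en"> (an assumption).
    private static final Pattern HTML_BLOCK = Pattern.compile(
            "<html\\b[^>]*>.*</html\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the text from the first <html ...> tag through the last </html> tag, if any.
    static Optional<String> extractHtml(String content) {
        Matcher m = HTML_BLOCK.matcher(content);
        // find() must succeed before group() may be called.
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String content = "Hello world!\nThank you!\n\n"
                + "<html>\n<head>\n\t<title>testing</title>\n</head>\n"
                + "<body>\n\t<span>testing - 20200727</span>\n</body>\n</html>"
                + "\nHave a good day!";
        System.out.println(extractHtml(content).orElse("(no HTML part found)"));
    }
}
```

Returning an Optional makes the "no HTML found" case explicit instead of relying on null. Because the quantifier is greedy, the match runs to the last </html> in the input, which suits a message containing a single HTML document.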