HTML parser for replacing hyperlinks in an HTML document while retaining the original HTML tags and line breaks

Question

I am using the Jsoup HTML parser to replace hyperlinks in an HTML document. I want the original letter case, elements, and line breaks to stay as they are after the document is updated. However, Jsoup converts the case to lowercase, rewrites a few elements, and removes the line breaks. I have also tried ParseSettings, but with parse settings, doc.select("a[href]") does not return the elements. Below is the code I am using.

Can someone help me find the right HTML parser for Java that replaces hyperlinks while keeping the rest of the HTML document as is?

File input = new File(fileEntry.getPath());
Parser parser = Parser.htmlParser();
parser.settings(new ParseSettings(true, true)); 
Document doc = parser.parseInput(input.toString(), "UTF-8");
Elements anchorLinks = doc.select("a[href]");

Answer 1

The documentation is your friend… even when there is no description in that documentation.

Notice that the first argument of parseInput is named html and the second argument is named baseUri.

The first argument needs to be actual HTML content, not a filename. Your code is trying to parse a filename as if it were HTML.
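
To make the difference between a path and its content concrete, here is a minimal sketch that uses only the JDK. The class name, the temporary file, and the sample markup are made up for illustration and are not part of Jsoup or of the question's code. It shows that File.toString() yields the file-system path, whereas the markup has to be read from disk before it can serve as an html argument.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical demo class; not part of Jsoup or of the question's code.
final class PathVersusContent {

    // Reads the whole file as UTF-8 text; this is the kind of string an html argument expects.
    static String readHtml(File file) {
        try {
            return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates a throwaway file that holds the given sample markup.
    static File writeSample(String html) {
        try {
            File file = File.createTempFile("sample", ".html");
            file.deleteOnExit();
            Files.write(file.toPath(), html.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample markup with upper-case tags and line breaks (assumed for illustration).
        File input = writeSample("<P>\n<A HREF=\"old.html\">link</A>\n</P>");
        System.out.println("input.toString(): " + input);           // a file-system path
        System.out.println("readHtml(input):  " + readHtml(input)); // the markup itself
    }
}
```

Passing the result of readHtml(input), rather than input.toString(), is what gives a parser real markup to work on.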

The second argument needs to be a URI, or an empty string. "UTF-8" is a character-set name, not a usable base URI, though since you aren't trying to resolve the links, it may not be a critical mistake.
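
As a rough illustration of what a base URI is for, the sketch below uses java.net.URI from the JDK rather than Jsoup's own resolution logic, so it only approximates what a parser does with baseUri. The class name is hypothetical and example.com is a placeholder location. It resolves a relative href against a real base and checks whether a candidate string carries a scheme at all.

```java
import java.net.URI;

// Hypothetical demo class; example.com is a placeholder base location.
final class BaseUriDemo {

    // Resolves a possibly relative href against a base URI, following java.net.URI rules.
    static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    // A base can only turn relative links into absolute ones if it carries a scheme such as https.
    static boolean hasScheme(String candidate) {
        return URI.create(candidate).isAbsolute();
    }

    public static void main(String[] args) {
        System.out.println(resolve("https://example.com/docs/", "page.html"));
        System.out.println(hasScheme("https://example.com/docs/"));
        System.out.println(hasScheme("UTF-8")); // no scheme, so it cannot anchor relative links
    }
}
```

If the links in the document are only rewritten and never resolved to absolute form, an empty base is enough.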

You probably want the Jsoup.parse method that takes both an InputStream and a customized Parser:

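// Reuses the File "input" and the customized "parser" from the question.
Document doc;
try (InputStream content = new BufferedInputStream(
    new FileInputStream(input))) {

    // null charset: detect it from the document, falling back to UTF-8;
    // "" means there is no base URI.
    doc = Jsoup.parse(content, null, "", parser);
}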

Answer 1
0 2020-11-05 12:24:18
