
Parsing XHTML with Jsoup 1.11

I am trying to parse an XHTML file with Jsoup, and it is stripping the closing slash from some of my tags. For example:

<link rel="stylesheet" type="text/css" href="/css/assessment.css" />

becomes

<link rel="stylesheet" type="text/css" href="/css/assessment.css">

I have tried some of the other answers here:

Jsoup: How to convert a String containing HTML to a XHTML document?

https://github.com/jhy/jsoup/issues/511

jsoup: differnt result after updating from 1.7.3 to 1.8.1, how to avoid this?

My latest attempt was:

    File input = new File("src\\main\\resources\\templates\\assessmenttemplate.html");
    Document doc = Jsoup.parse(input, "UTF-8", "");
    doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
    doc.outputSettings().charset("UTF-8");

I also tried changing the doctype:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

But the problem persists. How can I parse the XHTML without stripping the trailing slashes?

This worked. Setting the output syntax to XML makes Jsoup write empty elements with their closing slash:

    File input = new File("src\\main\\resources\\templates\\assessmenttemplate.html");
    Document doc = Jsoup.parse(input, "UTF-8", "");
    doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
    doc.outputSettings().escapeMode(Entities.EscapeMode.xhtml);
    doc.outputSettings().charset("UTF-8");
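
To check the effect without the template file, the same settings can be applied to an in-memory string. This is only a minimal sketch: the class name `XhtmlOutputDemo` and the helper `toXhtml` are hypothetical, and `prettyPrint(false)` is an added assumption to keep the output on one line. It needs Jsoup on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

public class XhtmlOutputDemo {

    // Hypothetical helper: parse with the default HTML parser,
    // then serialize using XML syntax so empty elements keep "/>".
    static String toXhtml(String html) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings()
           .syntax(Document.OutputSettings.Syntax.xml)   // the setting that preserves the slash
           .escapeMode(Entities.EscapeMode.xhtml)
           .charset("UTF-8")
           .prettyPrint(false);                           // assumption: compact output for easy comparison
        return doc.html();
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<link rel=\"stylesheet\" type=\"text/css\" href=\"/css/assessment.css\" />"
                + "</head><body></body></html>";
        // The <link> element should be written with its closing slash.
        System.out.println(toXhtml(html));
    }
}
```

Only the `syntax(...)` call affects the closing slash. The escape mode and charset control how entities and characters are written, which is why the earlier attempt without it still produced `<link ...>`.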
