Matching PegDown+JSoup Output to PageDown Output

I am trying to parse and sanitize markdown on both the client side and the server side.

  • On the client side, I use PageDown as a markdown editor. This is exactly what StackOverflow uses, and it comes with a nifty preview box. The preview box shows sanitized HTML, so it removes things like <div> tags.

  • On the server side, I'm using PegDown and JSoup to parse and sanitize the markdown.

However, I'm finding cases where the outputs of the two differ. For example:

Input markdown: how are <div>tags</div> treated?

PageDown output: <p>how are tags treated?</p>

PegDown/JSoup output:

<p>how are </p>tags treated?
<p></p>

I'm not doing anything fancy with JSoup. Here's my code:

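import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;
import org.pegdown.PegDownProcessor;

public class Main {

    public static void main(String... args){

        PegDownProcessor pdp = new PegDownProcessor();

        String markdown = "how are <div>tags</div> treated?";

        // Convert the markdown to HTML
        String html = pdp.markdownToHtml(markdown);

        // Allow the relaxed tag set, minus <div>
        Whitelist whitelist = Whitelist.relaxed().removeTags("div");

        // Sanitize the HTML against the whitelist
        html = Jsoup.clean(html, whitelist);
        System.out.println(html);

        System.out.println("Done.");
    }
}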

I understand why this is happening, and I'm not surprised that two different systems generate two different outputs. My question is: how can I set up JSoup so that it simply removes the <div> tags instead of adding extra <p> tags?

My end goal is simply to have the server-side parsing/sanitizing generate results reasonably similar to the client-side parsing/sanitizing. If there are better ways to do that, I'm open to suggestions. I don't really care whether the two outputs are exactly identical, but things like extra <p> tags are going to be very noticeable to users, so I'm trying to eliminate this one major difference.

Bonus question: is there a list of the HTML tags and attributes that PageDown can output?

Edit: I've also tried using the OWASP sanitizer, but I get very similar results: the <div> tags are removed, but the <p> tags are "fixed" in the same way as above, which results in different HTML than PageDown's sanitizer produces.

how can I setup JSoup so that it simply removes the <div> tags instead of adding extra <p> tags?

The HTML5 specification does not allow a div element inside a p element. Jsoup honors the specification, which is why there are two p elements in the final HTML string.

To better understand why this happens, let's see how Jsoup#clean works in three steps:

  1. Parse the dirty HTML
  2. Adjust the resulting tree to honor the HTML5 spec
  3. Remove the denied tags

In Step 2, the first <p> tag is closed just before the opening <div>. The second p also gets its opening tag in this step: the stray </p> left over from the input has no open paragraph to close, so the parser inserts an opening tag for it. Since Jsoup cannot know where the content of that paragraph was meant to start, it leaves the second paragraph with the minimum content, i.e. nothing.

Steps 1 and 2 produce new HTML that satisfies the HTML5 specification. In Step 3, the div can now be removed, which leaves its text sitting between the two paragraphs.
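
Because the extra paragraph is created during parsing (Step 2), before the whitelist is applied (Step 3), the whitelist alone cannot prevent it. One possible workaround, which is not part of the answer above, is to strip the <div> tags from PegDown's HTML before Jsoup ever parses it. The following is a minimal sketch under that assumption; DivStripper and stripDivTags are hypothetical names, and the regex is deliberately simple (it would mishandle an attribute value that contains a '>' character). It is not a sanitizer on its own, so Jsoup.clean should still run afterwards.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes <div ...> and </div> tags but keeps their content.
final class DivStripper {

    // Matches an opening or closing div tag, with or without attributes.
    // The \b keeps longer tag names that merely start with "div" from matching.
    private static final Pattern DIV_TAG =
            Pattern.compile("</?div\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    private DivStripper() {
    }

    // Returns the input with every div tag deleted; all other markup is untouched.
    static String stripDivTags(String html) {
        return DIV_TAG.matcher(html).replaceAll("");
    }
}
```

In the question's Main class this would be called between the two existing steps, for example html = DivStripper.stripDivTags(pdp.markdownToHtml(markdown)), followed by the unchanged Jsoup.clean(html, whitelist) call. Assuming PegDown emits the div inline inside the paragraph, as the output in the question suggests, Jsoup would then see a single well-formed paragraph and have nothing to repair.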

My end goal is to simply have the server-side parsing/sanitizing generate reasonably similar results to the client-side parsing/sanitizing.

To avoid other cases like the one spotted here, you should use the same system on both the client and the server. Since PageDown is written in JavaScript, you can try running it inside a server-side JavaScript engine.

To name a few:

  • Nashorn (built into Java 8)
  • Rhino
  • V8

SAMPLE CODE

Here is a sample illustrating the use of Nashorn:

Caller.java

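import java.io.FileReader;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class Caller {
    public static void main(String... args) throws Exception {
        // Look up the Nashorn engine and load the script
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");
        engine.eval(new FileReader("script.js"));

        // Call a top-level JavaScript function by name
        Invocable invocable = (Invocable) engine;
        Object result = invocable.invokeFunction("myFunction", "fooValue");

        System.out.println(result);
        System.out.println(result.getClass());
    }
}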

script.js

function myFunction(foo) {
   // ...
}
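
One practical caveat when relying on Nashorn: it was removed from the JDK in Java 15, and getEngineByName returns null when no engine with the requested name is installed. The sketch below wraps the same ScriptEngine and Invocable calls in a helper that reports a missing engine instead of failing with a NullPointerException. JsRenderer and render are hypothetical names, and the one-line JavaScript function in the usage note is only a placeholder for the real PageDown scripts, whose wiring is not shown here.

```java
import java.util.Optional;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

// Hypothetical helper around the javax.script API used in Caller.java.
final class JsRenderer {

    private JsRenderer() {
    }

    // Evaluates the given script source, then calls the named top-level function
    // with a single string argument. Returns Optional.empty() when the Nashorn
    // engine is not available on this JVM, or when the function returns nothing.
    static Optional<String> render(String scriptSource, String functionName, String input)
            throws ScriptException, NoSuchMethodException {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("nashorn");
        if (engine == null) {
            return Optional.empty();
        }
        engine.eval(scriptSource);
        Object result = ((Invocable) engine).invokeFunction(functionName, input);
        return Optional.ofNullable(result).map(String::valueOf);
    }
}
```

A call such as JsRenderer.render("function toHtml(md) { return '<p>' + md + '</p>'; }", "toHtml", "hi") shows the shape of the round trip: a markdown string goes in, and whatever string the JavaScript function returns comes back to Java. On a JVM without Nashorn the same call yields an empty Optional, which the caller can use to fall back to another engine.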
