Remark: How to parse HTML tags and their contents in MDAST

Question

I'm trying to parse a GitHub-flavoured markdown file using Unified and Remark-Parse to generate an MDAST. I can parse most of it correctly and easily, but I'm having trouble extracting the HTML tags and their content from the AST.

In the AST, HTML tags and their contents are represented as siblings, not as parent and child. For example, <sub>hi</sub> is parsed into:

[
  {
    "type": "paragraph",
    "children": [
      {
        "type": "html",
        "value": "<sub>"
      },
      {
        "type": "text",
        "value": "hi"
      },
      {
        "type": "html",
        "value": "</sub>"
      }
    ]
  }
]

Ideally, I would want it to be parsed like:

[
  {
    "type": "paragraph",
    "children": [
      {
        "type": "html",
        "value": "sub",
        "children": [
          {
            "type": "text",
            "value": "hi"
          }
        ]
      }
    ]
  }
]

so that I can access the tag type and its content. (Specifically, my goal is just to skip over the tags and their content, as they are not needed for my purposes.)

This is the configuration I am currently using:

import unified from 'unified';
import markdown from 'remark-parse';
import type {Block} from '@notionhq/client/build/src/api-types';
import {parseRoot} from './internal';
import gfm from 'remark-gfm';

export function parseBody(body: string): Block[] {
  const tokens = unified().use(markdown).use(gfm).parse(body);
  return parseRoot(tokens);
}

So, my question is: is there a way of configuring Remark to do this, or is there a Remark plugin that does it? If not, how would I go about creating a plugin that does so?

Thanks.

Answer 1

First: why the AST looks as it does and why Remark most likely does not have an option to do it differently

The AST represents it that way because that is what the CommonMark specification specifies for raw inline HTML and for HTML blocks. Specifically, CommonMark specifies that HTML tags are passed through, not parsed.

For inline HTML, the spec supports inline HTML tags, which is not the same as supporting inline HTML. Tags are simply passed through as-is. There is no matching of opening and closing tags. The reasons for this are:

performance
parser complexity
HTML tags are only supported as a "use at your own risk", "last resort" option when Markdown doesn't have a feature you need.
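
Because the parser does no matching, any code that wants the tag name has to read it out of the `value` string of each `html` node itself. The following is a minimal sketch of that step, not part of Remark or any existing plugin. The name `describeHtmlValue`, the `TagInfo` type, and the regular expression are all assumptions made for illustration, and the pattern only recognises a node whose value is exactly one simple tag.

```typescript
// Hypothetical helper: classify the `value` of an mdast `html` node.
type TagInfo =
  | { kind: 'open' | 'close' | 'selfClosing'; name: string }
  | { kind: 'other' }; // comments, whole HTML blocks, anything else

function describeHtmlValue(value: string): TagInfo {
  // One lone tag: optional leading "/", a tag name, optional attributes,
  // optional trailing "/". Attribute values containing ">" are not handled.
  const m = /^<(\/?)([A-Za-z][A-Za-z0-9-]*)[^>]*?(\/?)>$/.exec(value.trim());
  if (!m) return { kind: 'other' };

  const name = (m[2] ?? '').toLowerCase();
  if (m[1]) return { kind: 'close', name };
  if (m[3]) return { kind: 'selfClosing', name };
  return { kind: 'open', name };
}
```

Applied to the two `html` nodes from the question, this reports an opening `sub` tag and a closing `sub` tag; pairing them up is still left to the caller.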

For a small number of HTML tags, open and close tag matching is supported at the block level: pre, script, style, and textarea, the last of which was added only recently, in v0.30 of the spec.

You can read the parts of the spec linked above, and search the discussions in the CommonMark forum, to better understand the whys, but to get right to the point, read:

This explanation within the spec for the choices made.
The Raw HTML section of this forum post (https://talk.commonmark.org/t/beyond-markdown/2787?u=vas) by the CommonMark spec author and maintainer, John MacFarlane (@jgm).
This forum question, and also this one, and @jgm's answers.

Second: what you can do about it

Remark is "part of the unified collective", which is an infrastructure centered around the processing of ASTs (abstract syntax trees). From your question, it sounds like you already get this.

There is lots of help on unified's pages for how to write plugins.

But the best way both to learn how to do this and to get a quick jump on an implementation is to look at the many existing mdast-specific manipulators.
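
For the specific goal in the question (skipping the tags and whatever sits between them), a small tree walk may already be enough. The sketch below is one possible approach under stated assumptions, not an existing Remark plugin: `MdNode`, `skipHtml`, `TAG`, and `VOID` are hypothetical names, the node type is a simplified stand-in for the real mdast types, and it assumes each inline `html` node holds exactly one tag, as in the AST shown in the question.

```typescript
// Simplified stand-in for mdast nodes (assumption; not the real mdast types).
interface MdNode {
  type: string;
  value?: string;
  children?: MdNode[];
}

// A lone opening, closing, or self-closing tag such as "<sub>", "</sub>", "<br/>".
const TAG = /^<(\/?)([A-Za-z][A-Za-z0-9-]*)[^>]*?(\/?)>$/;
// Void elements never get a closing tag, so they must not open a skipped region.
const VOID = new Set(['area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr']);

// Returns a copy of the tree without `html` nodes and without any sibling that
// lies between an opening tag and its matching closing tag.
function skipHtml(node: MdNode): MdNode {
  if (!node.children) return node;

  const kept: MdNode[] = [];
  const openTags: string[] = []; // stack of tag names currently open

  for (const child of node.children) {
    if (child.type === 'html') {
      const m = TAG.exec((child.value ?? '').trim());
      if (m) {
        const tagName = (m[2] ?? '').toLowerCase();
        if (m[1]) {
          // Closing tag: pop back to the matching opener, if there is one.
          const i = openTags.lastIndexOf(tagName);
          if (i !== -1) openTags.length = i;
        } else if (!m[3] && !VOID.has(tagName)) {
          openTags.push(tagName);
        }
      }
      continue; // the html node itself is always dropped
    }
    if (openTags.length === 0) kept.push(skipHtml(child));
  }
  return { ...node, children: kept };
}
```

With the setup from the question, this could be called directly on the parsed tree, for example `parseRoot(skipHtml(tokens))`. An unclosed opening tag hides the rest of its parent's children, and a paragraph consisting only of HTML is left with an empty `children` array, so you may want to filter those out afterwards.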
