Markdown paragraph tag regex

I'm after a regex which can support the following test case:

This should
all be
one match

#this should not match
1. nor this
> nor this
this should be a second match

That way I can wrap these two matches in <p> tags. However, I'm getting stuck on the newlines: two consecutive newlines should break the match, but a single newline should not. Here's the closest I've come:

(^[A-z].+)

This correctly grabs all the desired text, but forms four matches instead of the desired two.
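For anyone who wants to reproduce the four matches, here is a minimal JavaScript sketch. The `g` and `m` flags are an assumption (the question does not state them), and the names `sample` and `attempt` are only illustrative.

```javascript
// The test case from the question, one array entry per line.
const sample = [
  "This should",
  "all be",
  "one match",
  "",
  "#this should not match",
  "1. nor this",
  "> nor this",
  "this should be a second match",
].join("\n");

// The attempted pattern; the g and m flags are assumed here.
const attempt = /(^[A-z].+)/gm;

// `.` never crosses a newline, so each qualifying line becomes its own match.
const attemptMatches = sample.match(attempt);
console.log(attemptMatches.length); // 4
console.log(attemptMatches);
```

Note that `[A-z]` is wider than it looks: the range also covers the characters between `Z` and `a` in ASCII (`[`, `\`, `]`, `^`, `_` and the backtick), so `[A-Za-z]` is usually what is meant.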

On Regexr

Thank you for your time.

You can use

/^[A-Za-z].*(?:\n[A-Za-z].*)*/gm

See the regex demo.

Details

  • ^ - start of a line (due to the m modifier)
  • [A-Za-z] - an ASCII letter
  • .* - the rest of the line
  • (?:\n[A-Za-z].*)* - zero or more lines, each starting with an ASCII letter.
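As a quick check, the pattern above can be run against the test case from the question and the matches wrapped in <p> tags. This is a minimal sketch that assumes the input uses plain `\n` line endings; the variable names are illustrative.

```javascript
// The test case from the question, one array entry per line.
const input = [
  "This should",
  "all be",
  "one match",
  "",
  "#this should not match",
  "1. nor this",
  "> nor this",
  "this should be a second match",
].join("\n");

// The pattern from the answer above.
const paragraph = /^[A-Za-z].*(?:\n[A-Za-z].*)*/gm;

// Consecutive lines starting with a letter are grouped into one match.
const blocks = input.match(paragraph);
console.log(blocks.length); // 2

// "$&" in the replacement string stands for the whole matched text.
const html = input.replace(paragraph, "<p>$&</p>");
console.log(html);
```

With `\r\n` line endings the grouping would fail, because `.` in JavaScript stops at `\r` and the pattern then expects `\n`; normalizing line endings first avoids that.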

This is likely not something you can do with a single regular expression. While it is true that some Markdown implementations use regex as their primary tool for parsing Markdown, they use a series of expressions to do so (see the original implementation, markdown.pl, for example).

For example, you might have an expression which matches headers, an expression which matches list items, an expression which matches blockquotes and an expression which matches any block of text. Each of those expressions would be run against the input in turn. By the time the last expression runs, the earlier expressions in the series have already consumed the other elements. Therefore, the final expression for matching paragraphs does not need to account for headers, lists or blockquotes.

In fact, if you were to remove the header expression from markdown.pl, then all headers would simply be wrapped in <p> tags with the hashes (#) still included in the text.
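The series-of-expressions idea can be sketched roughly as follows. This is a toy illustration, not how markdown.pl or any real implementation works: the rule set, the `render` function and the NUL-character placeholder trick are all assumptions made for this example, and it handles only the three constructs from the question.

```javascript
// Hypothetical rules: each pairs a line pattern with an HTML builder.
const rules = {
  header: [/^(#{1,6})[ \t]*(.+)$/gm,
    (m, hashes, body) => `<h${hashes.length}>${body}</h${hashes.length}>`],
  listItem: [/^\d+\.[ \t]+(.+)$/gm, (m, body) => `<li>${body}</li>`],
  blockquote: [/^>[ \t]?(.+)$/gm, (m, body) => `<blockquote>${body}</blockquote>`],
};

function render(markdown, activeRules = Object.values(rules)) {
  const stash = []; // HTML already produced by earlier rules
  let text = markdown;

  // Run each rule in turn. Whatever a rule matches is "consumed": it is
  // replaced by a placeholder (NUL, index, NUL) that later rules cannot match.
  // Assumes the input itself contains no NUL characters.
  for (const [pattern, toHtml] of activeRules) {
    text = text.replace(pattern, (...args) => {
      stash.push(toHtml(...args));
      return `\u0000${stash.length - 1}\u0000`;
    });
  }

  // Final rule: any remaining run of non-empty lines is a paragraph.
  text = text.replace(/^[^\u0000\n].*(?:\n[^\u0000\n].*)*/gm, "<p>$&</p>");

  // Put the stashed HTML back in place of the placeholders.
  return text.replace(/\u0000(\d+)\u0000/g, (m, index) => stash[Number(index)]);
}

const doc = [
  "This should", "all be", "one match", "",
  "#this should not match", "1. nor this", "> nor this",
  "this should be a second match",
].join("\n");

console.log(render(doc));
// Dropping the header rule leaves the header line to the paragraph rule.
console.log(render(doc, [rules.listItem, rules.blockquote]));
```

In the second call the header line comes out as `<p>#this should not match</p>`, hash included, which is the effect described above. The paragraph rule itself never mentions headers, lists or blockquotes; it only has to skip what was already consumed.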

My point is that you would need to implement a full Markdown parser. However, many implementations already exist, and you would likely be better off using one of them. In fact, most modern implementations generate an abstract syntax tree (AST) rather than doing regex substitutions (as pointed out in another answer).

Grammars like Markdown/CommonMark can't be [easily, if at all] parsed with a regular expression.

Use a proper parser that will produce an AST you can manipulate. For instance,
