Remove HTML comments from a Markdown file

Question

This would come in handy when converting from Markdown to HTML, for example, if one needs to prevent comments from appearing in the final HTML source.

Example input my.md:

# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...
<!--
    ... due to a general shortage in the Y market
    TODO make sure to verify this before we include it here
-->
best,
me <!-- ... or should i be more formal here? -->

Example output my-filtered.md:

# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...

best,
me

On Linux, I would do something like this:

cat my.md | remove_html_comments > my-filtered.md

I am also able to write an AWK script that handles some common cases, but as I understand it, neither AWK nor any of the other common tools for simple text manipulation (like sed) is really up to this job. One would need to use an HTML parser.

How would one write a proper remove_html_comments script, and with what tools?

Answer 1

I see from your comment that you mostly use Pandoc.

Pandoc version 2.0, released October 29, 2017, adds a new option --strip-comments. The related issue provides some context for this change.

Upgrading to the latest version and adding --strip-comments to your command should remove HTML comments as part of the conversion process.
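
If the conversion is driven from a script, the option can be passed like any other flag. The following is only a minimal sketch, assuming Pandoc 2.0 or later is installed and on the PATH; the helper names build_pandoc_command and convert, and the file names, are hypothetical and chosen for illustration.

```python
import shutil
import subprocess


def build_pandoc_command(source, target):
    """Return the argument list for a pandoc run that drops HTML comments.

    Hypothetical helper: only --strip-comments and -o come from pandoc itself.
    """
    return ["pandoc", "--strip-comments", source, "-o", target]


def convert(source, target):
    """Run pandoc on `source`, writing `target`; fails if pandoc is missing."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(build_pandoc_command(source, target), check=True)


if __name__ == "__main__":
    # Show the command that would be executed for the example file.
    print(" ".join(build_pandoc_command("my.md", "my.html")))
```

Keeping the command construction separate from the call to subprocess.run makes it possible to inspect the command without having Pandoc installed.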

Answer 2

It might be a bit counter-intuitive, but I would use an HTML parser.

Example with Python and BeautifulSoup:

import sys
from bs4 import BeautifulSoup, Comment

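# Read the whole Markdown document from standard input.
md_input = sys.stdin.read()

# The "html5lib" parser has to be installed separately (pip install html5lib).
soup = BeautifulSoup(md_input, "html5lib")

# Find every comment node and remove it from the tree.
for element in soup.find_all(string=lambda text: isinstance(text, Comment)):
    element.extract()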

# bs4 wraps the text in <html><head></head><body>…</body></html>,
# so we need to extract it:

output = "".join(map(str, soup.find("body").contents))

print(output)

Output:

$ cat my.md | python md.py
# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...

best,
me

It shouldn't break any other HTML you might have in your .md files (it might change the code formatting a bit, but not its meaning).

Of course, test it thoroughly if you decide to use it.

Edit: try it out online here: https://repl.it/NQgG (input is read from input.md, not stdin)

Answer 3

This awk command should work:

$ awk -v FS="" '{ for(i=1; i<=NF; i++){if($i$(i+1)$(i+2)$(i+3)=="<!--"){i+=3; p=1} else if(p && $i$(i+1)$(i+2)=="-->"){i+=2; p=0} else if(!p){printf "%s", $i} } printf "%s", RS}' file
Dear Contractor X, due to delays in our imports, we would like to ...



best,
me

For better readability and explanation:

awk -v FS="" '
    # An empty FS makes each character its own field (a common extension,
    # e.g. in gawk), so the original spacing is preserved.
    {
        for (i = 1; i <= NF; i++)                    # Iterate through each character
        {
            if ($i$(i+1)$(i+2)$(i+3) == "<!--")      # If 4 characters form a comment start tag
                {                                    # then raise flag p and skip the tag
                    i += 3; p = 1                    # (the loop itself adds 1 more)
                }
            else if (p && $i$(i+1)$(i+2) == "-->")   # If, inside a comment, 3 characters form the close tag
                {                                    # then reset the flag and skip the tag
                    i += 2; p = 0
                }
            else if (!p)                             # Outside a comment: print the character
                printf "%s", $i
        }

        printf "%s", RS                              # End the output line
    }' file
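
For readers more at home in Python, the same flag-based scan can be sketched as follows. This is an illustration of the idea only, not a port that has been tested against the awk command; the function name strip_comments_scan is hypothetical.

```python
def strip_comments_scan(text):
    """Scan character by character, copying only what lies outside comments."""
    out = []
    in_comment = False  # plays the role of the flag p in the awk version
    i = 0
    while i < len(text):
        if not in_comment and text.startswith("<!--", i):
            in_comment = True   # comment start tag: raise the flag, skip 4 chars
            i += 4
        elif in_comment and text.startswith("-->", i):
            in_comment = False  # comment close tag: reset the flag, skip 3 chars
            i += 3
        else:
            if not in_comment:
                out.append(text[i])  # outside a comment: keep the character
            i += 1
    return "".join(out)


if __name__ == "__main__":
    sample = "best,\nme <!-- ... or should i be more formal here? -->\n"
    print(strip_comments_scan(sample), end="")
```

Because the flag persists across line breaks, multi-line comments are handled the same way as inline ones. An unclosed comment swallows the rest of the input.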

Answer 4

If you open the file in Vim, you could do:

:%s/<!--\_.\{-}-->//g

With \_. you allow the regular expression to match any character, including the newline character; \{-} makes the match lazy. Without it you would lose everything from the start of the first comment to the end of the last one.

I tried to use the same expression in sed, but it won't work there: sed does not support Vim's \_. and \{-} syntax.
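
Outside Vim, the same two ingredients (a dot that also matches newlines, and a lazy quantifier) are available in Python's re module. The following is a minimal sketch under that assumption; the names COMMENT, GREEDY and remove_html_comments are hypothetical.

```python
import re

# Lazy ".*?" plus re.DOTALL (so "." also matches newlines):
# the counterpart of \_.\{-} in the Vim command.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Greedy variant, shown only to illustrate the problem described above.
GREEDY = re.compile(r"<!--.*-->", re.DOTALL)


def remove_html_comments(text):
    """Delete every <!-- ... --> span, including multi-line ones."""
    return COMMENT.sub("", text)


if __name__ == "__main__":
    sample = "keep <!-- a --> this <!-- b\nc --> too\n"
    print(repr(remove_html_comments(sample)))  # lazy: text between comments survives
    print(repr(GREEDY.sub("", sample)))        # greedy: text between comments is lost
```

Like the Vim command, this treats the file as plain text, so comment-like text inside Markdown code blocks is removed as well.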

Answer 5

My AWK solution, probably easier to understand than the one from @batMan, at least for high-level devs. The functionality should be about the same.

File remove_html_comments:

#!/usr/bin/awk -f
# Removes common, simple cases of HTML comments.
#
# Example:
# > cat my.html | remove_html_comments > my-filtered.html
#
# This is useful, for example, to pre-parse Markdown before generating
# an HTML or PDF document, to make sure the commented-out content
# does not end up in the final document, not even as a comment
# in source code.
#
# Example:
# > cat my.markdown | remove_html_comments | pandoc -o my-filtered.html
#
# Source: hoijui
# License: CC0 1.0 - https://creativecommons.org/publicdomain/zero/1.0/

BEGIN {
    com_lvl = 0;
}

/<!--/ {
    if (com_lvl == 0) {
        line = $0
        sub(/<!--.*/, "", line)
        printf "%s", line   # use a fixed format so "%" in the text is printed literally
    }
    com_lvl = com_lvl + 1
}

com_lvl == 0

/-->/ {
    if (com_lvl == 1) {
        line = $0
        sub(/.*-->/, "", line)
        print line
    }
    com_lvl = com_lvl - 1;
}

5 answers

Answer 1
3 2017-11-01 12:06:24

Answer 2
1 2017-10-26 12:28:03

Answer 3
1 2017-10-26 13:27:29

Answer 4
0 2017-10-26 11:21:26

Answer 5
0 2017-11-01 10:38:16
