Remove HTML comments from Markdown file

Question

This would come in handy when converting from Markdown to HTML, for example when the comments must not show up in the final HTML source.

Example input my.md:

# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...
<!--
    ... due to a general shortage in the Y market
    TODO make sure to verify this before we include it here
-->
best,
me <!-- ... or should i be more formal here? -->

Example output my-filtered.md:

# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...

best,
me

On Linux, I would do something like this:

cat my.md | remove_html_comments > my-filtered.md

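I could also write an AWK script that handles a few common cases, but as far as I understand, neither AWK nor any of the other usual tools for simple text manipulation (such as sed) is really up to the job. One would need to use an HTML parser.

How do I write a proper remove_html_comments script, and with which tools?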

Answer 1

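I see from your comments that you mostly use Pandoc.

Pandoc version 2.0, released on October 29, 2017, adds a new option, --strip-comments. The related issue provides some background on this change.

Upgrading to the latest version and adding --strip-comments to your command should remove HTML comments as part of the conversion process.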
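If you drive the conversion from a script, that could look roughly like the sketch below. It assumes Pandoc 2.0 or later is on your PATH; the helper names `pandoc_strip_comments_command` and `convert_without_comments` are made up for illustration and are not part of Pandoc.

```python
import shutil
import subprocess


def pandoc_strip_comments_command(src, dst):
    """Build the pandoc command line (hypothetical helper, not a Pandoc API)."""
    # --strip-comments is the option added in Pandoc 2.0.
    return ["pandoc", "--strip-comments", src, "-o", dst]


def convert_without_comments(src, dst):
    """Run pandoc so that HTML comments in src do not reach dst."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    subprocess.run(pandoc_strip_comments_command(src, dst), check=True)
```

Calling `convert_without_comments("my.md", "my.html")` would then run the equivalent of `pandoc --strip-comments my.md -o my.html`.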

Answer 2

It might be a bit counter-intuitive, but I would use an HTML parser.

Example with Python and BeautifulSoup:

import sys
from bs4 import BeautifulSoup, Comment

md_input = sys.stdin.read()

soup = BeautifulSoup(md_input, "html5lib")

# `string=` is the current name of this argument; older bs4 releases call it `text=`.
for element in soup(string=lambda text: isinstance(text, Comment)):
    element.extract()

# bs4 wraps the text in <html><head></head><body>…</body></html>,
# so we need to extract it:

output = "".join(map(str, soup.find("body").contents))

print(output)

Output:

$ cat my.md | python md.py 
# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...

best,
me

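It should not break any other HTML you may have in your .md files (it might change the code formatting, but not its meaning).

Of course, test it thoroughly if you decide to use it.

Edit - try it online: https://repl.it/NQgG (reads input from input.md instead of stdin)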

Answer 3

This awk should work:

$ awk -v FS="" '{ for(i=1; i<=NF; i++){if($i$(i+1)$(i+2)$(i+3)=="<!--"){i+=3; p=1} else if(!p){printf "%s", $i} else if($i$(i+1)$(i+2)=="-->") {i+=2; p=0;} } printf RS}' file
Dear Contractor X, due to delays in our imports, we would like to ...



best,
me

For better readability and explanation:

# An empty FS makes each character its own field (as gawk does), which also
# keeps the original formatting intact.
awk -v FS="" '{
    for(i=1; i<=NF; i++)                     # Iterate through each character
    {
        if($i$(i+1)$(i+2)$(i+3)=="<!--")     # If 4 chars in a row form a comment start tag
        {                                    # then raise flag p and skip the rest of the tag
            i+=3; p=1                        # (the i++ of the loop adds the 4th step)
        }
        else if(!p)                          # If p==0 we are outside a comment: print the character
            printf "%s", $i                  # ("%s" keeps a literal % in the text from being read as a format)
        else if($i$(i+1)$(i+2)=="-->")       # If 3 chars in a row form a comment close tag
        {                                    # then reset the flag and skip the rest of the tag
            i+=2; p=0
        }
    }

    printf RS                                # Finish every input line with a newline
}' file
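
For readers who find awk hard to follow, here is the same flag-based scan as a rough Python sketch. The function name `strip_comments_scan` is made up for illustration. One difference from the awk version is assumed on purpose: it works on the whole text at once, so newlines inside a comment are dropped too, instead of one newline being printed per input line.

```python
def strip_comments_scan(text):
    """Drop <!-- ... --> spans using a single in_comment flag."""
    out = []
    in_comment = False
    i = 0
    while i < len(text):
        if not in_comment and text.startswith("<!--", i):
            in_comment = True   # comment opens: skip the 4-char start tag
            i += 4
        elif in_comment and text.startswith("-->", i):
            in_comment = False  # comment closes: skip the 3-char end tag
            i += 3
        elif in_comment:
            i += 1              # inside a comment: discard the character
        else:
            out.append(text[i])  # outside a comment: keep the character
            i += 1
    return "".join(out)
```

An unterminated comment swallows everything up to the end of the input, just as the flag p stays raised in the awk script.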

Answer 4

If you open it with vim, you can do:

:%s/<!--\_.\{-}-->//g

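With \_. you allow the regular expression to match all characters, even the newline character; the \{-} makes it lazy, otherwise you would lose everything from the first comment to the last one.

I tried to use the same expression with sed, but it would not work.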
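
The lazy, newline-spanning match described above can also be sketched with Python's `re` module, which reads the whole input at once. This is only an illustration of the same pattern, not a full HTML parser; the names `remove_comments_lazy` and `remove_comments_greedy` are made up, and the greedy variant is there only to show what goes wrong without laziness.

```python
import re

# re.DOTALL lets "." match newlines (like \_. in vim);
# "*?" is the lazy quantifier (like \{-} in vim).
LAZY_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
GREEDY_COMMENT = re.compile(r"<!--.*-->", re.DOTALL)


def remove_comments_lazy(text):
    """Remove each comment separately."""
    return LAZY_COMMENT.sub("", text)


def remove_comments_greedy(text):
    """Counter-example: removes everything from the first <!-- to the last -->."""
    return GREEDY_COMMENT.sub("", text)


sample = "a <!-- one --> b <!-- two\nlines --> c"
print(repr(remove_comments_lazy(sample)))    # 'a  b  c'
print(repr(remove_comments_greedy(sample)))  # 'a  c'
```

The greedy version loses the " b " between the two comments, which is exactly the problem the lazy quantifier avoids.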

Answer 5

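My AWK solution, which is probably easier to understand than the one by @batMan, at least for high-level developers. The functionality should be roughly the same.

File remove_html_comments: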

#!/usr/bin/awk -f
# Removes common, simple cases of HTML comments.
#
# Example:
# > cat my.html | remove_html_comments > my-filtered.html
#
# This is useful, for example, to pre-parse Markdown before generating
# an HTML or PDF document, to make sure the commented out content
# does not end up in the final document, not even as a comment
# in source code.
#
# Example:
# > cat my.markdown | remove_html_comments | pandoc -o my-filtered.html
#
# Source: hoijui
# License: CC0 1.0 - https://creativecommons.org/publicdomain/zero/1.0/

BEGIN {
    com_lvl = 0;
}

/<!--/ {
    if (com_lvl == 0) {
        line = $0
        sub(/<!--.*/, "", line)
        printf "%s", line
    }
    com_lvl = com_lvl + 1
}

com_lvl == 0

/-->/ {
    if (com_lvl == 1) {
        line = $0
        sub(/.*-->/, "", line)
        print line
    }
    com_lvl = com_lvl - 1;
}

5 answers

Answer 1
3 2017-11-01 12:06:24

Answer 2
1 2017-10-26 12:28:03

Answer 3
1 2017-10-26 13:27:29

Answer 4
0 2017-10-26 11:21:26

Answer 5
0 2017-11-01 10:38:16