Remove HTML comments from Markdown file

Question

This would come in handy when converting from Markdown to HTML, for example, if one needs to prevent comments from appearing in the final HTML source.

Example input my.md:

# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...
<!--
    ... due to a general shortage in the Y market
    TODO make sure to verify this before we include it here
-->
best,
me <!-- ... or should i be more formal here? -->

Example output my-filtered.md:

# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...

best,
me

On Linux, I would do something like this:

cat my.md | remove_html_comments > my-filtered.md

I could write an AWK script that handles some common cases, but as I understand it, neither AWK nor the other common tools for simple text manipulation (like sed) are really up to this job. One would need to use an HTML parser.

How can I write a proper remove_html_comments script, and with which tools?

Answer 1

I see from your comment that you mostly use Pandoc.

Pandoc version 2.0, released October 29, 2017, adds a new option, --strip-comments. The related issue provides some context for this change.

Upgrading to the latest version and adding --strip-comments to your command should remove HTML comments as part of the conversion process.
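
If the conversion is driven from a script, the same flag can be passed from Python. This is only a minimal sketch, assuming pandoc 2.0 or later is on the PATH; the helper names pandoc_command and convert are made up for illustration and are not part of Pandoc.

```python
import subprocess


def pandoc_command(src, dst):
    """Build a pandoc command line that drops HTML comments during conversion."""
    # --strip-comments is available in pandoc 2.0 and later
    return ["pandoc", "--strip-comments", src, "-o", dst]


def convert(src, dst):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(src, dst), check=True)
```

Calling convert("my.md", "my.html") would then run pandoc on the example file from the question, with the output format inferred from the .html extension.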

Answer 2

It might be a bit counter-intuitive, but I would use an HTML parser.

Example with Python and BeautifulSoup:

import sys
from bs4 import BeautifulSoup, Comment

md_input = sys.stdin.read()

soup = BeautifulSoup(md_input, "html5lib")

for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()

# bs4 wraps the text in <html><head></head><body>…</body></html>,
# so we need to extract it:

output = "".join(map(str, soup.find("body").contents))

print(output)

Output:

$ cat my.md | python md.py 
# Contract Cancellation

Dear Contractor X, due to delays in our imports, we would like to ...

best,
me

It shouldn't break any other HTML you might have in your .md files (it might change the code formatting a bit, but not its meaning).

Of course, test it thoroughly if you decide to use it.

Edit – Try it out online here: https://repl.it/NQgG (input is read from input.md, not stdin)

Answer 3

This awk command should work:

$ awk -v FS="" '{ for(i=1; i<=NF; i++){if(!p && $i$(i+1)$(i+2)$(i+3)=="<!--"){i+=3; p=1} else if(p && $i$(i+1)$(i+2)=="-->"){i+=2; p=0} else if(!p){printf "%s", $i} } printf RS}' file
Dear Contractor X, due to delays in our imports, we would like to ...



best,
me

For better readability and explanation:

awk -v FS=""                                 # Empty field separator: each character becomes its own field (relies on an awk such as gawk that supports this), so the original spacing is preserved
    '{ 
        for(i=1; i<=NF; i++)                 # Iterate through each character
        {
            if(!p && $i$(i+1)$(i+2)$(i+3)=="<!--") # If these 4 characters form a comment start tag
                {                            # then raise flag p and advance i by 3 (the loop adds 1 more)
                    i+=3; p=1
                } 
            else if(p && $i$(i+1)$(i+2)=="-->")    # If these 3 characters form a comment close tag
                {                            # then reset the flag and advance i by 2 (the loop adds 1 more)
                    i+=2; p=0
                }
            else if(!p)                      # Outside a comment: print the character as data, not as a format string
                 printf "%s", $i

        } 

        printf RS                            # End every input line with a newline, even inside a comment

        }' file
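
For readers more at home in Python, here is a rough port of the same idea: walk the text one position at a time and keep a flag (the counterpart of p above) that says whether we are inside a comment. This is a sketch rather than a drop-in replacement; the name strip_comments is made up, and unlike the awk version it works on the whole text at once, so newlines inside a comment are dropped along with the comment.

```python
def strip_comments(text):
    """Remove <!-- ... --> spans by scanning the text one position at a time."""
    out = []
    in_comment = False  # plays the role of the awk flag p
    i = 0
    while i < len(text):
        if not in_comment and text.startswith("<!--", i):
            in_comment = True
            i += 4  # skip the opening delimiter
        elif in_comment and text.startswith("-->", i):
            in_comment = False
            i += 3  # skip the closing delimiter
        elif in_comment:
            i += 1  # drop a character of comment content
        else:
            out.append(text[i])  # ordinary character: keep it
            i += 1
    return "".join(out)
```

An unterminated comment swallows everything up to the end of the input, which matches how the flag-based awk loop behaves.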

Answer 4

If you open the file in Vim, you could run:

:%s/<!--\_.\{-}-->//g

With \_. you allow the regular expression to match any character, including the newline character; \{-} makes the match lazy, otherwise you would lose all content between the first and the last comment.

I tried to use the same expression with sed, but it won't work: sed has no lazy quantifier and processes its input line by line.
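
Outside of Vim, the same lazy, newline-crossing match can be written with Python's re module. A minimal sketch; the function name remove_html_comments is borrowed from the question and the constant name is made up.

```python
import re

# ".*?" is non-greedy, so each match stops at the first "-->";
# re.DOTALL lets "." match newlines, like \_. in the Vim pattern.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def remove_html_comments(text):
    """Return text with every <!-- ... --> span removed."""
    return COMMENT_RE.sub("", text)
```

To use it as a filter, read all of stdin into a string, pass it through remove_html_comments, and write the result to stdout. Like the Vim command, this is plain pattern matching, so it will also remove comment-like text inside code samples.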

Answer 5

My AWK solution is probably easier to understand than the one by @batMan, at least for high-level devs. The functionality should be about the same.

File remove_html_comments:

#!/usr/bin/awk -f
# Removes common, simple cases of HTML comments.
#
# Example:
# > cat my.html | remove_html_comments > my-filtered.html
#
# This is useful, for example, to pre-process Markdown before generating
# an HTML or PDF document, to make sure the commented-out content
# does not end up in the final document, not even as a comment
# in the source code.
#
# Example:
# > cat my.markdown | remove_html_comments | pandoc -o my-filtered.html
#
# Source: hoijui
# License: CC0 1.0 - https://creativecommons.org/publicdomain/zero/1.0/

BEGIN {
    com_lvl = 0;
}

/<!--/ {
    if (com_lvl == 0) {
        line = $0
        sub(/<!--.*/, "", line)
        printf "%s", line
    }
    com_lvl = com_lvl + 1
}

com_lvl == 0

/-->/ {
    if (com_lvl == 1) {
        line = $0
        sub(/.*-->/, "", line)
        print line
    }
    com_lvl = com_lvl - 1;
}

5 answers

solution1
3 2017-11-01 12:06:24

solution2
1 2017-10-26 12:28:03

solution3
1 2017-10-26 13:27:29

solution4
0 2017-10-26 11:21:26

solution5
0 2017-11-01 10:38:16
