Python regex - cleaning markdown html

Question

I'm trying to figure out a good way to sanitize / reformat User Generated Content that is written in the Markdown format. I want to 'correct' improper content (as best as possible).

For now I'm sticking to HTML comments (though I'd appreciate any embedded HTML).

The Markdown format requires any embedded HTML to appear on its own lines.

Bad (input):

one
<!-- two -->
three
four
five <!-- five.point.five -->
six

Good (output):

one

<!-- two -->

three
four
five

<!-- five.point.five -->

six

Answer 1

You can use this:

re.sub(r'\s*(<!--(?:[^-]+|-(?!->))*-->)\s*', '\\n\\n\\1\\n\\n', yourstring)
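A minimal sketch of how that substitution behaves on the sample from the question. The function name `isolate_comments`, the constant `COMMENT_RE`, and the final `strip('\n')` are assumptions added for illustration; they are not part of the original one-liner.

```python
import re

# Matches one HTML comment together with any whitespace around it.
COMMENT_RE = re.compile(r'\s*(<!--(?:[^-]+|-(?!->))*-->)\s*')

def isolate_comments(text):
    """Put every HTML comment on its own line, surrounded by blank lines."""
    # With a raw replacement string, re.sub itself expands \n and \1.
    result = COMMENT_RE.sub(r'\n\n\1\n\n', text)
    # Assumption: trim the newlines left over when a comment starts or ends the text.
    return result.strip('\n')

sample = "one\n<!-- two -->\nthree\nfour\nfive <!-- five.point.five -->\nsix"
print(isolate_comments(sample))
```

On the question's "bad" input this should print the "good" output, because the whitespace on both sides of each comment is consumed and replaced with exactly one blank line.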

Answer 2

To convert the first output to the second, you would replace --> with -->\r\n, or whatever newline character or constant is equivalent to \r\n. You could do this with replace(), probably without needing a regex. (The \r is not really necessary.)
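As a rough sketch of that suggestion, plain `str.replace()` can append a line break after every comment terminator. The variable names are illustrative, and `\n` is used instead of `\r\n`. Note that this only adds a break after each comment, so on its own it does not insert the blank line before a comment that the question's desired output shows.

```python
text = "five <!-- five.point.five -->\nsix"

# Append a newline after every comment terminator; no regex needed.
converted = text.replace("-->", "-->\n")
print(repr(converted))
```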

You seem to suggest that you are doing this already, so perhaps there is more to your question.

2 answers

solution1
1 ACCEPTED 2013-06-15 01:44:28

solution2
0 2013-06-15 01:39:35
