
Python regex - cleaning markdown html

I'm trying to figure out a good way to sanitize / reformat User Generated Content that is written in the Markdown format. I want to 'correct' improper content (as best as possible).

For now I'm sticking to HTML comments (though I'd appreciate a solution that handles any embedded HTML).

The Markdown format requires any embedded HTML to appear on its own lines.

Bad (input):

one
<!-- two -->
three
four
five <!-- five.point.five -->
six

Good (output):

one

<!-- two -->

three
four
five

<!-- five.point.five -->

six

You can use this:

re.sub(r'\s*(<!--(?:[^-]+|-(?!->))*-->)\s*', r'\n\n\1\n\n', yourstring)
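As a quick check of the substitution above, here is a minimal sketch that runs it on the sample input from the question. The names `COMMENT_RE`, `isolate_comments`, `bad` and `good` are made up for illustration; the pattern is the one from the answer, with the replacement written as a raw string.

```python
import re

# Pattern from the answer above: an HTML comment plus any whitespace around it.
COMMENT_RE = re.compile(r'\s*(<!--(?:[^-]+|-(?!->))*-->)\s*')


def isolate_comments(text):
    """Put each HTML comment on its own line, with a blank line on either side.

    Hypothetical helper name; it only wraps the re.sub call.
    """
    return COMMENT_RE.sub(r'\n\n\1\n\n', text)


# The "Bad (input)" sample from the question.
bad = "one\n<!-- two -->\nthree\nfour\nfive <!-- five.point.five -->\nsix"
good = isolate_comments(bad)
print(good)
```

Because the pattern swallows the whitespace on both sides of each comment, the trailing space after `five` disappears. A comment at the very start or end of the text would pick up leading or trailing blank lines, so calling `.strip()` on the result may be worthwhile.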

To convert the first example into the second, you could replace <!-- with \r\n<!-- and --> with -->\r\n (or whatever newline character, or constant, is equivalent to \r\n; the \r is not really necessary). You could do this with replace(), probably without needing a regex.

You seem to suggest that you are doing this already, so perhaps there is more to your question.
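For comparison, the plain replace() suggestion could be sketched as follows. The function name `split_comments_simple` and its `newline` parameter are assumptions for illustration, not part of the answer.

```python
def split_comments_simple(text, newline="\n"):
    """Hypothetical helper: break the line before '<!--' and after '-->'."""
    return text.replace("<!--", newline + "<!--").replace("-->", "-->" + newline)


# The "Bad (input)" sample from the question.
bad = "one\n<!-- two -->\nthree\nfour\nfive <!-- five.point.five -->\nsix"
result = split_comments_simple(bad)
print(repr(result))
```

On this input the comment after `five` lands on its own line, but the trailing space before it is kept and no blank line is added in front of it. That makes this approach looser than the "Good" output in the question, which is where the regex version may be the better fit.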

