
How to remove the HTML-like part of a string?

I've got a string that looks like this:

<b><!--
</b>if (window!= top)
top.location.href=location.href
<b>// -->
</b>
15 Minutes
EMIL (V.O.)
Just do what I do.  Say the same thing I
say.  Don't open your mouth.

I only want the part of the string starting from "15 Minutes". Based on an answer to another question on SO, I tried a regex like this:

import re

def cleanhtml(raw_text):
    cleanr = re.compile('<.*?>.*?')
    cleantext = re.sub(cleanr, '', raw_text)
    return cleantext

But this does not remove the "if (window!= top) top.location.href=location.href" part of the string. What regular expression should I use instead?

PS: I don't have an HTML file to begin with. The original datafile is already in .txt form.

You can use an existing library built for exactly this.

To extract the text portion of the HTML, you can use html2text:

import html2text

html = '''
<b><!--
</b>if (window!= top)
top.location.href=location.href
<b>// -->
</b>
15 Minutes
EMIL (V.O.)
Just do what I do.  Say the same thing I
say.  Don't open your mouth.
'''

text_maker = html2text.HTML2Text()
text_maker.strong_mark = False  # prevents ** being added for <b>
text = text_maker.handle(html)
print(repr(text))

# "15 Minutes EMIL (V.O.) Just do what I do. Say the same thing I say. Don't open\nyour mouth.\n\n"
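
If you would rather stay with the standard library, here is a minimal regex sketch for the string in the question. The original pattern fails because the script text sits inside an HTML comment that spans several lines: '.' does not match newlines by default, and the trailing '.*?' matches nothing, so only the tags on a single line are removed. The helper name strip_markup and the two patterns are my own assumptions, and this is not a general HTML parser.

```python
import re

# Hypothetical helper; a sketch for this kind of input, not a general HTML parser.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)  # DOTALL lets '.' cross newlines
TAG_RE = re.compile(r'<[^>]+>')                    # any remaining single tag

def strip_markup(raw_text):
    """Drop HTML comments (with everything inside them), then leftover tags."""
    without_comments = COMMENT_RE.sub('', raw_text)
    without_tags = TAG_RE.sub('', without_comments)
    return without_tags.strip()

raw = """<b><!--
</b>if (window!= top)
top.location.href=location.href
<b>// -->
</b>
15 Minutes
EMIL (V.O.)
Just do what I do.  Say the same thing I
say.  Don't open your mouth."""

cleaned = strip_markup(raw)
print(cleaned)
```

Removing the comment first matters: if the tags were stripped first, the "<!--" and "// -->" markers and the script lines between them would be left behind.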

If you need to target specific divs or classes, you will need something like BeautifulSoup.
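
A rough sketch of what that could look like, assuming the bs4 package is installed. The markup and the "script" class name below are made up for illustration, since the data in the question has no divs or classes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the "nav" and "script" classes are invented for this example.
html = """
<div class="nav">Home | About</div>
<div class="script">
  <p>15 Minutes</p>
  <p>EMIL (V.O.)</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element, or None if nothing matches.
script_div = soup.find("div", class_="script")

# get_text() joins the text nodes; strip=True trims each one and skips empty ones.
text = script_div.get_text(separator="\n", strip=True)
print(text)
```

Only the text inside the selected div is returned, so the navigation div is ignored.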

