I've got a string that looks like this:
<b><!--
</b>if (window!= top)
top.location.href=location.href
<b>// -->
</b>
15 Minutes
EMIL (V.O.)
Just do what I do. Say the same thing I
say. Don't open your mouth.
I only want the part of the string starting from "15 Minutes". Based on the answer to another question on SO, I've tried using a regex like this:
import re

def cleanhtml(raw_text):
    # Intended to strip anything that looks like an HTML tag
    cleanr = re.compile('<.*?>.*?')
    cleantext = re.sub(cleanr, '', raw_text)
    return cleantext
But this does not remove the "if (window!= top) top.location.href=location.href" part of the string. What should I use for the regular expression instead?
PS: I don't have an HTML file to begin with. The original data file is already in .txt form.
You can use libraries that are already built to do exactly this. To convert the text portion of the HTML, you can use html2text:
import html2text
html = '''
<b><!--
</b>if (window!= top)
top.location.href=location.href
<b>// -->
</b>
15 Minutes
EMIL (V.O.)
Just do what I do. Say the same thing I
say. Don't open your mouth.
'''
text_maker = html2text.HTML2Text()
text_maker.strong_mark = False  # This prevents "****" being added for <b>
text_maker.handle(html)
# returns: "15 Minutes EMIL (V.O.) Just do what I do. Say the same thing I say. Don't open\nyour mouth.\n\n"
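On the regular expression itself: the pattern in the question only removes individual `<...>` tags that sit on a single line. It has no notion of the `<!-- ... -->` comment, which spans several lines and contains the script text, so that text is left behind. If adding a dependency is not an option, a minimal sketch along the following lines might work for input shaped like the sample. The names `COMMENT_RE`, `TAG_RE` and `clean_text` are made up for illustration, and regexes remain fragile for HTML in general.

```python
import re

# Matches a whole HTML comment; re.DOTALL lets "." cross line breaks.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)
# Matches a single tag such as <b> or </b>.
TAG_RE = re.compile(r'<[^>]+>')

def clean_text(raw_text):
    # Remove comments first, so the tags and script inside them go too.
    without_comments = COMMENT_RE.sub('', raw_text)
    # Then remove whatever tags are left over.
    without_tags = TAG_RE.sub('', without_comments)
    return without_tags.strip()

raw = """<b><!--
</b>if (window!= top)
top.location.href=location.href
<b>// -->
</b>
15 Minutes
EMIL (V.O.)
Just do what I do. Say the same thing I
say. Don't open your mouth.
"""

cleaned = clean_text(raw)
print(cleaned)
```

Unlike html2text, this keeps the original line breaks, because it only deletes text and never re-wraps it.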
If you need to target specific divs or classes, you will need to use something like BeautifulSoup.
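As a rough sketch of what that could look like, the snippet below pulls the text out of one particular div. The markup and the class names `header` and `script` are invented for the example, since the string in the question has no divs or classes; it assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the div with class "script" is wanted.
html = """
<div class="header">Site navigation</div>
<div class="script">
<b>15 Minutes</b>
EMIL (V.O.)
Just do what I do.
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select the div by class; "class_" avoids clashing with the Python keyword.
script_div = soup.find("div", class_="script")

# get_text() returns the text nodes only, with the tags dropped.
text = script_div.get_text().strip()
print(text)
```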