How to remove the HTML-like part of a string?

Question

I've got a string that looks like this:

<b><!--
</b>if (window!= top)
top.location.href=location.href
<b>// -->
</b>
15 Minutes
EMIL (V.O.)
Just do what I do.  Say the same thing I
say.  Don't open your mouth.

I only want the part of the string starting from "15 Minutes". Based on the answer to another question on SO, I tried a regex like this:

import re

def cleanhtml(raw_text):
    cleanr = re.compile('<.*?>.*?')
    cleantext = re.sub(cleanr, '', raw_text)
    return cleantext

But this does not remove the "if (window!= top) top.location.href=location.href" part of the string. What regular expression should I use instead?

PS: I don't have an HTML file to begin with. The original datafile is already in .txt form.

Answer 1

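You can use a library that was built to do exactly this. The JavaScript in your string sits inside an HTML comment (between <!-- and -->), so an HTML-aware tool discards it together with the tags, whereas your regex only removes the tags themselves.

To extract the text portion of the HTML, you can use html2text: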

import html2text

html = '''
<b><!--
</b>if (window!= top)
top.location.href=location.href
<b>// -->
</b>
15 Minutes
EMIL (V.O.)
Just do what I do.  Say the same thing I
say.  Don't open your mouth.
'''

text_maker = html2text.HTML2Text()
text_maker.strong_mark = False  # This prevents **** being added for <b>
text_maker.handle(html)

# "15 Minutes EMIL (V.O.) Just do what I do. Say the same thing I say. Don't open\nyour mouth.\n\n"
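If installing a third-party package is not an option, something similar can be sketched with the standard library's html.parser module. This is only an illustration and not how html2text works internally; the names TextOnly and strip_markup are made up for this example. It relies on the parser reporting tags and comments separately from text, so only the text is collected.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: collects text and ignores tags and comments."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text nodes; tags and <!-- ... --> comments
        # never reach this method.
        self.parts.append(data)


def strip_markup(raw_text):
    """Return the text of raw_text with tags and comments dropped."""
    parser = TextOnly()
    parser.feed(raw_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()
```

Unlike html2text, this sketch does not re-wrap lines or convert anything to Markdown; the original line breaks in the text are kept as they are.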

If you need to target specific divs or classes, you will need something like BeautifulSoup.
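Since the question asked specifically for a regular expression, here is a minimal sketch of that route as well. It assumes the comments are well formed and not nested: it first removes everything between <!-- and -->, then removes any remaining tags. The function name cleanhtml comes from the question; the two patterns are illustrative and will not cope with arbitrary HTML.

```python
import re

# DOTALL lets "." match newlines, so a comment spanning several lines
# is removed as a whole.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Any remaining <...> run is treated as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def cleanhtml(raw_text):
    """Strip HTML comments first, then tags, then surrounding whitespace."""
    without_comments = COMMENT_RE.sub("", raw_text)
    without_tags = TAG_RE.sub("", without_comments)
    return without_tags.strip()
```

The order matters: if the tags were removed first, the script text inside the comment would be left behind. Note that TAG_RE also deletes ordinary text that happens to contain a "<" followed later by a ">", which is one reason a real parser is usually the safer choice.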
