How to open multiple .html files AND remove tags with Python (PyCharm)

Question

I'm currently working on a project in Python where I have to write a program that removes all tags from an HTML file (so only the text remains), but I need to do that for about 1000 HTML files.

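This is the code I used for removing the tags:

# Assumption: HTML is the class from the requests-html package; the post does not show the import.
from requests_html import HTML

with open('/inetpub/wwwroot/content/html/eng/0320-0130.htm') as html_file:
    source = html_file.read()
    html = HTML(html=source)

print(html.text)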

And this is the code that opens multiple HTML files:

import glob

path = '/inetpub/wwwroot/content/html/eng/*.htm'
files = glob.glob(path)
for file in files:
    f = open(file, 'r')
    print('%s' % f.readlines())
    f.close()

I don't know how to combine these two snippets, or what code I need to combine them. Any suggestions?

Answer 1

Maybe you got confused by the with context used in the first snippet, but combining the two programs is fairly simple:

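import glob

# Assumption: HTML is the class from the requests-html package; the question does not show the import.
from requests_html import HTML


def get_html_text(html_file):
    # Read the whole file and return its text with the tags removed.
    source = html_file.read()
    html = HTML(html=source)
    return html.text


path = '/inetpub/wwwroot/content/html/eng/*.htm'
files = glob.glob(path)
html_texts = []
for file in files:
    f = open(file, 'r')
    html_texts.append(get_html_text(f))
    f.close()
print(len(html_texts))
# print(html_texts)  # this may print a huge amount of text to your screen if you have many files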

All the with context does in your first program is define entry and exit actions: on entering the block under the with statement, the file is opened, and on leaving it, the file is closed. You don't need it here because the second program, which now calls the first one, already closes the file manually.
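If you would rather keep the with statement, it fits inside the loop just as well, and it has the side benefit of closing the file even when reading or parsing raises an exception. The sketch below shows that variant. To keep it free of third-party packages it swaps in the standard library's html.parser for the tag removal, so it is not the HTML class from the question. The names TextExtractor, strip_tags and texts_from_files are made up for illustration, and the utf-8 encoding is an assumption about your files. Note that this simple extractor also keeps the text inside script and style elements.

```python
import glob
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text outside a tag.
        self.chunks.append(data)


def strip_tags(source):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    return "".join(parser.chunks)


def texts_from_files(pattern):
    """Map each file name matching `pattern` to its tag-free text."""
    texts = {}
    for file_name in sorted(glob.glob(pattern)):
        # `with` closes the file when the block ends, even on an exception.
        # Assumption: the files are UTF-8 encoded; adjust if yours are not.
        with open(file_name, encoding="utf-8") as html_file:
            texts[file_name] = strip_tags(html_file.read())
    return texts
```

With the path from the question, texts_from_files('/inetpub/wwwroot/content/html/eng/*.htm') would return a dictionary with one entry per file, so you can still tell which text came from which file.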

1 answers

solution1
0 2020-03-20 13:12:51
