I'm currently working on a Python project where I have to write a program that removes all tags from an HTML file (so that only the text remains), and I need to do this for about 1000 HTML files.
This is the code I used for removing the tags:
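# Assumption: HTML is the class from the requests_html package; the post does not show the import.
from requests_html import HTML

with open('/inetpub/wwwroot/content/html/eng/0320-0130.htm') as html_file:
    source = html_file.read()
html = HTML(html=source)
print(html.text)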
This is the code that opens multiple HTML files:
import glob

path = '/inetpub/wwwroot/content/html/eng/*.htm'
files = glob.glob(path)
for file in files:
    f = open(file, 'r')
    print('%s' % f.readlines())
    f.close()
I don't know how to combine these two pieces of code, or what additional code I would need to do so. Any suggestions?
Maybe you got confused by the with context used in the first one, but combining those two programs is fairly simple:
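import glob
# Assumption: HTML is the class from the requests_html package, as in the question's snippet.
from requests_html import HTML

def get_html_text(html_file):
    # Read an already opened file and return its text with the tags removed.
    source = html_file.read()
    html = HTML(html=source)
    return html.text

path = '/inetpub/wwwroot/content/html/eng/*.htm'
files = glob.glob(path)
html_texts = []
for file in files:
    f = open(file, 'r')
    html_texts.append(get_html_text(f))
    f.close()
print(len(html_texts))
# print(html_texts)  # this may print a huge amount of text if you have many files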
All the with context did in your first program was define an entry and an exit action: on entering the block under with, the file is opened, and on leaving it, the file is closed. You don't need it here, since you already close the file manually in the loop taken from your second program, which now calls the code from the first one.
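If you would rather keep the with context inside the loop, a minimal sketch could look like the following. To stay self-contained it swaps the HTML class from the question for the standard library's html.parser, so the extracted text may differ slightly from html.text. The names TextExtractor, strip_tags and texts_from_glob are hypothetical, and the utf-8 encoding with errors="replace" is an assumption about your files.

```python
import glob
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags (a stand-in for HTML(html=...).text)."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only visible, non-blank text.
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def strip_tags(source):
    """Return the text content of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(source)
    parser.close()  # flush any buffered trailing text
    return "\n".join(parser.parts)


def texts_from_glob(pattern):
    """Map each file matching the pattern to its tag-free text."""
    texts = {}
    for name in sorted(glob.glob(pattern)):
        # The with block closes the file on exit, even if reading raises.
        with open(name, encoding="utf-8", errors="replace") as html_file:
            texts[name] = strip_tags(html_file.read())
    return texts
```

With the path from the question, this would be called as texts_from_glob('/inetpub/wwwroot/content/html/eng/*.htm'). Keeping the result in a dict keyed by file name makes it easy to see which text came from which file.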