
Python - reading .htm files listed in a txt file

I am using the code below to read some .htm files.

from bs4 import BeautifulSoup
import os

BASEDIR = "C:\\designers"
aa = os.listdir(BASEDIR)

text_file = open(os.path.join(BASEDIR, 'all htm.txt'), "w")

for b in aa:
    if b.endswith('.htm'):
        c = os.path.join(BASEDIR, b)
        text_file.write(c)
        text_file.write('\n')


text_file.close()

list_open = open(os.path.join(BASEDIR, 'all htm.txt'))
read_list = list_open.read()
line_in_list = read_list.split('\n')

for i, ef in enumerate(line_in_list):
    page = open(ef)
    soup = BeautifulSoup(page.read())
    print i
    print soup

However, it only reads the first file and then raises an error:

IOError: [Errno 22] invalid mode ('r') or filename: ''

What went wrong?

Thanks.


'kev' pointed out the problem: there is an unwanted empty line in the txt file.

There are many ways to remove empty lines from a txt file.
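
One such way, as a minimal sketch: drop blank entries while splitting the file contents. The helper name non_empty_lines and the sample paths are made up for illustration, and the snippet is written for Python 3.

```python
def non_empty_lines(text):
    """Return the lines of `text` that contain something other than whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical contents of 'all htm.txt', including a stray blank line.
sample = "C:\\designers\\a.htm\n\nC:\\designers\\b.htm\n"
paths = non_empty_lines(sample)
print(paths)
```

The resulting list can be passed to the loop in place of line_in_list.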

In addition to that, the last part can be changed to:

for i, ef in enumerate(line_in_list):
    if '.htm' in ef:         # or 'len(ef) > 1' etc...
        page = open(ef)
        soup = BeautifulSoup(page.read())
        print i
        print soup

Because you write \n after every path when you create 'all htm.txt' (including the last one), the file ends with a newline. When you split the contents on the newline character, that trailing newline produces an empty string as the last element of line_in_list, and open('') raises the IOError.

Instead, iterate over enumerate(line_in_list[:-1]), which skips the final (empty) element.
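
The effect can be checked in isolation with a small sketch (Python 3 syntax; the two paths are invented sample data). It also shows str.splitlines(), which does not produce the trailing empty string, as an alternative to slicing.

```python
# Hypothetical file contents: two paths, each followed by a newline.
contents = "C:\\designers\\a.htm\nC:\\designers\\b.htm\n"

with_split = contents.split("\n")       # keeps an empty string after the final newline
with_splitlines = contents.splitlines() # no trailing empty element

print(with_split)
print(with_splitlines)
print(with_split[:-1] == with_splitlines)
```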

Alternatively, you could make your code more robust by wrapping the body of the loop in a try/except block and handling or skipping exceptions as they occur. This also protects you against other bad entries in the list:

For example:

for i, ef in enumerate(line_in_list):
    try:
        page = open(ef)
        soup = BeautifulSoup(page.read())
        print i
        print soup
    except IOError:
        print 'ignoring file %s'%ef
    except Exception:
        print 'an unhandled exception occurred for file %s'%ef
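
A variation on the same idea, as a sketch for Python 3 and under a few assumptions: the function name read_listed_files is hypothetical, the parsing step with BeautifulSoup is left out so the snippet has no third-party dependency, and a temporary directory stands in for C:\designers. It uses a with block so each file is closed, and it collects the skipped entries instead of only printing them.

```python
import os
import tempfile


def read_listed_files(paths):
    """Read every path; return (contents, skipped) rather than stopping at the first bad entry."""
    contents, skipped = {}, []
    for path in paths:
        try:
            with open(path) as page:  # the file is closed even if read() fails
                contents[path] = page.read()
        except OSError:  # IOError is an alias of OSError in Python 3
            skipped.append(path)
    return contents, skipped


# Demo with one real file and one empty entry, as produced by the trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "a.htm")
    with open(good, "w") as f:
        f.write("<p>hello</p>")
    contents, skipped = read_listed_files([good, ""])

print(len(contents), skipped)
```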

It would be useful to know on which line of the code the error occurs.

Be careful with the lines b read from the file aa. They end with a newline \n, so the if condition will never be true and you are producing an empty file 'all htm.txt'.

Try

    x = b.strip()
    if x.endswith(".htm"):
        ....

This strips any whitespace (spaces, carriage returns, tabs, newlines) from the beginning and end of b.
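
A quick sketch of what strip() does to such a string (Python 3 syntax; the value of raw is an invented example of a line read from a text file):

```python
raw = "  page.htm\r\n"   # hypothetical line with surrounding whitespace
cleaned = raw.strip()    # removes leading and trailing whitespace only

print(repr(cleaned))
print(raw.endswith(".htm"), cleaned.endswith(".htm"))
```

Whitespace inside the string is left alone, so a file name such as 'all htm.txt' keeps its space.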
