
Cleaning HTML using Python

I have the code below, but I am receiving an error. I am trying to get the text from an HTML file between Tag1 and Tag2. Without the for loop the code works (for one file), but when looping over a directory it does not.

from bs4 import BeautifulSoup
from urllib import urlopen
import os
import bleach
import re
rootdir = mydirectory
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        url = file
        print url
        raw = urlopen(url).read()
        type(raw)
        Tag1 = raw.find("""<div class="song-text">""")
        Tag2 = raw.rfind("""<div style="text-align:center;padding-bottom:10px;">""")
        Cleaned = raw[Tag1+23:Tag2]
        print Cleaned

Error message:

Traceback (most recent call last):
  File "TestClean.py", line 12, in <module>
    raw = urlopen(url).read()
  File "/usr/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.7/urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 463, in open_file
    return self.open_local_file(url)
  File "/usr/lib/python2.7/urllib.py", line 477, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'paroles-a-beautiful-lie.html'

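The error message says the file cannot be found. os.walk yields only the bare file name, not the full path to it, so the name is resolved against the current working directory instead of the directory being walked.

  1. Build the full path: path = os.path.join(subdir, file)
  2. Read the file with open(path).read() instead of urlopen.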
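The two steps above could be put together roughly as follows. This is only a sketch in Python 3 syntax (the question uses Python 2.7): the function names extract_between and clean_directory are made up for illustration, and the ".html" filter and the UTF-8 decoding are assumptions that are not part of the question.

```python
import os

# Markers copied from the question's code.
START_MARKER = '<div class="song-text">'
END_MARKER = '<div style="text-align:center;padding-bottom:10px;">'


def extract_between(raw, start=START_MARKER, end=END_MARKER):
    """Return the text between the first `start` and the last `end`, or None."""
    begin = raw.find(start)
    stop = raw.rfind(end)
    # find/rfind return -1 when the marker is missing.
    if begin == -1 or stop == -1 or stop < begin + len(start):
        return None
    # len(start) replaces the hard-coded 23 from the question.
    return raw[begin + len(start):stop]


def clean_directory(rootdir):
    """Map the path of each .html file under rootdir to its extracted text."""
    results = {}
    for subdir, dirs, files in os.walk(rootdir):
        for name in files:
            if not name.endswith(".html"):  # assumption: only .html files matter
                continue
            # os.walk gives bare names; join with subdir to get a usable path.
            path = os.path.join(subdir, name)
            # Assumption: the files are UTF-8; undecodable bytes are replaced.
            with open(path, encoding="utf-8", errors="replace") as handle:
                results[path] = extract_between(handle.read())
    return results
```

Calling clean_directory(rootdir) with the directory from the question would then return a dict keyed by full path, with None for any file in which either marker is missing.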

The traceback makes it clear that Python cannot find the file 'paroles-a-beautiful-lie.html'. I suggest going step by step.

  1. Comment out the code below 'print url'.
  2. Check whether the printed url is the path you expect.
  3. Then proceed with the next step, the find logic.
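One possible way to carry out that check is sketched below, in Python 3 syntax. The helper name report_paths is hypothetical; it only lists what os.walk produces and tests whether each candidate path exists, without reading any file.

```python
import os


def report_paths(rootdir):
    """Return (bare_name, joined_path, bare_exists, joined_exists) per file."""
    rows = []
    for subdir, dirs, files in os.walk(rootdir):
        for name in files:
            joined = os.path.join(subdir, name)
            # The bare name is resolved against the current working directory,
            # which is why it can be missing even though the file exists.
            rows.append((name, joined, os.path.exists(name), os.path.exists(joined)))
    return rows
```

Printing each row for the directory in question should show whether the bare name or the joined path is the one that actually exists from where the script is run.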

