Cleaning HTML using Python
I have the code below, but I am receiving an error. I am trying to extract the text between Tag1 and Tag2 from an HTML file. Without the for loop the code works (for one file), but it fails when looping over a directory.
from bs4 import BeautifulSoup
from urllib import urlopen
import os
import bleach
import re

rootdir = mydirectory
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        url = file
        print url
        raw = urlopen(url).read()
        type(raw)
        Tag1 = raw.find("""<div class="song-text">""")
        Tag2 = raw.rfind("""<div style="text-align:center;padding-bottom:10px;">""")
        Cleaned = raw[Tag1+23:Tag2]
        print Cleaned
Error message:

Traceback (most recent call last):
  File "TestClean.py", line 12, in <module>
    raw = urlopen(url).read()
  File "/usr/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.7/urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 463, in open_file
    return self.open_local_file(url)
  File "/usr/lib/python2.7/urllib.py", line 477, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'paroles-a-beautiful-lie.html'
The error message indicates a missing file. os.walk returns only the name of each file, not the full path to it, so the bare name is looked up relative to the current working directory rather than the directory being walked.

1) Build the full path: path = os.path.join(subdir, file)
2) Read the file with open(path).read() instead of urlopen.
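The two steps can be sketched roughly as follows. This is a minimal Python 3 sketch rather than the asker's Python 2 script: the helper names read_html_files and text_between, the ".html" filter, and the UTF-8 decoding with replacement are assumptions made for illustration; the two marker strings are taken from the question.

```python
import os

# Marker strings copied from the question.
START = '<div class="song-text">'
END = '<div style="text-align:center;padding-bottom:10px;">'


def read_html_files(rootdir):
    """Yield (path, text) for every .html file under rootdir (hypothetical helper)."""
    for subdir, dirs, files in os.walk(rootdir):
        for name in files:
            if not name.endswith(".html"):  # assumption: only .html files matter
                continue
            # os.walk yields bare file names; join with subdir to get a usable path.
            path = os.path.join(subdir, name)
            # Assumption: files are UTF-8; undecodable bytes are replaced.
            with open(path, encoding="utf-8", errors="replace") as handle:
                yield path, handle.read()


def text_between(raw, start_marker, end_marker):
    """Return the text between the first start_marker and the last end_marker."""
    start = raw.find(start_marker)
    end = raw.rfind(end_marker)
    if start == -1 or end == -1 or end < start:
        return None  # a marker is missing, so there is nothing to extract
    return raw[start + len(start_marker):end]
```

Looping over read_html_files(rootdir) and passing each text to text_between(raw, START, END) mirrors the original loop. Using len(start_marker) in place of the hard-coded 23 keeps the slice correct if the opening tag ever changes, and the -1 check avoids silently slicing from the wrong position when a marker is absent.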
The traceback makes it clear that the file 'paroles-a-beautiful-lie.html' cannot be found. I would suggest going through the problem step by step.
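One possible first step is to check where Python is actually looking for the file. The sketch below is only an illustration (describe_lookup is a hypothetical helper, written for Python 3): it compares the bare name that os.walk yields with the joined path.

```python
import os


def describe_lookup(subdir, name):
    """Report whether a file from os.walk is reachable by bare name or by joined path."""
    joined = os.path.join(subdir, name)
    return {
        "cwd": os.getcwd(),                        # where a bare name gets resolved
        "bare_name_exists": os.path.exists(name),  # what urlopen(file) effectively relied on
        "joined_path": joined,
        "joined_path_exists": os.path.exists(joined),
    }
```

If bare_name_exists is False while joined_path_exists is True, the script is running from a different directory than the one containing the file, which matches the IOError in the question.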