
Cleaning HTML using Python

I have the code below, but I am receiving an error. I am trying to get the text between Tag1 and Tag2 from an HTML file. Without the for loop the code works (for a single file), but it fails when I loop over a directory.

from bs4 import BeautifulSoup
from urllib import urlopen
import os
import bleach
import re
rootdir = mydirectory
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        url = file
        print url
        raw = urlopen(url).read()
        type(raw)
        Tag1 = raw.find("""<div class="song-text">""")
        Tag2 = raw.rfind("""<div style="text-align:center;padding-bottom:10px;">""")
        Cleaned = raw[Tag1+23:Tag2]
        print Cleaned

Error message:

Traceback (most recent call last):
  File "TestClean.py", line 12, in <module>
    raw = urlopen(url).read()
  File "/usr/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.7/urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 463, in open_file
    return self.open_local_file(url)
  File "/usr/lib/python2.7/urllib.py", line 477, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'paroles-a-beautiful-lie.html'

The error message says the file could not be found. os.walk returns only the name of each file, not the full path to it, so the bare name is looked up relative to the current working directory rather than the directory being walked.

  1. Build the full path: path = os.path.join(subdir, file)
  2. Read the file with open(path).read() instead of urlopen.
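The two steps above can be sketched as follows. This is only an illustration, written for Python 3 rather than the Python 2 of the question. It assumes the files are UTF-8 text and that only names ending in .html matter, neither of which the post states. The helper names extract_between and clean_directory are made up for this example; the two markers are the ones from the question.

```python
import os

# Markers taken from the question's code.
START_MARKER = '<div class="song-text">'
END_MARKER = '<div style="text-align:center;padding-bottom:10px;">'


def extract_between(raw, start_marker=START_MARKER, end_marker=END_MARKER):
    """Return the text between the first start_marker and the last end_marker.

    Returns None when either marker is missing or they appear out of order.
    """
    start = raw.find(start_marker)
    end = raw.rfind(end_marker)
    if start == -1 or end == -1 or end < start:
        return None
    # len(start_marker) replaces the hard-coded offset of 23.
    return raw[start + len(start_marker):end]


def clean_directory(rootdir):
    """Map the full path of every .html file under rootdir to its extracted text."""
    results = {}
    for subdir, dirs, files in os.walk(rootdir):
        for name in files:
            if not name.endswith(".html"):  # assumption: only HTML files matter
                continue
            # os.walk yields bare names, so join with the directory it is visiting.
            path = os.path.join(subdir, name)
            # Assumption: UTF-8 content; undecodable bytes are replaced.
            with open(path, encoding="utf-8", errors="replace") as handle:
                results[path] = extract_between(handle.read())
    return results
```

Calling clean_directory("some/folder") (a placeholder path) returns a dictionary keyed by full path, so a file whose markers were not found shows up with a value of None instead of raising an error.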

It is clear from the traceback that Python cannot find the file 'paroles-a-beautiful-lie.html'. I suggest going step by step.

  1. Comment out the code below 'print url'.
  2. Check whether you are getting the proper URL.
  3. Then proceed with the next step: the find process.
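One way to carry out the check in step 2 is to list what the loop actually produces before trying to open anything. The sketch below is an assumption about how that check could look, not code from the thread. It is written for Python 3, and report_paths is a hypothetical helper name.

```python
import os


def report_paths(rootdir):
    """Return one (name, full_path, name_exists, full_path_exists) tuple per file.

    name_exists is checked relative to the current working directory, which is
    where a bare file name gets resolved.
    """
    rows = []
    for subdir, dirs, files in os.walk(rootdir):
        for name in files:
            full_path = os.path.join(subdir, name)
            rows.append(
                (name, full_path, os.path.exists(name), os.path.exists(full_path))
            )
    return rows
```

If the third field is False while the fourth is True, the loop is passing a bare file name where a full path is needed.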
