Cleaning HTML with Python

Question

I have the code below, but I am receiving an error. I am trying to get the text between Tag1 and Tag2 from an HTML file. Without the for loop the code works (for one file); however, when looping over a directory it does not.

from bs4 import BeautifulSoup
from urllib import urlopen
import os
import bleach
import re
rootdir = mydirectory
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        url = file
        print url
        raw = urlopen(url).read()
        type(raw)
        Tag1 = raw.find("""<div class="song-text">""")
        Tag2 = raw.rfind("""<div style="text-align:center;padding-bottom:10px;">""")
        Cleaned = raw[Tag1+23:Tag2]
        print Cleaned

Error message:

Traceback (most recent call last):
  File "TestClean.py", line 12, in <module>
    raw = urlopen(url).read()
  File "/usr/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/usr/lib/python2.7/urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 463, in open_file
    return self.open_local_file(url)
  File "/usr/lib/python2.7/urllib.py", line 477, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'paroles-a-beautiful-lie.html'

Answer 1

The error message indicates that the file cannot be found. os.walk yields only the name of each file, not the full path to it, so the bare name is resolved relative to the current working directory rather than the directory being walked.
1) Build the path: path = os.path.join(subdir, file)
2) Read the file with open(path).read() instead of urlopen
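The two steps in this answer can be sketched as follows. This is a minimal sketch, not the asker's final code: it is written for Python 3 (the question uses Python 2.7), the function names extract_between and clean_directory are hypothetical, and the ".html" filter and UTF-8 decoding are assumptions about the files. The two markers are copied from the question; len(start) replaces the hard-coded offset of 23.

```python
import os

# Markers copied from the question's code.
START_MARKER = '<div class="song-text">'
END_MARKER = '<div style="text-align:center;padding-bottom:10px;">'


def extract_between(raw, start=START_MARKER, end=END_MARKER):
    """Return the text between the first start marker and the last end
    marker, or None if either marker is missing (hypothetical helper)."""
    begin = raw.find(start)
    stop = raw.rfind(end)
    if begin == -1 or stop == -1 or stop < begin:
        return None
    return raw[begin + len(start):stop]


def clean_directory(rootdir):
    """Yield (path, cleaned_text) for every .html file under rootdir."""
    for subdir, dirs, files in os.walk(rootdir):
        for name in files:
            # Assumption: only files ending in .html are of interest.
            if not name.endswith(".html"):
                continue
            # Step 1: join the directory being walked with the bare name.
            path = os.path.join(subdir, name)
            # Step 2: read the local file with open() instead of urlopen().
            # Assumption: the files are UTF-8; undecodable bytes are replaced.
            with open(path, encoding="utf-8", errors="replace") as handle:
                raw = handle.read()
            yield path, extract_between(raw)
```

Calling `for path, text in clean_directory(rootdir): print(path, text)` would then print the extracted text for each file, with None for files where a marker is missing.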

Answer 2

It is clear from the traceback that Python cannot find the file 'paroles-a-beautiful-lie.html'. I would suggest going step by step.

Comment out the code below 'print url'.
Check whether you are getting the proper url.
Then proceed with the next step, the finding process.
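One way to carry out this check is sketched below, assuming Python 3. The helper name report_paths is hypothetical. For each file it records the bare name that os.walk yields, the joined path, and whether each of them resolves to an existing file from the current working directory.

```python
import os


def report_paths(rootdir):
    """Return one (name, full_path, name_exists, full_path_exists) tuple
    per file found under rootdir (hypothetical debugging helper)."""
    rows = []
    for subdir, dirs, files in os.walk(rootdir):
        for name in files:
            full_path = os.path.join(subdir, name)
            rows.append((
                name,
                full_path,
                os.path.isfile(name),       # bare name, relative to the cwd
                os.path.isfile(full_path),  # name joined with its directory
            ))
    return rows

# Example use (rootdir is whatever directory is being walked):
# for row in report_paths(rootdir):
#     print(row)
```

If the third field is False while the fourth is True, the bare file name is the problem, which matches the IOError in the question.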

Answer 1
0 2014-05-23 07:21:14

Answer 2
0 2014-05-23 07:21:57
