A simple Python question about urlopen

Question

I am trying to write a program that removes all the tags from an HTML document, so I wrote the following.

import urllib
loc_left = 0
while loc_left != -1 :
    html_code = urllib.urlopen("http://www.python.org/").read()

    loc_left = html_code.find('<')
    loc_right = html_code.find('>')

    str_in_braket = html_code[loc_left, loc_right + 1]

    html_code.replace(str_in_braket, "")

But it shows the following error message:

lee@Lee-Computer:~/pyt$ python html_braket.py
Traceback (most recent call last):
  File "html_braket.py", line 1, in <module>
    import urllib
  File "/usr/lib/python2.6/urllib.py", line 25, in <module>
    import string
  File "/home/lee/pyt/string.py", line 4, in <module>
    html_code = urllib.urlopen("http://www.python.org/").read()
AttributeError: 'module' object has no attribute 'urlopen'

Interestingly, if I type the same code into the Python interpreter, the error above does not appear.

Answer 1

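You have named a script string.py. The urllib module imports it, thinking it is the string module from the stdlib. Your script then runs while urllib is still only partially defined, so it accesses an attribute (urlopen) that does not exist yet. Name your script something else.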
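One way to confirm this kind of shadowing is to check which file Python actually loaded the module from. The following is a minimal sketch, assuming Python 3; the helper names `module_location` and `is_loaded_from` are made up for illustration.

```python
import os
import string


def module_location(module):
    """Return the resolved path of the file a module was loaded from."""
    return os.path.realpath(module.__file__)


def is_loaded_from(module, directory):
    """Return True if the module's file sits directly inside `directory`."""
    return os.path.dirname(module_location(module)) == os.path.realpath(directory)


# If this prints a path inside your own project (for example ~/pyt/string.py),
# a local file is shadowing the standard library module.
print(module_location(string))
print(is_loaded_from(string, os.getcwd()))
```

If the second line prints True, renaming the local file (and deleting any leftover string.pyc next to it) should let the real module be imported again.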

Answer 2

Step one is to download the document so that you have it in a string:

import urllib
html_code = urllib.urlopen("http://www.python.org/").read() # <-- Note: this does not give me any sort of error
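The call above uses the Python 2 urllib API. In Python 3 the same function lives in `urllib.request`, and `read()` returns bytes rather than a string. A minimal sketch, assuming Python 3; `fetch_html` is an illustrative helper name, and the UTF-8 fallback is an assumption of this sketch rather than anything the original answer specifies.

```python
from urllib.request import urlopen


def fetch_html(url):
    """Download a URL and return its body decoded to text."""
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
        # Use the charset declared in the response headers, if any;
        # falling back to UTF-8 is an assumption for this sketch.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Example call (needs network access):
# html_code = fetch_html("http://www.python.org/")
```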

Then you have two good options. Both are robust because they actually parse the HTML document instead of simply looking for '<' and '>' characters:

Option 1: Use Beautiful Soup

from BeautifulSoup import BeautifulSoup

# html_code is the string downloaded in step one
''.join(BeautifulSoup(html_code).findAll(text=True))

Option 2: Use the built-in Python HTMLParser class

from HTMLParser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        self.reset()      # initialize the parser state
        self.fed = []     # text fragments collected so far
    def handle_data(self, d):
        # called for the text between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = TagStripper()
    s.feed(html)
    return s.get_data()

Example using option 2:

In [22]: strip_tags('<html>hi</html>')
Out[22]: 'hi'

If you already have BeautifulSoup available, option 1 is very simple. Pasting in the TagStripper class and the strip_tags function is also straightforward.
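The HTMLParser code above targets Python 2. For readers on Python 3, here is a minimal sketch of the same idea using `html.parser` from the standard library. It assumes Python 3 and is an adaptation for illustration, not part of the original answer.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text between tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = TagStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()


print(strip_tags('<html>hi</html>'))  # prints: hi
```

Note that `handle_data` also receives the contents of script and style elements, so a real page may need those filtered out separately.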

Good luck!

Answer 1
5 Accepted 2011-02-19 21:16:04

Answer 2
1 2011-02-19 21:16:30
