Simple Python question about urlopen

I am trying to write a program that deletes all the tags in an HTML document, so I wrote the following:

import urllib
loc_left = 0
while loc_left != -1 :
    html_code = urllib.urlopen("http://www.python.org/").read()

    loc_left = html_code.find('<')
    loc_right = html_code.find('>')

    str_in_braket = html_code[loc_left, loc_right + 1]

    html_code.replace(str_in_braket, "")

But it shows the error message below:

lee@Lee-Computer:~/pyt$ python html_braket.py
Traceback (most recent call last):
  File "html_braket.py", line 1, in <module>
    import urllib
  File "/usr/lib/python2.6/urllib.py", line 25, in <module>
    import string
  File "/home/lee/pyt/string.py", line 4, in <module>
    html_code = urllib.urlopen("http://www.python.org/").read()
AttributeError: 'module' object has no attribute 'urlopen'

One interesting thing: if I type the same code directly into the Python interpreter, the error above does not show up.

You've named a script string.py (the traceback shows /home/lee/pyt/string.py). When urllib runs "import string", Python picks up your file instead of the string module in the stdlib and executes it. Your file then uses urllib.urlopen while urllib is still only partially defined, so the attribute doesn't exist yet. Name your script something else.
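One way to check which file Python resolves for a module name is to ask the import machinery. This is a minimal sketch assuming Python 3 (the question uses Python 2.6); module_origin is a hypothetical helper name, not part of any library. For a top-level name, importlib.util.find_spec locates the module without executing it.

```python
import importlib.util


def module_origin(name):
    """Return the path Python would load for `name`, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# If this prints a path inside your project directory rather than the
# Python installation, a local file is shadowing the standard-library module.
print(module_origin("string"))
```

If the printed path points at your own directory, renaming that file should let the stdlib module be found again.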

Step one is to download the document so you can have it contained in a string:

import urllib
html_code = urllib.urlopen("http://www.python.org/").read() # <-- Note: this does not give me any sort of error
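
The call above is Python 2. On Python 3, urlopen lives in urllib.request and read() returns bytes. The following is a rough sketch of the same step under that assumption; fetch_html is a hypothetical helper name, and the UTF-8 fallback is an assumption rather than something the original answer specifies.

```python
import urllib.request


def fetch_html(url):
    """Download `url` and return the body decoded to text (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes on Python 3
        # Use the charset declared in the response headers, if any;
        # falling back to UTF-8 is an assumption made for this sketch.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("http://www.python.org/") would then play the role of html_code in the steps that follow.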

Then you have two pretty nice options which will be robust since they actually parse the HTML document, rather than simply looking for '<' and '>' characters:

Option 1: Use Beautiful Soup

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path

# html_code is the string downloaded in step one
text = ''.join(BeautifulSoup(html_code).findAll(text=True))

Option 2: Use the built-in Python HTMLParser class

from HTMLParser import HTMLParser

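class TagStripper(HTMLParser):
    def __init__(self):
        self.reset()   # initialise the parser state
        self.fed = []  # text fragments collected so far
    def handle_data(self, d):
        # called by the parser for each run of text between tags
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = TagStripper()
    s.feed(html)
    return s.get_data()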

Example using option 2:

In [22]: strip_tags('<html>hi</html>')
Out[22]: 'hi'
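
The Option 2 code is written for Python 2, where the module is named HTMLParser. For reference, here is a sketch of the same idea on Python 3, assuming the standard html.parser module; the class and function names are kept from the answer, and the close() call is an addition to flush any text the parser may still be buffering.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()  # Python 3: let the base class initialise itself
        self.fed = []       # text fragments collected so far

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        self.fed.append(data)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()  # flush any text still buffered by the parser
    return stripper.get_data()


print(strip_tags('<html>hi</html>'))  # prints: hi
```

Note that this keeps all text nodes, so the contents of script and style elements would also appear in the output unless they are filtered separately.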

If you already have BeautifulSoup available, then Option 1 is pretty simple. Pasting in the TagStripper class and strip_tags function from Option 2 is also pretty straightforward and needs nothing beyond the standard library.

Good luck!

 