Simple Python question about urlopen
I am trying to make a program that deletes all the tags in an HTML document, so I wrote this:
import urllib
loc_left = 0
while loc_left != -1:
    html_code = urllib.urlopen("http://www.python.org/").read()
    loc_left = html_code.find('<')
    loc_right = html_code.find('>')
    str_in_braket = html_code[loc_left, loc_right + 1]
    html_code.replace(str_in_braket, "")
But it shows the error message below:
lee@Lee-Computer:~/pyt$ python html_braket.py
Traceback (most recent call last):
  File "html_braket.py", line 1, in <module>
    import urllib
  File "/usr/lib/python2.6/urllib.py", line 25, in <module>
    import string
  File "/home/lee/pyt/string.py", line 4, in <module>
    html_code = urllib.urlopen("http://www.python.org/").read()
AttributeError: 'module' object has no attribute 'urlopen'
Interestingly, if I type the same code directly into the Python interpreter, the error above doesn't show up.
You've named a script string.py. The urllib module imports it, thinking that it's the same string module that's in the stdlib. The code in your string.py then calls urllib.urlopen while urllib is still only partially initialized, so the attribute doesn't exist yet. Name your script something else.
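If you want to confirm this kind of name clash, you can ask Python where it would load a module from. The following is only a sketch, written for Python 3 (the question uses Python 2.6); the helper names module_origin and stdlib_name_is_shadowed are made up for illustration, and the check only makes sense for names that belong to the standard library.

```python
import importlib.util
import os
import sysconfig


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def stdlib_name_is_shadowed(name):
    """Guess whether a stdlib module name resolves to a file outside the stdlib directory.

    Hypothetical helper: built-in and frozen modules are treated as not shadowed.
    """
    origin = module_origin(name)
    if origin is None or origin in ("built-in", "frozen"):
        return False
    stdlib_dir = os.path.realpath(sysconfig.get_paths()["stdlib"])
    return not os.path.realpath(origin).startswith(stdlib_dir + os.sep)


if __name__ == "__main__":
    # With a stray string.py next to the script, the printed path would point at it.
    print(module_origin("string"))
    print(stdlib_name_is_shadowed("string"))
```

Run from the directory that contains the stray string.py, the first line printed should point at your own file rather than the one under the Python installation.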
Step one is to download the document so you have it in a string:
import urllib
html_code = urllib.urlopen("http://www.python.org/").read() # <-- Note: this does not give me any sort of error
Then you have two pretty nice options which will be robust since they actually parse the HTML document, rather than simply looking for '<' and '>' characters:
Option 1: Use Beautiful Soup
from BeautifulSoup import BeautifulSoup
# html_code is the string downloaded in step one
''.join(BeautifulSoup(html_code).findAll(text=True))
Option 2: Use the built-in Python HTMLParser class
from HTMLParser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        self.reset()   # initialise the parser state
        self.fed = []  # text chunks found between tags

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = TagStripper()
    s.feed(html)
    return s.get_data()
Example using option 2:
In [22]: strip_tags('<html>hi</html>')
Out[22]: 'hi'
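The code above is Python 2. On Python 3 the same idea still works, but the module is html.parser, and the download call becomes urllib.request.urlopen(...).read(), which returns bytes that you decode before parsing. A rough Python 3 sketch of option 2, keeping the TagStripper and strip_tags names from above:

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text between tags (Python 3 variant of the class above)."""

    def __init__(self):
        # convert_charrefs=True makes the parser turn entities such as &amp; into text.
        super().__init__(convert_charrefs=True)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = TagStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()


if __name__ == "__main__":
    print(strip_tags('<html>hi</html>'))
```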
If you already have BeautifulSoup available, then that's pretty simple. Pasting in the TagStripper class and strip_tags function is also pretty straightforward.
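For comparison, the find-based loop from the question can be made to terminate if the text is fetched once, the slice uses a colon instead of a comma, and the shortened string is kept (strings are immutable, so replace and slicing return new strings). This is only a sketch of that approach, with a made-up function name strip_tags_naive and a literal string standing in for the downloaded page:

```python
def strip_tags_naive(html_code):
    """Remove everything between '<' and the next '>' by plain string searching."""
    loc_left = html_code.find('<')
    while loc_left != -1:
        loc_right = html_code.find('>', loc_left)
        if loc_right == -1:
            break  # unmatched '<': stop instead of looping forever
        # Keep the text before '<' and after '>', dropping the tag itself.
        html_code = html_code[:loc_left] + html_code[loc_right + 1:]
        loc_left = html_code.find('<', loc_left)
    return html_code


if __name__ == "__main__":
    print(strip_tags_naive('<html>hi</html>'))
    # A '>' inside an attribute value cuts the tag short and leaves debris behind.
    print(strip_tags_naive('<a title="x>y">link</a>'))
```

The second call shows why a real parser is the safer choice: the '>' inside the attribute value is taken as the end of the tag, so part of the attribute leaks into the output.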
Good luck!