I am trying to write a program that deletes all the tags in an HTML document, so I wrote this:
import urllib
loc_left = 0
while loc_left != -1 :
    html_code = urllib.urlopen("http://www.python.org/").read()
    loc_left = html_code.find('<')
    loc_right = html_code.find('>')
    str_in_braket = html_code[loc_left, loc_right + 1]
    html_code.replace(str_in_braket, "")
But it shows the error message below:
lee@Lee-Computer:~/pyt$ python html_braket.py
Traceback (most recent call last):
  File "html_braket.py", line 1, in <module>
    import urllib
  File "/usr/lib/python2.6/urllib.py", line 25, in <module>
    import string
  File "/home/lee/pyt/string.py", line 4, in <module>
    html_code = urllib.urlopen("http://www.python.org/").read()
AttributeError: 'module' object has no attribute 'urlopen'
Interestingly, if I type the same code directly into the Python interpreter, the error above does not show up.
You've named a script string.py. The urllib module imports this, thinking that it's the same string module that's in the stdlib, and then your code uses an attribute on the now partially-defined urllib module that doesn't yet exist. Name your script something else.
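If you want to check which file a module name actually resolves to, something like the following may help. This is a minimal sketch that assumes Python 3 (it uses importlib.util, which the original Python 2.6 setup does not have), and module_origin is a hypothetical helper name, not part of any library. If the printed path points into your own project directory rather than the standard library, a local file is shadowing the stdlib module.

```python
import importlib.util


def module_origin(name):
    """Return where the import system would load `name` from, or None if not found.

    For ordinary modules this is a file path; for built-in modules it is
    a marker string such as "built-in".
    """
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# Prints the location that `import string` would use from the current directory.
print(module_origin("string"))
```

Run it from the same directory as your script, since the current directory is what puts a local string.py ahead of the stdlib one.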
Step one is to download the document so you can have it contained in a string:
import urllib
html_code = urllib.urlopen("http://www.python.org/").read() # <-- Note: this does not give me any sort of error
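The snippet above is Python 2. On Python 3, urlopen lives in urllib.request and read() returns bytes, so the download step might look roughly like this. It is a sketch only: fetch_html is a hypothetical helper name, and falling back to UTF-8 when the server does not declare a charset is an assumption.

```python
import urllib.request


def fetch_html(url):
    """Download `url` and return the response body decoded to text."""
    with urllib.request.urlopen(url) as response:
        # Use the charset from the Content-Type header if present;
        # otherwise assume UTF-8 (an assumption, not a guarantee).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example (requires network access):
# html_code = fetch_html("http://www.python.org/")
```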
Then you have two pretty nice options which will be robust since they actually parse the HTML document, rather than simply looking for '<' and '>' characters:
Option 1: Use Beautiful Soup
from BeautifulSoup import BeautifulSoup
text = ''.join(BeautifulSoup(html_code).findAll(text=True))
Option 2: Use the built-in Python HTMLParser class
from HTMLParser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []

    def handle_data(self, d):
        # Called with each run of text found between tags
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = TagStripper()
    s.feed(html)
    return s.get_data()
Example using option 2:
In [22]: strip_tags('<html>hi</html>')
Out[22]: 'hi'
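For Python 3, the parser moved to html.parser, and the subclass needs to call the base __init__ instead of just reset(). A rough equivalent of option 2, keeping the same TagStripper and strip_tags names, could look like this (a sketch, assuming the default convert_charrefs=True behaviour, so entities such as &amp; come back already decoded):

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text content of an HTML document."""

    def __init__(self):
        super().__init__()  # sets up parser state, including convert_charrefs
        self.fed = []

    def handle_data(self, d):
        # Called with each run of text found between tags
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = TagStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()
```

As with the Python 2 version, strip_tags('<html>hi</html>') returns 'hi'.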
If you already have BeautifulSoup available, then that's pretty simple. Pasting in the TagStripper class and strip_tags function is also pretty straightforward.
Good luck!