
Simple Python question about urlopen

I am trying to write a program that removes all the tags from an HTML document, so I wrote this:

import urllib
loc_left = 0
while loc_left != -1 :
    html_code = urllib.urlopen("http://www.python.org/").read()

    loc_left = html_code.find('<')
    loc_right = html_code.find('>')

    str_in_braket = html_code[loc_left, loc_right + 1]

    html_code.replace(str_in_braket, "")

But it shows the following error message:

lee@Lee-Computer:~/pyt$ python html_braket.py
Traceback (most recent call last):
  File "html_braket.py", line 1, in <module>
    import urllib
  File "/usr/lib/python2.6/urllib.py", line 25, in <module>
    import string
  File "/home/lee/pyt/string.py", line 4, in <module>
    html_code = urllib.urlopen("http://www.python.org/").read()
AttributeError: 'module' object has no attribute 'urlopen'

Interestingly, if I type the same code into the interactive Python interpreter, the error above does not show up.

You've named a script string.py. When urllib runs "import string", Python finds your file first, because the script's directory comes before the standard library on the module search path. Your string.py then executes while urllib is still only partially initialized, so urllib.urlopen does not exist yet, which is exactly what the AttributeError says. Name your script something else.
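If you want to confirm which file Python actually picked up, one quick check is to look at the module's __file__ attribute. The following is only a minimal sketch; the helper name loaded_from is made up for illustration and is not part of the standard library.

```python
import os
import string
import sys


def loaded_from(module, directory):
    """Return True if `module` was loaded from a file located directly in `directory`.

    Hypothetical helper for illustration only.
    """
    # Built-in modules such as sys may have no __file__ attribute at all.
    path = getattr(module, "__file__", None)
    if path is None:
        return False
    module_dir = os.path.dirname(os.path.abspath(path))
    return module_dir == os.path.abspath(directory)


# Where did `string` really come from?
print(string.__file__)

# If this prints True, a file in the current directory is shadowing the stdlib module.
print(loaded_from(string, os.getcwd()))
```

In the situation from the question, the first line would be expected to point into /home/lee/pyt/ rather than /usr/lib/python2.6/.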

Step one is to download the document so you can have it contained in a string:

import urllib
html_code = urllib.urlopen("http://www.python.org/").read() # <-- Note: this does not give me any sort of error

Then you have two pretty nice options which will be robust since they actually parse the HTML document, rather than simply looking for '<' and '>' characters:

Option 1: Use Beautiful Soup

from BeautifulSoup import BeautifulSoup

''.join(BeautifulSoup(html_code).findAll(text=True))

Option 2: Use the built-in Python HTMLParser class

from HTMLParser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = TagStripper()
    s.feed(html)
    return s.get_data()

Example using option 2:

In [22]: strip_tags('<html>hi</html>')
Out[22]: 'hi'
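The snippets in this answer are written for Python 2. On Python 3 the parser class lives in html.parser instead of HTMLParser, and the base class initializer has to be called. A rough port of the same TagStripper idea, assuming Python 3, might look like this:

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    def __init__(self):
        # Python 3: let the base class set up its own state (this also calls reset()).
        super().__init__()
        self.fed = []

    def handle_data(self, d):
        # Called for each run of text found between tags.
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = TagStripper()
    s.feed(html)
    s.close()  # flush any text still buffered at the end of the input
    return s.get_data()


print(strip_tags('<p>Hello, <b>world</b>!</p>'))  # prints: Hello, world!
```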

If you already have BeautifulSoup available, then that's pretty simple. Pasting in the TagStripper class and strip_tags function is also pretty straightforward.

Good luck!
