
Get content-type from HTML page with BeautifulSoup

I am trying to get the character encoding for pages that I scrape, but in some cases it is failing. Here is what I am doing:

resp = urllib2.urlopen(request)
self.COOKIE_JAR.extract_cookies(resp, request)
content = resp.read()
encodeType = resp.headers.getparam('charset')
resp.close()

That is my first attempt. But if charset comes back as None, I do this:

soup = BeautifulSoup(html)
if encodeType == None:
    try:
        encodeType = soup.findAll('meta', {'http-equiv':lambda v:v.lower()=='content-type'})
    except AttributeError, e:
        print e
        try:
            encodeType = soup.findAll('meta', {'charset':lambda v:v.lower() != None})
        except AttributeError, e:
            print e
            if encodeType == '':
                encodeType = 'iso-8859-1'

The page I am testing has this in its <head>: <meta charset="ISO-8859-1">

I would expect the first try statement to return an empty string, but I get this error on both try statements (which is why the 2nd statement is nested for now):

'NoneType' object has no attribute 'lower'

What is wrong with the 2nd try statement? I am guessing the 1st one is incorrect as well, since it's throwing an error and not just coming back blank.

OR, better yet, is there a more elegant way to just remove any special character encoding from a page? The end result I am trying to accomplish is that I don't care about any of the specially encoded characters: I want to delete them and keep the raw text. Can I skip all of the above and just tell BeautifulSoup to strip anything that is encoded?

I decided to just go with whatever BeautifulSoup spits out. Then as I parse through each word in the document, if I can't convert it to a string, I just disregard it.

for word in doc.lower().split():
    try:
        word = str(word)
        word = self.handlePunctuation(word)
        if word == False:
            continue
    except UnicodeEncodeError, e:
        # word couldn't be converted to a string; most likely
        # encoding garbage we can toss anyway
        continue
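
A tidier alternative to catching UnicodeEncodeError word-by-word is to strip the non-ASCII characters up front with unicode.encode's 'ignore' error handler; a minimal sketch, assuming doc is the unicode text that BeautifulSoup produced:

# Encoding with 'ignore' silently drops every character that has
# no ASCII equivalent, leaving only the raw ASCII text.
ascii_only = doc.encode('ascii', 'ignore')
words = ascii_only.lower().split()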

When attempting to determine the character encoding of a page, I believe the order that should be tried is:

  1. Determine it from the HTML page itself via meta tags (e.g. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">)
  2. Determine the encoding via the HTTP headers (e.g. Content-Type: text/html; charset=ISO-8859-1)
  3. Finally, if the above don't yield anything, you can use an algorithm that guesses the character encoding from the distribution of bytes within the page (note that this isn't guaranteed to find the right encoding); check out the chardet library for this option. A sketch combining all three steps follows this list.
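
Putting those steps together, here is a rough Python 2 sketch that reuses the urllib2/BeautifulSoup calls from the question plus the chardet library (detect_encoding and its internals are illustrative names, not a standard API). Note the v and v.lower() guard inside the lambda: BeautifulSoup calls the lambda with None when a meta tag lacks the attribute, which is exactly what produced the 'NoneType' object has no attribute 'lower' error above.

import urllib2
import chardet
from BeautifulSoup import BeautifulSoup

def detect_encoding(url):
    resp = urllib2.urlopen(url)
    raw = resp.read()
    resp.close()

    # 1. Meta tags in the page itself
    soup = BeautifulSoup(raw)
    meta = soup.find('meta', charset=True)  # HTML5-style <meta charset="...">
    if meta:
        return meta['charset']
    meta = soup.find('meta', {'http-equiv': lambda v: v and v.lower() == 'content-type'})
    if meta:
        content = meta.get('content', '')
        if 'charset=' in content.lower():
            return content.lower().split('charset=')[-1].strip()

    # 2. The HTTP Content-Type header
    header_charset = resp.headers.getparam('charset')
    if header_charset:
        return header_charset

    # 3. Guess from the byte distribution (not guaranteed to be right)
    return chardet.detect(raw)['encoding']

print detect_encoding('http://example.com/')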
