
Trouble converting a string from Unicode in Python 2.7?

I'm extremely confused over unicode in Python 2.x.

I'm using BeautifulSoup to scrape a webpage, and I'm trying to insert the things I find into a dictionary with the name as the key, and the url as the value.

I'm using BeautifulSoup's find function to get the info I need. My code started out as follows:

name = i.find('a').string
url = i.find('a').get('href')

This works, except that the thing returned from find is an object, not a string.

Here's where things start confusing me.

If I try to convert it to type str before I assign it to the variable, it sometimes throws a UnicodeEncodeError:

'ascii' codec can't encode character u'\xa0' in position 5: ordinal not in range(128)

I Googled around and found that I should be encoding to ASCII.

I tried adding:

print str(i.find('a').string).encode('ascii', 'ignore')

No luck; it still gives a Unicode error.

From there, I tried using repr:

print repr(i.find('a').string)

And that works... almost!

I ran into a new problem here.

Once everything is said and done and the dictionary is built, I can't bloody access anything! It keeps giving me a KeyError.

I can loop over the dict:

for i in sorted(data.iterkeys()):
    print i


>>> u'Key1'
>>> u'Key2'
>>> u'Key3'
>>> u'Key4'

but if I try to access an item of the dict like this:

print data['key1']

OR

print data[u'key1']

OR

test = unicode('key1')
print data[test]

They all raise a KeyError, which is 100% confusing to me. I assume it's got something to do with them being Unicode objects.

I've tried just about everything I can come up with, but I can't figure out what's going on.

Oh! Adding to the oddity, this code:

name = repr(i.find('a').string)
print type(name)

returns

>>> type(str)

but if I just print the thing

print name

it shows it as a unicode string:

>>> u'string name'

The .string value is indeed not a string. You need to cast it with unicode():

name = unicode(i.find('a').string)

It's a unicode-like object called NavigableString. If you really need it to be a str instead, you can encode it from there:

name = unicode(i.find('a').string).encode('utf8')

or similar. For use in a dict I'd use unicode() objects and not encode.
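
A note on the KeyError from the question: a likely cause is that repr() was used to build the keys. repr() returns the source-code form of a value, quotes included, so the dictionary key is no longer the name itself. Below is a minimal sketch of that effect. The names and URL are made up for illustration (they are not from the original code), and it is written so it also runs on Python 3, where the repr is 'Key1' with quotes rather than u'Key1'.

```python
# Hypothetical stand-in for the text returned by i.find('a').string.
name = u'Key1'
url = 'http://example.com/1'  # made-up URL for illustration

# Keys built with repr() carry the quoting characters inside the key itself.
bad = {repr(name): url}
bad_key = list(bad)[0]        # "u'Key1'" on Python 2.7, "'Key1'" on Python 3
lookup_fails = name not in bad

# Keys stored as plain text can be looked up by the name directly.
good = {name: url}
found = good[u'Key1']

print(bad_key)
print(lookup_fails)
print(found)
```

Lookups are also case-sensitive, so a key printed as Key1 would not be found under 'key1' either way.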

To understand the difference between unicode() and str() and what encoding to use, I recommend you read the Python Unicode HOWTO.
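
As a rough illustration of that difference, the sketch below uses a made-up sample string containing the same u'\xa0' (non-breaking space) character named in the error message. In Python 2, calling str() on a unicode value implicitly encodes it with the ASCII codec, which is the step that fails. An explicit encode() lets you choose the codec and the error handling. The variable names are illustrative only, and the code is written so it also runs on Python 3.

```python
# Made-up sample text containing a non-breaking space (u'\xa0').
text = u'Name\xa0One'

# Encoding to ASCII fails because u'\xa0' is outside the ASCII range.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# 'ignore' drops the characters the codec cannot represent.
ascii_ignored = text.encode('ascii', 'ignore')

# UTF-8 can represent every character, and decoding restores the text.
utf8_bytes = text.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')

print(ascii_failed)
print(repr(ascii_ignored))
print(round_trip == text)
```

This also suggests why the attempt in the question kept failing: in str(x).encode('ascii', 'ignore') the str() call runs first and raises before encode() is ever reached.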
