
Dictionary keys cannot be encoded as utf-8

I am using the Twitter streaming API (via tweepy) to capture tweets, in Python 2.7.

After I have collected a corpus of tweets, I break each tweet into words and add each word as a key to two dictionaries, whose values count how many positive or negative sentences the word appears in.

When I retrieve the words as keys of the dictionary and try to process them for the next iteration, I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)

The weird thing is that before I insert the words as dictionary keys, I encode them without errors. Here is some sample code:

pos = {}
neg = {}
for status in corpus:
    p = s.analyze(status).polarity
    words = []
    # gather real words
    for w in status.split(' '):
        try:
            words.append(w.encode('utf-8'))
        except UnicodeDecodeError as e:
            print(e)
    # assign sentiment of the sentence to the words
    for w in words:
        if w not in pos:
            pos[w] = 0
            neg[w] = 0

        if p >= 0:                    
            pos[w] += 1
        else:
            neg[w] += 1

k = pos.keys()
k = [i.encode('utf-8') for i in k]  # <-- on this line I get an error
p = [v for i, v in pos.items()]
n = [v for i, v in neg.items()]

So this piece of code raises no errors while splitting the words, but it throws an error when trying to encode the keys again. I should note that normally I wouldn't try to encode the keys again, since I would expect them to be properly encoded already. I added this extra encoding only to narrow down the source of the error.

Am I missing something? Do you see anything wrong with my code?

To avoid confusion, here is sample code closer to the original, which does not try to encode the keys again:

k = ['happy']
for i in range(3):
    print('sampling twitter --> {}'.format(i))
    myStream.filter(track=k)  # <-- this is where I will receive the error in the second iteration
    for status in corpus:
        p = s.analyze(status).polarity
        words = []
        # gather real words
        for w in status.split(' '):
            try:
                words.append(w.encode('utf-8'))
            except UnicodeDecodeError as e:
                print(e)
        # assign sentiment of the sentence to the words
        for w in words:
            if w not in pos:
                pos[w] = 0
                neg[w] = 0

            if p >= 0:                    
                pos[w] += 1
            else:
                neg[w] += 1

    k = pos.keys()

(Please suggest a better title for the question.)

Note that the error message says "'ascii' codec can't decode ...". That's because when you call encode on something that is already a bytestring in Python 2, it tries to decode it to Unicode first using the default codec.
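A minimal sketch of that implicit step, written to behave the same on Python 2 and Python 3. The helper name `py2_style_encode` is hypothetical; it only spells out what Python 2 does behind the scenes when `.encode()` is called on a bytestring, assuming the default codec is ASCII (as the error message suggests). The sample word is also an assumption, chosen because its UTF-8 form has byte 0xe2 at position 2.

```python
# -*- coding: utf-8 -*-

def py2_style_encode(bytestring, encoding='utf-8'):
    """Mimic Python 2's str.encode: implicit ASCII decode, then encode."""
    # Python 2 performs this decode silently; here it is made explicit.
    return bytestring.decode('ascii').encode(encoding)

# A word with a typographic apostrophe, already encoded once.
already_encoded = u'it\u2019s'.encode('utf-8')   # the bytes 'it\xe2\x80\x99s'

try:
    py2_style_encode(already_encoded)
    message = None
except UnicodeDecodeError as e:
    message = str(e)

print(message)
# 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)
```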

I'm not sure why you thought that encoding again would be a good idea. Don't do it; the strings are already bytestrings, so leave them as they are.
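If the same code path may receive both Unicode text and bytestrings, one possible guard is to encode only when the value is not bytes yet. This is a sketch under that assumption; `to_utf8` is a hypothetical helper name, not something from tweepy or the question.

```python
def to_utf8(value):
    """Return UTF-8 bytes; values that are already bytes pass through unchanged."""
    if isinstance(value, bytes):   # on Python 2, bytes is an alias of str
        return value
    return value.encode('utf-8')

once = to_utf8(u'it\u2019s')   # Unicode text -> bytes
twice = to_utf8(once)          # already bytes -> returned as-is, no implicit decode
```

Calling the helper repeatedly is harmless, so dictionary keys that were encoded when they were inserted can be passed through it again without triggering the ASCII decode.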

You get a decode error while trying to encode a string. This seems weird, but it is due to Python's implicit decode/encode mechanism.

Python lets you encode (Unicode) strings to obtain bytes and decode bytes to obtain strings. In other words, only strings can really be encoded and only bytes can really be decoded.
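The two directions can be illustrated with a short round trip. The sample word is an assumption chosen for illustration; the `u''` prefix keeps the snippet valid on both Python 2 and Python 3.

```python
text = u'don\u2019t'               # Unicode string, 5 characters
data = text.encode('utf-8')        # string -> bytes (the apostrophe takes 3 bytes)
restored = data.decode('utf-8')    # bytes -> string

print(len(text), len(data), restored == text)
```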

So when you try to encode bytes, Python (which does not know how to encode bytes) first tries to implicitly decode them into a string, using its default encoding, and then encodes that string. This implicit decoding is why you get a decode error while trying to encode something.

That means you are probably trying to encode something that is already encoded.
