Dictionary keys cannot be encoded as utf-8

I am using the Twitter streaming API (tweepy) to capture several tweets. I do this in Python 2.7.

After I have collected a corpus of tweets, I break each tweet into words and add each word to a dictionary as a key, where the values count how often each word appears in positive or negative sentences.

When I retrieve the words as keys of the dictionary and try to process them for the next iteration, I get this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2: ordinal not in range(128)

The weird thing is that before I place them as dictionary keys, I encode them without errors. Here is some sample code:

pos = {}
neg = {}
for status in corpus:
    p = s.analyze(status).polarity
    words = []
    # gather real words
    for w in status.split(' '):
        try:
            words.append(w.encode('utf-8'))
        except UnicodeDecodeError as e:
            print(e)
    # assign sentiment of the sentence to the words
    for w in words:
        if w not in pos:
            pos[w] = 0
            neg[w] = 0

        if p >= 0:                    
            pos[w] += 1
        else:
            neg[w] += 1

k = pos.keys()
k = [i.encode('utf-8') for i in k]  # <-- for this line I get an error
p = [v for i, v in pos.items()]
n = [v for i, v in neg.items()]

So this piece of code catches no errors while splitting the words, but it throws an error when trying to encode the keys again. I should note that normally I wouldn't try to encode the keys again, as I would think they are already properly encoded. I added this extra encoding only to narrow down the source of the error.

Am I missing something? Do you see anything wrong with my code?

To avoid confusion, here is sample code closer to the original, which does not try to encode the keys again:

k = ['happy']
for i in range(3):
    print('sampling twitter --> {}'.format(i))
    myStream.filter(track=k)  # <-- this is where I will receive the error in the second iteration
    for status in corpus:
        p = s.analyze(status).polarity
        words = []
        # gather real words
        for w in status.split(' '):
            try:
                words.append(w.encode('utf-8'))
            except UnicodeDecodeError as e:
                print(e)
        # assign sentiment of the sentence to the words
        for w in words:
            if w not in pos:
                pos[w] = 0
                neg[w] = 0

            if p >= 0:                    
                pos[w] += 1
            else:
                neg[w] += 1

    k = pos.keys()

(Please suggest a better title for the question.)

Note that the error message says "'ascii' codec can't decode ...". That's because when you call encode on something that is already a bytestring in Python 2, it first tries to decode it to Unicode using the default codec, which is ASCII.
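The hidden decode step can be made visible with a minimal sketch. The sample word is made up for illustration, and the explicit decode('ascii') call stands in for what Python 2 does implicitly, so the snippet also runs on Python 3:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample word containing a right single quotation mark (U+2019).
word = u'it\u2019s'

# First encode: text -> UTF-8 bytes. This succeeds.
encoded = word.encode('utf-8')

# In Python 2, encoded.encode('utf-8') first decodes the bytes with the
# default ASCII codec. Written out explicitly, that hidden step is:
try:
    encoded.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The first non-ASCII byte of the encoded word is 0xe2 at position 2, which is the same shape of failure as the error reported in the question.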

I'm not sure why you thought that encoding again would be a good idea. Don't do it; the strings are already bytestrings, so leave them as they are.

You get a decode error while you are trying to encode a string. This seems weird, but it is due to Python's implicit decode/encode mechanism.

Python lets you encode strings to obtain bytes and decode bytes to obtain strings. This means that Python can encode only strings and decode only bytes.
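As a small illustration of the two directions, here is a sketch with a made-up sample value. It is written to run on both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
text = u'caf\u00e9'              # text: unicode in Python 2, str in Python 3

data = text.encode('utf-8')      # encode: text -> bytes
restored = data.decode('utf-8')  # decode: bytes -> text

print(repr(data))
print(restored == text)
```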

So when you try to encode bytes, Python (which does not know how to encode bytes) tries to implicitly decode the bytes to obtain a string to encode, and it uses its default encoding to do that. This is why you get a decode error while trying to encode something: the implicit decoding fails.

That means that you are probably trying to encode something which is already encoded.
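One way to guard against encoding twice is to encode only values that are still text. The helper below is a hypothetical sketch, not part of tweepy or the original code; the name ensure_utf8 and the sample words are assumptions:

```python
# -*- coding: utf-8 -*-
TEXT_TYPE = type(u'')  # unicode on Python 2, str on Python 3


def ensure_utf8(value):
    """Hypothetical helper: encode text to UTF-8, pass anything else through."""
    if isinstance(value, TEXT_TYPE):
        return value.encode('utf-8')
    return value  # already a bytestring, so leave it alone


# One word that is still text and one that was already encoded.
words = [u'happy', u'it\u2019s'.encode('utf-8')]
keys = [ensure_utf8(w) for w in words]
print(keys)
```

Because bytestrings pass through unchanged, calling the helper on its own output is harmless, unlike calling encode a second time in Python 2.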
