Python encoding problems

I've read a lot about Python encoding, maybe not enough, and I've been working on this for two days, but I'm still running into trouble. I'll try to be as clear as I can. The main goal is to remove all accents and characters such as #, !, %, &...

I run a search query against the Twitter Search API with this call:

query = urllib2.urlopen(settings.SEARCH_URL + '?%s' % params)

Then I call a method, avaliar_pesquisa(), to evaluate the results I got, based on the tags (or terms) of the input:

dados = avaliar_pesquisa(simplejson.loads(query.read()), str(tags))

In avaliar_pesquisa(), the following happens:

def avaliar_pesquisa(dados, tags):
    resultados = []
    # Percorre os resultados
    for i in dados['results']:
        resultados.append({'texto'          : i['text'],
                           'imagem'         : i['profile_image_url'],
                           'classificacao'  : avaliar_texto(i['text'], tags),
                           'timestamp'      : i['created_at'],
                         })

Note the call to avaliar_texto(), which evaluates the tweet text. The problem is in the following lines:

def avaliar_texto(texto, tags):
    # Remove accents
    from unicodedata import normalize
    def strip_accents(txt):
        return normalize('NFKD', txt.decode('utf-8'))

    # Split
    texto_split = strip_accents(texto)
    texto_split = texto.lower().split()

    # Remove non-alpha characters
    import re
    pattern = re.compile('[\W_]+')
    texto_aux = []
    for i in texto_split:
        texto_aux.append(pattern.sub('', i))
    texto_split = texto_aux

The split doesn't really matter here. The issue is that if I print the type of the variable texto inside this last method, I may get either str or unicode. If the text contains any accented character, it arrives as unicode. So I get this error when running the application, which receives at most 100 tweets per response:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 17: ordinal not in range(128)

For the following text:

Text: Agora o problema é com o speedy. <type 'unicode'>

Any ideas?

See this page.

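The decode() method is meant to be applied to a str object, not a unicode object. Given a unicode string as input, Python 2 first tries to encode it to a str using the ascii codec and only then decodes the result as utf-8. The implicit ascii encoding step is what fails, which is why the traceback shows a UnicodeEncodeError even though the code calls decode().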

Try return normalize('NFKD', unicode(txt)).
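To illustrate the point above, here is a minimal sketch that decodes only when the value is a byte string and passes text through untouched, so no implicit ascii encoding is ever triggered. The helper names to_text and normalize_text are hypothetical, and the sketch assumes that any byte strings coming from the API are UTF-8 encoded. It is written to run on both Python 2.6+ (where bytes is an alias for str) and Python 3.

```python
from unicodedata import normalize


def to_text(value, encoding='utf-8'):
    """Return text, decoding only when the input is a byte string.

    Assumption: byte strings are encoded as UTF-8 unless stated otherwise.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Already text (unicode on Python 2, str on Python 3): leave it alone.
    return value


def normalize_text(txt):
    """Apply NFKD normalization without calling decode() on text."""
    return normalize('NFKD', to_text(txt))
```

With this guard, normalize_text() accepts either type, so the caller no longer needs to know whether a given tweet arrived as str or unicode.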

This is what I use in my code to discard accents and the like:

text = unicodedata.normalize('NFD', text).encode('ascii','ignore')
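As a rough sketch of how that one-liner behaves, the function below wraps it and decodes the result back to text so the return type is the same on Python 2 and Python 3. The name to_ascii is hypothetical, and the input is assumed to be text already (unicode on Python 2), not a byte string.

```python
import unicodedata


def to_ascii(text):
    """Drop accents by decomposing characters and discarding non-ASCII.

    Assumption: `text` is already text (unicode on Python 2, str on Python 3).
    """
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # 'ignore' silently drops everything that has no ASCII representation,
    # which removes the combining marks and keeps the base letters.
    return decomposed.encode('ascii', 'ignore').decode('ascii')
```

Note that this approach also drops characters that have no decomposition into an ASCII base letter (for example ø), so it is lossy for some languages.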

Try placing:

# -*- coding: utf-8 -*-

at the beginning of the Python script containing the code.
