
Can't get correct representation for some Italian words

I need to normalize text from the Italian Wikipedia using Python 3 and NLTK, and I've run into one problem. Most words come out fine, but some are mapped incorrectly; more precisely, some of their characters are.

For example:

'fruibilit\\xe3', 'n\\xe2\\xba', 'citt\\xe3'

I'm sure the problem is with accented characters such as à and è.

Code:

# coding: utf8
import os

from nltk import corpus, word_tokenize, ConditionalFreqDist


it_sw_plus = corpus.stopwords.words('italian') + ['doc', 'https']
#it_folder_names = ['AA', 'AB', 'AC', 'AD', 'AE', 'AF']
it_path = os.listdir('C:\\Users\\1\\projects\\i')
it_corpora = []

def normalize(raw_text):
    tokens = word_tokenize(raw_text)
    norm_tokens = []
    for token in tokens:
        if token not in it_sw_plus and token.isalpha():
            token = token.lower().encode('utf8')
            norm_tokens.append(token)
    return norm_tokens

for folder_name in it_path:
    path_to_files = 'C:\\Users\\1\\projects\\i\\%s' % (folder_name)
    files_list = os.listdir(path_to_files)
    for file_name in files_list:
        file_path = path_to_files + '\\' + file_name
        text_file = open(file_path)
        raw_text = text_file.read().decode('utf8')
        norm_tokens = normalize(raw_text)
        it_corpora.append(norm_tokens)
    print(it_corpora)

How can I resolve this problem? I'm running on Windows 7 (Russian).

When I try this code:

import io

with open('C:\\Users\\1\\projects\\i\\AA\\wiki_00', 'r', encoding='utf8') as fin:
    for line in fin:
        print (line) 

In PowerShell:

    <doc id="2" url="https://it.wikipedia.org/wiki?curid=2" title="Armonium">

Armonium



Traceback (most recent call last):
  File "i.py", line 5, in <module>
    print (line)
  File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 3: character maps to <undefined>

In Python command line:

<doc id="2" url="https://it.wikipedia.org/wiki?curid=2" title="Armonium">

Armonium



Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\1\projects\i.py", line 5, in <module>
    print (line)
  File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position
3: character maps to <undefined>

When I try the urllib request:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position
90: character maps to <undefined>

If you know the file's encoding, specify it when reading the file. In Python 2:

import io
with io.open(filename, 'r', encoding='latin-1') as fin:
    for line in fin:
        print line # line is a unicode string decoded from latin-1

But in your case, the file you posted is not a Latin-1 file but a UTF-8 file. In Python 3:

>>> import urllib.request
>>> url = 'https://raw.githubusercontent.com/GiteItAwayNow/TrueTry/master/it'
>>> response = urllib.request.urlopen(url)
>>> data = response.read()
>>> text = data.decode('utf8')
>>> print (text) # this prints the file perfectly.
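
The codec matters because the same bytes turn into different text under different encodings. A minimal sketch, using a word from the question (the variable names are only illustrative):

```python
# 'città' written with an escape so the source file's own encoding is irrelevant.
word = 'citt\u00e0'

data = word.encode('utf8')       # the bytes as they are stored in a UTF-8 file
right = data.decode('utf8')      # the matching codec round-trips the text
wrong = data.decode('latin-1')   # wrong codec: every byte becomes one character

print(len(data))                 # 6: the accented letter takes two bytes
print(right == word)             # True
print(len(wrong))                # 6: two junk characters stand in for the 'à'
```

Decoding with the wrong codec does not raise an error here, which is why the damage often shows up only later, in the tokens.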

To read a UTF-8 file in Python 2:

import io
with io.open(filename, 'r', encoding='utf8') as fin:
    for line in fin:
        print (line) # line is a unicode string decoded from utf8

To read a UTF-8 file in Python 3:

with open(filename, 'r', encoding='utf8') as fin:
    for line in fin:
        print (line) # line is a str decoded from utf8
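
Once the file is decoded on read, the tokens can stay as `str`, with no `.encode()` or `.decode()` calls in the normalization step. The following is a rough sketch of that idea, not the asker's exact pipeline: the regex tokenizer and the tiny stopword set are hypothetical stand-ins for `word_tokenize` and the NLTK Italian stopword list, used so the example runs without NLTK data.

```python
import re

# Hypothetical stand-in for corpus.stopwords.words('italian') + ['doc', 'https'].
STOPWORDS = {'la', 'di', 'doc', 'https'}

def tokenize(text):
    """Hypothetical stand-in for nltk.word_tokenize: runs of word characters."""
    return re.findall(r'\w+', text)

def normalize(raw_text, stopwords=STOPWORDS, tokenizer=tokenize):
    """Return lowercase alphabetic tokens as str, never as bytes."""
    norm_tokens = []
    for token in tokenizer(raw_text):
        token = token.lower()  # lowercase first so 'La' matches the stopword 'la'
        if token.isalpha() and token not in stopwords:
            norm_tokens.append(token)  # no .encode(): keep text as str
    return norm_tokens

tokens = normalize('La Citt\u00e0 di Roma')
print(all(isinstance(t, str) for t in tokens))  # True
```

Passing `word_tokenize` as `tokenizer` and the NLTK list as `stopwords` should give the same structure with the real tools. Encoding to bytes is only needed at the point where text is written out.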

As a good practice when dealing with text data, use Unicode strings and Python 3 whenever possible.

Additionally, if you haven't installed this module for printing UTF-8 text to the Windows console, you should try it:

pip install win-unicode-console

Or download https://pypi.python.org/packages/source/w/win_unicode_console/win_unicode_console-0.4.zip, unpack it, and run python setup.py install.
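
The tracebacks in the question come from `print`, not from reading the file: the console uses cp866, which has no mapping for '\u2013' (an en dash). If installing a module is not an option, one possible fallback is to replace unencodable characters before printing. This is a sketch using only the standard library; `console_safe` is a hypothetical helper and is not part of win_unicode_console.

```python
import sys

def console_safe(text, encoding=None):
    """Replace characters the target encoding cannot represent with '?'."""
    encoding = encoding or sys.stdout.encoding or 'ascii'
    return text.encode(encoding, errors='replace').decode(encoding)

line = 'Armonium \u2013 strumento'

# Reproduce the failure from the traceback without touching the console.
try:
    line.encode('cp866')
except UnicodeEncodeError as exc:
    print('cp866 cannot encode', ascii(exc.object[exc.start:exc.end]))

print(console_safe(line, 'cp866'))  # Armonium ? strumento
```

This loses the replaced characters in the printed output only; the decoded text in memory, and anything computed from it, is unaffected.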
