I need to normalize text from the Italian Wikipedia using python3 and nltk, and I've run into one problem. Most of the words are OK, but some are mapped incorrectly, or to be more exact, some of the characters in them are.
For example:
'fruibilit\\xe3', 'n\\xe2\\xba', 'citt\\xe3'
I'm sure that the problem is with characters like à and è.
Code:
# coding: utf8
import os
from nltk import corpus, word_tokenize, ConditionalFreqDist
it_sw_plus = corpus.stopwords.words('italian') + ['doc', 'https']
#it_folder_names = ['AA', 'AB', 'AC', 'AD', 'AE', 'AF']
it_path = os.listdir('C:\\Users\\1\\projects\\i')
it_corpora = []
def normalize(raw_text):
    tokens = word_tokenize(raw_text)
    norm_tokens = []
    for token in tokens:
        if token not in it_sw_plus and token.isalpha():
            token = token.lower().encode('utf8')
            norm_tokens.append(token)
    return norm_tokens
for folder_name in it_path:
    path_to_files = 'C:\\Users\\1\\projects\\i\\%s' % (folder_name)
    files_list = os.listdir(path_to_files)
    for file_name in files_list:
        file_path = path_to_files + '\\' + file_name
        text_file = open(file_path)
        raw_text = text_file.read().decode('utf8')
        norm_tokens = normalize(raw_text)
        it_corpora.append(norm_tokens)
print(it_corpora)
How can I resolve this problem? I'm running on Win7(rus).
When I try this code:
import io
with open('C:\\Users\\1\\projects\\i\\AA\\wiki_00', 'r', encoding='utf8') as fin:
    for line in fin:
        print (line)
In PowerShell:
<doc id="2" url="https://it.wikipedia.org/wiki?curid=2" title="Armonium">
Armonium
Traceback (most recent call last):
File "i.py", line 5, in <module>
print (line)
File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 3: character maps to <undefined>
In Python command line:
<doc id="2" url="https://it.wikipedia.org/wiki?curid=2" title="Armonium">
Armonium
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\1\projects\i.py", line 5, in <module>
print (line)
File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position
3: character maps to <undefined>
When I try the request:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python35-32\lib\encodings\cp866.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position
90: character maps to <undefined>
If you know the file's encoding, try specifying it when reading the file. In python2:
import io
with io.open(filename, 'r', encoding='latin-1') as fin:
    for line in fin:
        print line # line is a unicode object decoded from latin-1
But in your case, the file you've posted isn't a latin1 file but a utf8 file. In python3:
>>> import urllib.request
>>> url = 'https://raw.githubusercontent.com/GiteItAwayNow/TrueTry/master/it'
>>> response = urllib.request.urlopen(url)
>>> data = response.read()
>>> text = data.decode('utf8')
>>> print (text) # this prints the file perfectly.
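To see why the choice of codec matters, here is a minimal sketch that decodes the same UTF-8 bytes once with the wrong codec and once with the right one. The sample word is taken from the question. The wrong result resembles the 'citt\xe3' shown there, but which codec was actually involved on the asker's machine is an assumption.

```python
# UTF-8 stores 'à' as two bytes: 0xC3 0xA0.
raw = 'città'.encode('utf8')     # b'citt\xc3\xa0'

# Decoding with the wrong codec turns each byte into its own character.
wrong = raw.decode('latin-1')    # 'citt' + 'Ã' + a non-breaking space
# Decoding with the codec the file was written in restores the text.
right = raw.decode('utf8')

# Lowercasing the wrong result turns 'Ã' (\xc3) into 'ã' (\xe3).
print(repr(wrong.lower()))
print(repr(right.lower()))
```

The bytes are identical in both cases; only the codec used to interpret them differs.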
To read a 'utf8' file in python2:
import io
with io.open(filename, 'r', encoding='utf8') as fin:
    for line in fin:
        print (line) # line is a unicode object decoded from utf8
To read a 'utf8' file in python3:
with open(filename, 'r', encoding='utf8') as fin:
    for line in fin:
        print (line) # line is a str decoded from utf8
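Applied to the normalization loop from the question, this suggests opening each file with an explicit encoding and keeping the tokens as str, with no .decode() or .encode() calls. The following is only a sketch under assumptions: the regex tokenizer is a stand-in for nltk's word_tokenize (to avoid the dependency), STOPWORDS is a hypothetical short list, not nltk's Italian stopwords, and the temporary file exists only to make the example self-contained.

```python
import os
import re
import tempfile

# Hypothetical stopword set; the question uses nltk's Italian stopword list.
STOPWORDS = {'la', 'di', 'è', 'doc', 'https'}

def normalize(raw_text):
    """Lowercase alphabetic tokens and drop stopwords; tokens stay str."""
    # \w+ is a stand-in for nltk.word_tokenize (an assumption of this sketch).
    tokens = re.findall(r'\w+', raw_text)
    return [t.lower() for t in tokens
            if t.isalpha() and t.lower() not in STOPWORDS]

# Write a small UTF-8 sample file so the example runs on its own.
with tempfile.NamedTemporaryFile('w', encoding='utf8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('La fruibilità di Città è alta')
    sample_path = tmp.name

# open() with an explicit encoding returns str, so no .decode() is needed.
with open(sample_path, 'r', encoding='utf8') as fin:
    norm_tokens = normalize(fin.read())

os.remove(sample_path)
print(norm_tokens)
```

Unlike the question's code, this sketch lowercases a token before the stopword check, so that capitalized stopwords such as 'La' are also dropped.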
As a good practice, when dealing with text data, try to use unicode and python3 whenever possible. Do take a look at
Additionally, if you haven't installed this module for printing utf8 on the Windows console, you should try it:
pip install win-unicode-console
Or download this: https://pypi.python.org/packages/source/w/win_unicode_console/win_unicode_console-0.4.zip and then run python setup.py install
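The UnicodeEncodeError in the tracebacks above is raised when print() encodes text for a cp866 console, which has no mapping for '\u2013'. If installing a module is not an option, one possible fallback is to escape the characters the console cannot encode before printing them. This is a minimal sketch; console_safe is a hypothetical helper name, not part of any library.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks shown as escapes."""
    # Fall back to the console's own encoding, then to ASCII if it is unknown.
    encoding = encoding or sys.stdout.encoding or 'ascii'
    # backslashreplace swaps unencodable characters for \xNN or \uNNNN escapes.
    return text.encode(encoding, errors='backslashreplace').decode(encoding)

line = 'Armonium \u2013 città'

# A plain encode to cp866 fails the same way print() does in the tracebacks.
try:
    line.encode('cp866')
    failed = False
except UnicodeEncodeError:
    failed = True

safe_line = console_safe(line, 'cp866')
print(safe_line)
```

The escaped output is lossy for display, so this only avoids the crash. The data in memory and on disk is unchanged.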