I am implementing an app in which I need to read a file and normalize its contents, but while reading the file I get the error shown below. Here is my attempt:
def unicodeToAscii(self,s):
    return ''.join(c for c in unicodedata.normalize('NFD',s) if unicodedata.category(c)!='Mn')

def normalizeString(self,s):
    s=self.unicodeToAscii(s.lower().strip())
    s=re.sub(r"([.!?])",r" \1",s)
    s=re.sub(r"([^a-zA-Z.!?])",r" ",s)
    s=re.sub(r"(\s+)",r" ",s).strip()
    return s

dataFile=os.path.join('/home/amit/Downloads/cornell_movie_dialogs_corpus/cornell movie-dialogs corpus','formatted_movie_lines')
print('please wait .. reading a file')
lines =open(dataFile).read().strip().split('\n')
vocal=Vocabulary()
pairs=[[vocal.normalizeString(unicode(s))for s in pair.split('\t')] for pair in lines]
print('done reading')
Error:
please wait .. reading a file
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-2-4142a7dbef84> in <module>()
118 lines =open(dataFile).read().strip().split('\n')
119 vocal=Vocabulary()
--> 120 pairs=[[vocal.normalizeString(unicode(s))for s in pair.split('\t')] for pair in lines]
121 print('done reading')
122
UnicodeDecodeError: 'ascii' codec can't decode byte 0xad in position 28: ordinal not in range(128)
The Unicode normalization you are performing does not convert everything to ASCII. It only applies a Unicode normalization form, which makes sure that variant representations of the same character are all expressed the same way. (In addition, you only drop characters in the Mn (nonspacing mark) category, so any other non-ASCII character passes through unchanged.)
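To make this concrete, here is a minimal sketch, assuming Python 3, of what NFD normalization plus Mn filtering does and does not remove. The helper names `strip_marks` and `is_ascii` are made up for this illustration; only the `unicodedata` calls come from the question's code.

```python
import unicodedata

def strip_marks(text):
    """NFD-decompose text and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

def is_ascii(text):
    """Return True if every character is in the ASCII range."""
    return all(ord(c) < 128 for c in text)

# U+00E9 (e with acute) decomposes into 'e' plus the combining mark U+0301,
# and the combining mark is then filtered out.
accented = strip_marks('caf\u00e9')

# U+00AD (soft hyphen) has no decomposition and is not in category Mn,
# so it survives the filter untouched.
soft_hyphen = strip_marks('co\u00adop')

print(is_ascii(accented))              # True
print(is_ascii(soft_hyphen))           # False
print(unicodedata.category('\u00ad'))  # Cf (format character)
```

So the filter handles accented letters, but characters such as the soft hyphen are still non-ASCII afterwards.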
For what it's worth, U+00AD is a soft hyphen, which, like the vast majority of Unicode characters, does not have a corresponding pure ASCII character, though you could approximate it with a regular hyphen-minus (-). The built-in 'replace' error handler will simply replace it with a question mark, though:
>>> '\u00ad'.encode('ascii', 'replace')
b'?'
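If you would rather approximate the soft hyphen with a regular hyphen than get a question mark, one option is to map it yourself before encoding. The following is only a sketch under assumptions, written for Python 3: `ASCII_FALLBACKS` and `to_ascii` are hypothetical names, and the table contains just the one mapping discussed here.

```python
import unicodedata

# Hypothetical replacement table: characters with a hand-picked ASCII stand-in.
ASCII_FALLBACKS = {'\u00ad': '-'}  # soft hyphen -> hyphen-minus

def to_ascii(text, errors='replace'):
    """Best-effort ASCII conversion of a Unicode string.

    1. NFD-decompose and drop combining marks (category Mn).
    2. Substitute characters listed in ASCII_FALLBACKS.
    3. Encode to ASCII; anything still non-ASCII is handled by the
       codec error handler ('replace' gives '?', 'ignore' drops it).
    """
    decomposed = unicodedata.normalize('NFD', text)
    kept = (c for c in decomposed if unicodedata.category(c) != 'Mn')
    mapped = ''.join(ASCII_FALLBACKS.get(c, c) for c in kept)
    return mapped.encode('ascii', errors).decode('ascii')

print(to_ascii('co\u00adop caf\u00e9'))         # co-op cafe
print(to_ascii('\u20ac5'))                      # ?5
print(to_ascii('\u20ac5', errors='ignore'))     # 5
```

Whether '?' or silent removal is preferable depends on what the text is used for afterwards; in the question's code the later regex replaces every character outside a-z, A-Z, '.', '!' and '?' with a space anyway.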