'ascii' codec can't decode byte 0xad in position 28: ordinal not in range(128)

I am implementing an app in which one scenario is to read a file after normalising its contents, but while reading the file I get the error shown further down. Here is my attempt:

def unicodeToAscii(self, s):
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

def normalizeString(self, s):
    s = self.unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"([^a-zA-Z.!?])", r" ", s)
    s = re.sub(r"(\s+)", r" ", s).strip()
    return s

dataFile=os.path.join('/home/amit/Downloads/cornell_movie_dialogs_corpus/cornell movie-dialogs corpus','formatted_movie_lines')
print('please wait .. reading a file') 

lines =open(dataFile).read().strip().split('\n')
vocal=Vocabulary()
pairs=[[vocal.normalizeString(unicode(s))for s in pair.split('\t')] for pair in lines]
print('done reading')

Error:

please wait .. reading a file
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-2-4142a7dbef84> in <module>()
    118 lines =open(dataFile).read().strip().split('\n')
    119 vocal=Vocabulary()
--> 120 pairs=[[vocal.normalizeString(unicode(s))for s in pair.split('\t')] for pair in lines]
    121 print('done reading')
    122 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xad in position 28: ordinal not in range(128)

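The Unicode normalization you are performing does not convert everything to ASCII. It only makes sure that variant representations of the same character are all expressed the same way; NFD, for instance, decomposes an accented letter into a base letter plus combining marks. Your filter then drops the combining marks (category Mn), which strips accents but leaves every other non-ASCII character in place. Note also that the traceback shows a UnicodeDecodeError: in Python 2, unicode(s) decodes a byte string with the default ASCII codec, so it fails on byte 0xad before normalizeString is even called.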
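The limits of this kind of normalization can be checked directly. The following is a minimal Python 3 sketch (the question's code is Python 2); the function name strip_accents and the sample strings are made up for illustration. It applies the same NFD-then-drop-Mn filter and shows that an accent is removed while a soft hyphen passes through unchanged.

```python
import unicodedata

def strip_accents(s):
    """Decompose with NFD, then drop combining marks (category Mn)."""
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

def is_ascii(s):
    """True if every code point in s is below 128."""
    return all(ord(c) < 128 for c in s)

# U+00E9 decomposes to 'e' + U+0301 (a combining mark), so the accent is dropped.
accented = strip_accents('caf\u00e9')

# U+00AD (soft hyphen) has no decomposition and is not in category Mn,
# so it survives the filter.
soft_hyphen = strip_accents('co\u00adop')

print(repr(accented), is_ascii(accented))
print(repr(soft_hyphen), is_ascii(soft_hyphen))
```

So the filter is an accent stripper rather than a general to-ASCII converter; anything without a decomposition into ASCII plus combining marks has to be handled separately.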

For what it's worth, U+00AD is a soft hyphen (which is what byte 0xad decodes to if the file is Latin-1 encoded). Like the vast majority of Unicode characters, it does not have a corresponding pure ASCII character, though you could approximate it with a regular hyphen-minus (-). The built-in 'replace' error handler will simply replace it with a question mark, though:

>>> '\u00ad'.encode('ascii', 'replace')
b'?'
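One possible way to avoid the UnicodeDecodeError altogether is to decode the file explicitly when opening it, instead of relying on an implicit ASCII decode. The sketch below is written for Python 3 and assumes the data is ISO-8859-1 (Latin-1); that encoding is an assumption and should be checked against the actual file. The helper name read_lines and the temporary demo file are hypothetical stand-ins for the real data file.

```python
import io
import os
import tempfile

def read_lines(path, encoding='iso-8859-1'):
    """Read a text file as Unicode, decoding with an explicit codec.

    The default encoding here is an assumption, not a known property
    of the original data file.
    """
    with io.open(path, encoding=encoding) as f:
        return f.read().strip().split('\n')

# Build a throwaway file that contains the problematic byte 0xad.
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write(b'co\xadop\thello\nsecond\tline\n')
    demo_path = tmp.name

lines = read_lines(demo_path)
os.remove(demo_path)

# Split each line into tab-separated fields, then drop anything that has
# no ASCII representation ('ignore' removes it, 'replace' would give '?').
pairs = [line.split('\t') for line in lines]
ascii_pairs = [
    [field.encode('ascii', 'ignore').decode('ascii') for field in pair]
    for pair in pairs
]
print(ascii_pairs)
```

Whether to use 'ignore' or 'replace' depends on the data: dropping a soft hyphen is usually harmless, but silently discarding other characters may not be what you want.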
