简体   繁体   中英

Trouble reading string with non-ascii characters in python 3

I am trying to read images from WikiArt dataset. However, I cannot load some images which contain non-ascii characters: For example: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg' although the file exists in the directory. I also compared the output string name from os.listdir() and the one from FileNotFoundError: No such file: '/wiki_art_paintings/rescaled_600px_max_side/Expressionism/fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg' by doing 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg' == 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg' . The output is False .

What can be a problem here?

Problem is because in Unicode you can use single character or create some character as combinations of two other charactes and you have both situations in two different places. In one place you have some characters as single characters (with single code) and in other place you have characters as combinatins of two other characters (with two codes). You can see even difference when you use len() for boths strings. In your example one version has lenght 53 and other has 52

It seems you could convert one name to another using unicodedata.normalize() with one of option NFC , NFKC , NFD , NFKD . So you have to test which one will work for you.

In one direction you may need NFC or NFKC , in other direction you may need NFD or NFKD .

You can also use unidecode to create text without native characters: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg but this may not be so useful for you.

import unicodedata
from unidecode import unidecode

a = 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
b = 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'

print('a:', a)
print('b:', b)

print('--- len ---')
print('len(a):', len(a))
print('len(b):', len(b))

print('--- encode ---')
print('a.encode:', a.encode('utf-8'))
print('b.encode:', b.encode('utf-8'))

print('--- a == normalize(b) ---')
print('NFC: ', a == unicodedata.normalize('NFC', b) )
print('NFKC:', a == unicodedata.normalize('NFKC', b) )
print('NFD: ', a == unicodedata.normalize('NFD', b) )
print('NFKD:', a == unicodedata.normalize('NFKD', b) )

print('--- b == normalize(a) ---')
print('NFC: ', b == unicodedata.normalize('NFC', a) )
print('NFKC:', b == unicodedata.normalize('NFKC', a) )
print('NFD: ', b == unicodedata.normalize('NFD', a) )
print('NFKD:', b == unicodedata.normalize('NFKD', a) )

print('--- unidecode ---')
print('a:', unidecode(a))
print('b:', unidecode(b))

Result:

a: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
b: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
--- len ---
len(a): 53
len(b): 52
--- encode ---
a.encode: b'fa\xcc\x83\xc2\xa9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
b.encode: b'f\xc3\xa3\xc2\xa9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
--- a == normalize(b) ---
NFC:  False
NFKC: False
NFD:  True
NFKD: True
--- b == normalize(a) ---
NFC:  True
NFKC: True
NFD:  False
NFKD: False
--- unidecode ---
a: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
b: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg

I met characters as combination of two other characters only when I have to transfer MacOS files to other system


Doc: unicodedata

Pythonsheet: Unicode

Stackoverflow: Normalizing Unicode

The two strings are not the same. Look:

> ciao='fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'.encode('utf-8')       
> bye='fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'.encode('utf-8')        
> ciao.hex() 
 '6661cc83c2a96c69782d64656c2d6d61726c655f6e752d6167656e6f75696c6c2d7375722d666f6e642d626c65752d313933372e6a7067'
> bye.hex()  
 '66c3a3c2a96c69782d64656c2d6d61726c655f6e752d6167656e6f75696c6c2d7375722d666f6e642d626c65752d313933372e6a7067'
> ciao2='fa'.encode('utf-8')
> bye2='f'.encode('utf-8')
> ciao2.hex()
 '6661'
> bye2.hex() 
 '66'

it seems there is an hidden character around the 'f'. It seems a 'a'

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM