Trouble reading string with non-ASCII characters in Python 3
I am trying to read images from the WikiArt dataset. However, I cannot load some images whose names contain non-ASCII characters, for example 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg', although the file exists in the directory.
I also compared the file name returned by os.listdir() with the one from the error message
FileNotFoundError: No such file: '/wiki_art_paintings/rescaled_600px_max_side/Expressionism/fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
by evaluating
'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg' == 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
The output is False.
What could be the problem here?
The problem is that Unicode lets you write some characters either as a single precomposed character (one code point) or as a combination of two characters (a base character followed by a combining mark, so two code points), and your two strings use different forms. In one place the character is stored as a single code point, and in the other place it is stored as two. You can see the difference when you call len() on both strings: in your example one version has length 53 and the other has length 52.
It seems you can convert one name to the other using unicodedata.normalize() with one of the forms NFC, NFKC, NFD or NFKD, so you have to test which one works for you.
In one direction you may need NFC or NFKC (which compose characters), and in the other direction you may need NFD or NFKD (which decompose them).
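If the goal is only to locate the file, one option is to compare names after normalizing both sides to the same form, so it no longer matters which form the file system returned. The following is a minimal sketch of that idea, not the only way to do it; `match_name` is a hypothetical helper name, and the short file names are made up for illustration.

```python
import unicodedata


def match_name(wanted, candidates, form="NFC"):
    """Return the candidate equal to `wanted` after normalization, or None."""
    target = unicodedata.normalize(form, wanted)
    for name in candidates:
        # Normalize each candidate to the same form before comparing.
        if unicodedata.normalize(form, name) == target:
            return name
    return None


# Illustrative names: the same text in decomposed and precomposed form.
decomposed = "fa\u0303\u00a9lix.jpg"   # 'a' followed by a combining tilde
precomposed = "f\u00e3\u00a9lix.jpg"   # single precomposed character

print(decomposed == precomposed)
print(match_name(precomposed, [decomposed, "other.jpg"]) == decomposed)
```

In practice you would pass os.listdir(directory) as `candidates` and join the returned name with the directory, so that the path you open uses exactly the form stored on disk.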
You can also use unidecode to create text without non-ASCII characters, fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg, but this may not be very useful for you.
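import unicodedata
from unidecode import unidecode  # third-party package

# The two names look identical on screen, so they are written with escapes here.
a = 'fa\u0303\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'  # 'a' + combining tilde
b = 'f\u00e3\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'  # precomposed character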
print('a:', a)
print('b:', b)
print('--- len ---')
print('len(a):', len(a))
print('len(b):', len(b))
print('--- encode ---')
print('a.encode:', a.encode('utf-8'))
print('b.encode:', b.encode('utf-8'))
print('--- a == normalize(b) ---')
print('NFC: ', a == unicodedata.normalize('NFC', b) )
print('NFKC:', a == unicodedata.normalize('NFKC', b) )
print('NFD: ', a == unicodedata.normalize('NFD', b) )
print('NFKD:', a == unicodedata.normalize('NFKD', b) )
print('--- b == normalize(a) ---')
print('NFC: ', b == unicodedata.normalize('NFC', a) )
print('NFKC:', b == unicodedata.normalize('NFKC', a) )
print('NFD: ', b == unicodedata.normalize('NFD', a) )
print('NFKD:', b == unicodedata.normalize('NFKD', a) )
print('--- unidecode ---')
print('a:', unidecode(a))
print('b:', unidecode(b))
Result:
a: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
b: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
--- len ---
len(a): 53
len(b): 52
--- encode ---
a.encode: b'fa\xcc\x83\xc2\xa9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
b.encode: b'f\xc3\xa3\xc2\xa9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
--- a == normalize(b) ---
NFC: False
NFKC: False
NFD: True
NFKD: True
--- b == normalize(a) ---
NFC: True
NFKC: True
NFD: False
NFKD: False
--- unidecode ---
a: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
b: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
I have only encountered characters stored as a combination of two characters when transferring files from macOS to another system.
Doc: unicodedata
Pythonsheet: Unicode
Stackoverflow: Normalizing Unicode
The two strings are not the same. Look:
> ciao='fa\u0303\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'.encode('utf-8')
> bye='f\u00e3\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'.encode('utf-8')
> ciao.hex()
'6661cc83c2a96c69782d64656c2d6d61726c655f6e752d6167656e6f75696c6c2d7375722d666f6e642d626c65752d313933372e6a7067'
> bye.hex()
'66c3a3c2a96c69782d64656c2d6d61726c655f6e752d6167656e6f75696c6c2d7375722d666f6e642d626c65752d313933372e6a7067'
> ciao2='fa'.encode('utf-8')
> bye2='f'.encode('utf-8')
> ciao2.hex()
'6661'
> bye2.hex()
'66'
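It seems there is a hidden character next to the 'f'. It appears to be an 'a': in the first hex dump the bytes after 66 ('f') are 61 cc 83 (an 'a' followed by a combining tilde), whereas in the second they are c3 a3 (a single precomposed character).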
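To see which characters a string really contains, it can help to list the code point and Unicode name of each character instead of reading raw hex. This is a small sketch using the standard unicodedata module; `describe` is a hypothetical helper name, and only the first three or four characters of the file name are used to keep the output short.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


decomposed = "fa\u0303\u00a9"   # start of the name with 'a' + combining mark
precomposed = "f\u00e3\u00a9"   # start of the name with a single character

for label, text in (("decomposed", decomposed), ("precomposed", precomposed)):
    print(label)
    for code_point, name in describe(text):
        print("  ", code_point, name)
```

The decomposed string lists a separate COMBINING TILDE entry after the 'a', which is the "hidden" character that makes the two names compare as different.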