Trouble reading string with non-ascii characters in python 3

I am trying to read images from the WikiArt dataset. However, I cannot load some images whose names contain non-ASCII characters, for example 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg', although the file exists in the directory. I also compared the name returned by os.listdir() with the one in the error message FileNotFoundError: No such file: '/wiki_art_paintings/rescaled_600px_max_side/Expressionism/fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg' by evaluating 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg' == 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'. The output is False.

What could be the problem here?

The problem is that Unicode lets you write some characters either as a single code point or as a combination of two code points (a base character plus a combining mark), and you have both forms in two different places. In one place the character is a single code point, and in the other it is a sequence of two. You can see the difference by calling len() on both strings: in your example one version has length 53 and the other has length 52.
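To make the "one code point versus two" idea concrete, here is a minimal sketch. It uses a single letter rather than the full file name, and the variable names `composed` and `decomposed` are only illustrative. The strings are written with escapes so that copying the code cannot silently change their form.

```python
import unicodedata

# The same visible character, written two ways (escapes keep them distinct).
composed = '\u00e3'      # one code point: LATIN SMALL LETTER A WITH TILDE
decomposed = 'a\u0303'   # two code points: 'a' + COMBINING TILDE

print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False

# List the code points that make up each string.
for label, text in (('composed', composed), ('decomposed', decomposed)):
    names = [unicodedata.name(ch) for ch in text]
    print(label, names)
```

Both strings render identically, which is why the comparison in the question looks as if it should succeed.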

It seems you can convert one name to the other using unicodedata.normalize() with one of the forms NFC, NFKC, NFD, NFKD, so you have to test which one works for you.

In one direction you may need NFC or NFKC; in the other direction you may need NFD or NFKD.

You can also use unidecode to create text without non-ASCII characters: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg but this may not be so useful for you.

import unicodedata
from unidecode import unidecode

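# a and b look the same on screen; escapes keep the two forms distinct
a = 'fa\u0303\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'  # 'a' + combining tilde (decomposed)
b = 'f\u00e3\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'   # single precomposed character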

print('a:', a)
print('b:', b)

print('--- len ---')
print('len(a):', len(a))
print('len(b):', len(b))

print('--- encode ---')
print('a.encode:', a.encode('utf-8'))
print('b.encode:', b.encode('utf-8'))

print('--- a == normalize(b) ---')
print('NFC: ', a == unicodedata.normalize('NFC', b) )
print('NFKC:', a == unicodedata.normalize('NFKC', b) )
print('NFD: ', a == unicodedata.normalize('NFD', b) )
print('NFKD:', a == unicodedata.normalize('NFKD', b) )

print('--- b == normalize(a) ---')
print('NFC: ', b == unicodedata.normalize('NFC', a) )
print('NFKC:', b == unicodedata.normalize('NFKC', a) )
print('NFD: ', b == unicodedata.normalize('NFD', a) )
print('NFKD:', b == unicodedata.normalize('NFKD', a) )

print('--- unidecode ---')
print('a:', unidecode(a))
print('b:', unidecode(b))

Result:

a: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
b: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
--- len ---
len(a): 53
len(b): 52
--- encode ---
a.encode: b'fa\xcc\x83\xc2\xa9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
b.encode: b'f\xc3\xa3\xc2\xa9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
--- a == normalize(b) ---
NFC:  False
NFKC: False
NFD:  True
NFKD: True
--- b == normalize(a) ---
NFC:  True
NFKC: True
NFD:  False
NFKD: False
--- unidecode ---
a: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
b: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg

I have only met characters written as a combination of two other characters when I had to transfer files from macOS to another system.
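Applied to the original problem, the normalization idea could look like the sketch below. It assumes the file really is in the directory and that its name differs from the requested one only in Unicode normalization. The helper `find_normalized` is a hypothetical name, not part of any library.

```python
import os
import unicodedata


def find_normalized(directory, wanted_name, form='NFC'):
    """Return the real path of the entry in `directory` whose name equals
    `wanted_name` after both are normalized, or None if nothing matches."""
    target = unicodedata.normalize(form, wanted_name)
    for entry in os.listdir(directory):
        if unicodedata.normalize(form, entry) == target:
            # Return the name exactly as the file system reports it.
            return os.path.join(directory, entry)
    return None
```

The returned path uses the name as stored on disk, so it can be passed directly to whatever image loader is in use. Normalizing both sides to the same form means it should not matter which of the two forms the dataset index or the file system happens to use.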


Doc: unicodedata

Pythonsheet: Unicode

Stackoverflow: Normalizing Unicode

The two strings are not the same. Look:

> ciao='fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'.encode('utf-8')       
> bye='fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'.encode('utf-8')        
> ciao.hex() 
 '6661cc83c2a96c69782d64656c2d6d61726c655f6e752d6167656e6f75696c6c2d7375722d666f6e642d626c65752d313933372e6a7067'
> bye.hex()  
 '66c3a3c2a96c69782d64656c2d6d61726c655f6e752d6167656e6f75696c6c2d7375722d666f6e642d626c65752d313933372e6a7067'
> ciao2='fa'.encode('utf-8')
> bye2='f'.encode('utf-8')
> ciao2.hex()
 '6661'
> bye2.hex() 
 '66'

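It seems there is a hidden character after the 'f': the first string contains a plain 'a' (61) followed by a combining tilde (cc83), while the second contains the single precomposed character 'ã' (c3a3).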
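Instead of reading the hex dump by eye, the position of the mismatch can be found programmatically. The following is a rough sketch: `first_difference` and `describe` are hypothetical helper names, and the two strings are shortened stand-ins for the file names above, written with escapes.

```python
import unicodedata


def first_difference(s1, s2):
    """Return the index of the first mismatch, or None if the strings are equal."""
    for i, (c1, c2) in enumerate(zip(s1, s2)):
        if c1 != c2:
            return i
    if len(s1) != len(s2):
        # One string is a prefix of the other.
        return min(len(s1), len(s2))
    return None


def describe(text):
    """Render each character as 'U+XXXX NAME'."""
    return [f'U+{ord(ch):04X} {unicodedata.name(ch, "?")}' for ch in text]


ciao = 'fa\u0303\u00a9lix'   # decomposed, like the first string above
bye = 'f\u00e3\u00a9lix'     # precomposed, like the second string above

index = first_difference(ciao, bye)
print(index)
print(describe(ciao[index:index + 2]))
print(describe(bye[index:index + 1]))
```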
