Finding unknown non-English characters in a text file (Python)
Suppose we have a text file loaded like this:
file = open('my_file.txt',mode='r')
stg = file.read()
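This file contains some unknown non-English characters. These characters may take different forms, such as Á, î, Ç, etc. How can I extract these characters together with their location in the text file? The output should be a list of these characters and their positions (line numbers).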
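Assuming you want to find every character that is not an English letter, a digit, a punctuation mark, or a backslash, you can use the following code to get each such character and its position (an offset into the string). Note that whitespace, including newlines, is not in the excluded set, so it is reported as well.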
import re, string
[(match.start(0), match.group()) for match in re.finditer(f'[^a-zA-Z0-9{re.escape(string.punctuation)}]', stg)]
Usage example:
ÁbxcsdasîîîîîîîîîîîîÇÇadasda/.1.32131.!#@%$%&*^()|\}}"?>:{}?><<"
It returns:
[(0, 'Á'), (8, 'î'), (9, 'î'), (10, 'î'), (11, 'î'), (12, 'î'), (13, 'î'), (14, 'î'), (15, 'î'), (16, 'î'), (17, 'î'), (18, 'î'), (19, 'î'), (20, 'Ç'), (21, 'Ç')]
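The question asks for line numbers, while `match.start()` gives an offset into the whole string. Below is a minimal sketch of one way to convert that offset, assuming lines are separated by `\n`. The helper name `find_non_english` is hypothetical, and whitespace is added to the allowed set here, which the snippet above does not do.

```python
import re
import string

# Allowed set: ASCII letters, digits, punctuation, plus whitespace
# (the whitespace part is an assumption added for this sketch).
PATTERN = re.compile(f"[^a-zA-Z0-9{re.escape(string.punctuation)}\\s]")

def find_non_english(text):
    """Return (line, column, char) for each character outside the allowed set (1-based)."""
    results = []
    for match in PATTERN.finditer(text):
        pos = match.start()
        # Line number = number of newlines before the match, plus one.
        line = text.count("\n", 0, pos) + 1
        # rfind returns -1 when no newline precedes pos, which still yields a 1-based column.
        column = pos - text.rfind("\n", 0, pos)
        results.append((line, column, match.group()))
    return results

print(find_non_english("abc\ndÁf\nîx"))
```

Counting newlines for every match rescans the text each time, which is likely fine for small files; for large ones, iterating line by line may be the better choice.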
This is code I used in one of my projects. It does not flag punctuation or special characters.
with open('test.txt', mode='r') as file:
    lines = file.readlines()

def isEnglishChar(s):
    # A character counts as English if its UTF-8 bytes also decode as ASCII.
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

# Print each non-ASCII character together with its 1-based line number.
for index, value in enumerate(lines):
    for ch in value:
        if not isEnglishChar(ch):
            print(ch, index + 1)
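On Python 3.7 or later, `str.isascii()` performs the same check without the encode/decode round trip. Here is a sketch under that assumption; `non_ascii_in_lines` is a hypothetical helper name, and like the code above it treats ASCII punctuation as acceptable.

```python
def non_ascii_in_lines(lines):
    """Return (char, line_number) pairs for non-ASCII characters; line numbers are 1-based."""
    found = []
    for lineno, line in enumerate(lines, start=1):
        for ch in line:
            if not ch.isascii():  # str.isascii() requires Python 3.7+
                found.append((ch, lineno))
    return found

print(non_ascii_in_lines(["plain\n", "caf\u00e9 \u00ee\n"]))
```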
ASCII characters have Unicode code points between 0 and 127. Any character whose code point is greater than 127 is not ASCII.
with open(filename) as fp:
    for lineno, line in enumerate(fp, start=1):
        for ch in line:
            if ord(ch) > 127:
                print(lineno, ch)
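Because the characters are described as unknown, it may help to label each one once it is found. The sketch below pairs `ord()` with the standard-library `unicodedata.name()`. The function name `describe_non_ascii` and the `U+XXXX` formatting are illustrative choices, not part of the answer above.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (char, code point, Unicode name) for each distinct non-ASCII character."""
    described = []
    for ch in sorted(set(text)):
        if ord(ch) > 127:
            # The second argument is a fallback for characters that have no name.
            name = unicodedata.name(ch, "UNKNOWN")
            described.append((ch, f"U+{ord(ch):04X}", name))
    return described

for row in describe_non_ascii("Á, î, Ç"):
    print(row)
```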
with open("testfile.txt", 'w') as f_out:
    # The 4-space indentation inside the string is part of the text,
    # so it counts toward the character numbers reported below.
    test_text= '''
    This file contains some non-English unknown characters.
    These characters may have different forms like Á,
    î, Ç, etc. How can I extract these characters with their location in the text file
    '''
    f_out.write(test_text)

with open("testfile.txt") as fp:
    for lineno, line in enumerate(fp, start=1):
        ch_count = 0
        for ch in line:
            ch_count += 1
            if ord(ch) > 127:
                print(f'{lineno=}\tCharacter Number={ch_count}\t {ch=}')
Output
lineno=3 Character Number=52 ch='Á'
lineno=4 Character Number=5 ch='î'
lineno=4 Character Number=8 ch='Ç'
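The question asks for a list rather than printed lines. Here is a sketch of the same loop collecting tuples, using `enumerate` for the column instead of a manual counter. `locate_non_ascii` is a hypothetical name, and the UTF-8 encoding is an assumption about the file.

```python
import os
import tempfile

def locate_non_ascii(path, encoding="utf-8"):
    """Return (line, column, char) for each non-ASCII character in the file (1-based)."""
    results = []
    with open(path, encoding=encoding) as fp:
        for lineno, line in enumerate(fp, start=1):
            for col, ch in enumerate(line, start=1):
                if ord(ch) > 127:
                    results.append((lineno, col, ch))
    return results

# Demo on a throwaway file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("plain line\nlike Á,\nî, Ç, etc.\n")
    print(locate_non_ascii(sample))
```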