Finding unknown non-English characters in a text file (Python)

Question

Suppose we have loaded a text file:

file = open('my_file.txt',mode='r')
stg = file.read()

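This file contains some unknown non-English characters. They may take different forms, such as Á, î, Ç, etc. How can I extract these characters together with their location in the text file? The output should be a list of the characters and their positions (line numbers).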

Answer 1

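Assuming you want to find every character that is not an English letter, a digit, punctuation, or a backslash, you can use the following code to find each such character and its position. Note that the position is a character offset into the string, not a line number.

import re
import string

# re.escape keeps ']', '\', '^' and '-' literal inside the character class;
# \s excludes spaces and newlines so they are not reported as matches.
pattern = re.compile(rf'[^a-zA-Z0-9\s{re.escape(string.punctuation)}]')
[(match.start(), match.group()) for match in pattern.finditer(stg)]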

Example input:

ÁbxcsdasîîîîîîîîîîîîÇÇadasda/.1.32131.!#@%$%&*^()|\}}"?>:{}?><<"

It returns:

[(0, 'Á'), (8, 'î'), (9, 'î'), (10, 'î'), (11, 'î'), (12, 'î'), (13, 'î'), (14, 'î'), (15, 'î'), (16, 'î'), (17, 'î'), (18, 'î'), (19, 'î'), (20, 'Ç'), (21, 'Ç')]
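Since the question asks for line numbers rather than string offsets, the offsets can be converted by counting the newlines that precede each match. The following is only a sketch of that idea; the function name `find_with_line_numbers` and the sample string are made up for illustration and are not part of the original answer.

```python
import re
import string

def find_with_line_numbers(text):
    """Return (line_number, column, char) for each character outside the allowed set."""
    # Allowed set: English letters, digits, whitespace, ASCII punctuation.
    pattern = re.compile(rf'[^a-zA-Z0-9\s{re.escape(string.punctuation)}]')
    results = []
    for match in pattern.finditer(text):
        offset = match.start()
        # Line number = number of newlines before the match, plus one.
        line_number = text.count('\n', 0, offset) + 1
        # Column = 1-based distance from the previous newline
        # (rfind returns -1 on the first line, which gives offset + 1).
        column = offset - text.rfind('\n', 0, offset)
        results.append((line_number, column, match.group()))
    return results

sample = "abc\nÁ x\nfoo î"  # hypothetical sample text
print(find_with_line_numbers(sample))
# [(2, 1, 'Á'), (3, 5, 'î')]
```

Counting newlines for every match rescans the text each time, which is fine for small files; for large files, iterating line by line is likely the better choice.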

Answer 2

This is code I used in one of my projects. It does not flag ASCII punctuation or special characters.

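def isEnglishChar(s):
    # A character counts as "English" if its UTF-8 bytes are valid ASCII.
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

# Read all lines; the with block closes the file afterwards.
with open('test.txt', mode='r', encoding='utf-8') as file:
    lines = file.readlines()

for index, value in enumerate(lines):
    for ch in value:
        if not isEnglishChar(ch):
            print(ch, index + 1)  # the character and its 1-based line number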
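On Python 3.7 or later, the encode/decode round trip can probably be replaced by the built-in `str.isascii()`, and the results can be collected into a list as the question requests instead of being printed. A minimal sketch, with `non_ascii_by_line` as a hypothetical helper name and a made-up input:

```python
def non_ascii_by_line(lines):
    """Return (char, line_number) pairs for every non-ASCII character."""
    found = []
    for line_number, line in enumerate(lines, start=1):
        for ch in line:
            # str.isascii() is True only for code points 0..127.
            if not ch.isascii():
                found.append((ch, line_number))
    return found

# Hypothetical input standing in for file.readlines()
print(non_ascii_by_line(["plain text\n", "caf\u00e9 na\u00efve\n"]))
# [('é', 2), ('ï', 2)]
```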

Answer 3

ASCII characters have Unicode code points between 0 and 127. Any character with a code point greater than 127 is not ASCII.

filename = 'my_file.txt'  # path to the file being checked
with open(filename, encoding='utf-8') as fp:
    for lineno, line in enumerate(fp, start=1):
        for ch in line:
            if ord(ch) > 127:
                print(lineno, ch)
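Because the characters are described as unknown, it may also help to report what each one is. The sketch below applies the same `ord(ch) > 127` test and looks up each character's code point and official name with the standard `unicodedata` module; the function name `describe_non_ascii` and the report layout are illustrative assumptions.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (line_number, char, code_point, unicode_name) for non-ASCII characters."""
    report = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ord(ch) > 127:
                # The second argument is a fallback for characters without a name.
                name = unicodedata.name(ch, 'UNKNOWN')
                report.append((lineno, ch, f'U+{ord(ch):04X}', name))
    return report

for entry in describe_non_ascii("Á\nxî"):
    print(entry)
# (1, 'Á', 'U+00C1', 'LATIN CAPITAL LETTER A WITH ACUTE')
# (2, 'î', 'U+00EE', 'LATIN SMALL LETTER I WITH CIRCUMFLEX')
```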

Answer 4

# Write a small test file; the encoding is explicit so the accented characters round-trip.
with open("testfile.txt", 'w', encoding='utf-8') as f_out:
    test_text= '''
    This file contains some non-English unknown characters. 
    These characters may have different forms like Á, 
    î, Ç, etc. How can I extract these characters with their location in the text file
    '''
    f_out.write(test_text)

# Report the line number and 1-based position of every non-ASCII character.
with open("testfile.txt", encoding='utf-8') as fp:
    for lineno, line in enumerate(fp, start=1):
        for ch_count, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                print(f'{lineno=}\tCharacter Number={ch_count}\t {ch=}')

Output

lineno=3    Character Number=52  ch='Á'
lineno=4    Character Number=5   ch='î'
lineno=4    Character Number=8   ch='Ç'
