

How to read multiple files in Python with different encodings, utf-8 and ISO-8

I am still new to Python and I am trying to read multiple files in a loop to count the number of delimiters in each file. However, the files have different encodings, 'utf-8' and 'iso-'. I am not sure how to write the condition so that a file is read as utf-8 if it is utf-8 and as 'ISO-8' otherwise, because I receive this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 1707: invalid continuation byte

This is what my code looks like:

Thank you in advance!

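from collections import Counter

def number_delimiter_by_file(file):
    # Opening as UTF-8 raises UnicodeDecodeError on the files that are not UTF-8
    with open(file, encoding="utf8") as fp:
        a = []
        for line in fp:
            # Number of ";" delimiters on this line
            a.append(line.count(";"))
    # Pairs of (delimiters per line, number of lines with that count)
    b = list(Counter(a).items())
    return (file, b)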

# Apply the function to each file in my list of files

for file in all_filenames:
    print(number_delimiter_by_file(file))

There are at least two ways to get past this error.

  1. Ignore the actual encoding and always use ISO-8859-1 (aka Latin-1). This encoding maps every byte in the file to the Unicode character with the same value. Because of that, it can read any file without error, but if the file actually uses a different encoding, some characters may come out wrong. That is still enough to count semicolons ( ; ), because their ASCII code (and Unicode code point) is 0x3B.

  2. Use errors='ignore' when opening the file. In that mode, any bytes that cannot be decoded are silently dropped. You will probably end up with missing characters, but that is still enough to count semicolons.
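The two options above can be sketched roughly as follows. The function names are hypothetical and the functions return a Counter rather than the (file, list) tuple from the question. The third function is not part of this answer: it is one possible reading of the "utf-8 first, otherwise ISO-8859-1" condition from the question, assuming that any file that fails to decode as UTF-8 is Latin-1.

```python
from collections import Counter


def count_delimiters_latin1(path, delimiter=";"):
    """Option 1: decode every file as ISO-8859-1, which never fails."""
    with open(path, encoding="iso-8859-1") as fp:
        return Counter(line.count(delimiter) for line in fp)


def count_delimiters_ignore(path, delimiter=";"):
    """Option 2: decode as UTF-8 and silently drop undecodable bytes."""
    with open(path, encoding="utf-8", errors="ignore") as fp:
        return Counter(line.count(delimiter) for line in fp)


def count_delimiters_fallback(path, delimiter=";"):
    """Variant (an assumption, not from the answer): try UTF-8, then Latin-1."""
    try:
        with open(path, encoding="utf-8") as fp:
            return Counter(line.count(delimiter) for line in fp)
    except UnicodeDecodeError:
        # The partial UTF-8 result is discarded; the file is re-read as Latin-1.
        return count_delimiters_latin1(path, delimiter)
```

For counting semicolons all three should give the same result; they only differ in how the non-ASCII characters come out, which matters if the decoded text is used for anything else.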

Fortunately, all ISO-8859-x encodings are ASCII extensions, and so is UTF-8, so every ASCII character has the same encoding in all of them.
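Because the semicolon is a single ASCII byte in all of these encodings, one further possibility is to skip decoding entirely and count the byte in binary mode. This is only a sketch under that assumption (it would not work for UTF-16 files, for example), and count_delimiters_bytes is a hypothetical name.

```python
from collections import Counter

# The semicolon encodes to the same single byte in each of these encodings.
for name in ("ascii", "utf-8", "iso-8859-1", "iso-8859-15"):
    assert ";".encode(name) == b"\x3b"


def count_delimiters_bytes(path, delimiter=b";"):
    """Count the delimiter per line without decoding the file at all."""
    with open(path, "rb") as fp:
        # Iterating a binary file yields bytes lines split on b"\n".
        return Counter(line.count(delimiter) for line in fp)
```

In UTF-8, every byte of a multi-byte character is 0x80 or above, so a 0x3B byte is always a real semicolon and never part of another character.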
