Cloning GitHub repositories and importing their files throws a decode error

Question

I have a Python script that clones GitHub repositories, opens every file with a .py extension, and writes them all into a single file, so that I end up with one large file containing all the Python scripts.

import glob

languages = ['py', 'c']

for lang in languages:
    files = glob.glob(filename + '/**/*.' + lang, recursive=True)
    outfile = open(filename + '/' + lang + '.data', 'w')

    print('processing {} {} files'.format(len(files), lang))

    for infile in files:
        with open(infile) as datafile:
            for line in datafile:
                line = line.rstrip()
                if line:
                    outfile.write(line + '\n')

The error thrown is:

in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 7227: character maps to <undefined>

This is probably due to a file that is encoded with a different standard. Is there a way around this? My ultimate goal is to have one large Python file containing all the cloned .py files, and one .c file containing all the cloned C files. Can I skip the files that use a different encoding, or is there another way around this?

Answer 1

You can try specifying the encoding when opening your files by using codecs.open:

import codecs

outfile = codecs.open(filename + '/' + lang + '.data', 'w', encoding='utf8')

and

with codecs.open(infile, encoding='utf8') as datafile:

PS: You may want to read this article about dealing with Unicode: https://docs.python.org/2/howto/unicode.html

PPS: As you are using Python 3, you can simply add an encoding argument to your existing open calls without importing the codecs module:

outfile = open(filename + '/' + lang + '.data', 'w', encoding='utf8')

and

with open(infile, encoding='utf8') as datafile:
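Putting the two changes together, a minimal sketch of the whole loop might look like the following. The helper name concatenate_sources and the policy of skipping any file that is not valid UTF-8 are assumptions made for illustration; they are not part of the original script.

```python
import glob
import os


def concatenate_sources(root, lang):
    """Write the non-empty lines of every root/**/*.<lang> file to root/<lang>.data.

    Hypothetical helper: files that cannot be decoded as UTF-8 are skipped
    (an assumed policy) and their paths are returned to the caller.
    """
    pattern = os.path.join(root, '**', '*.' + lang)
    files = sorted(glob.glob(pattern, recursive=True))
    skipped = []
    out_path = os.path.join(root, lang + '.data')
    with open(out_path, 'w', encoding='utf8') as outfile:
        for path in files:
            try:
                # Read the whole file first so a bad file contributes nothing.
                with open(path, encoding='utf8') as datafile:
                    lines = [line.rstrip() for line in datafile]
            except UnicodeDecodeError:
                skipped.append(path)
                continue
            for line in lines:
                if line:
                    outfile.write(line + '\n')
    return skipped
```

Calling concatenate_sources(filename, 'py') would then produce py.data and return the list of files that were left out, which can be inspected separately.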

Answer 2

The file probably contains some data that is not valid UTF-8. You should check which encoding the files actually use. It will be harder to recover the text once the files are concatenated.
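One way to find out which files are affected before concatenating is to decode each file's raw bytes and record where decoding fails. This is only a sketch: find_undecodable is a hypothetical helper name, and it reports the first bad byte per file rather than guessing the real encoding.

```python
def find_undecodable(paths, encoding='utf8'):
    """Return (path, offset, byte_value) for each file that fails to decode.

    Hypothetical helper: only the first offending byte of each file is reported.
    """
    problems = []
    for path in paths:
        with open(path, 'rb') as f:
            data = f.read()
        try:
            data.decode(encoding)
        except UnicodeDecodeError as exc:
            # exc.start is the index of the first byte that could not be decoded.
            problems.append((path, exc.start, data[exc.start]))
    return problems
```

Passing the list returned by glob.glob to this function gives a short report of the files that need a closer look.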

Otherwise, try adding the parameter errors='surrogateescape' to the open calls, both for reading and writing. This should preserve the byte values of the input even where it is not valid UTF-8.
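As a rough illustration of that round trip, the sketch below copies a file through text mode while keeping undecodable bytes intact. The keyword accepted by the built-in open is errors; the function name copy_preserving_bytes and the use of newline='' (to keep line endings untouched) are assumptions added for the example.

```python
def copy_preserving_bytes(src, dst):
    """Copy src to dst line by line in text mode without losing invalid bytes.

    Hypothetical helper: bytes that are not valid UTF-8 are decoded to lone
    surrogates on read and turned back into the same bytes on write.
    """
    with open(src, encoding='utf8', errors='surrogateescape', newline='') as infile, \
         open(dst, 'w', encoding='utf8', errors='surrogateescape', newline='') as outfile:
        for line in infile:
            outfile.write(line)
```

The strings read this way contain surrogate code points, so they are best written straight back out with the same error handler rather than printed or encoded elsewhere.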

Answer 1
0 2018-01-27 16:44:40

Answer 2
0 2018-01-28 15:06:18
