Cloning GitHub repositories and importing them to a file throws a decoding error

I have a Python script that clones GitHub repositories, then opens every file with a .py extension and writes them all into a separate file, so that I end up with one large file containing all the Python scripts.

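import glob

languages = ['py', 'c']

# filename is the directory of the cloned repository (set earlier in the script)
for lang in languages:
    files = glob.glob(filename + '/**/*.' + lang, recursive=True)
    print('processing {} {} files'.format(len(files), lang))

    # One output file per language; the with block closes it when done
    with open(filename + '/' + lang + '.data', 'w') as outfile:
        for infile in files:
            with open(infile) as datafile:
                for line in datafile:
                    line = line.rstrip()  # drop trailing whitespace
                    if line:  # skip blank lines
                        outfile.write(line + '\n')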

The error thrown is:

in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 7227: character maps to <undefined>

This is probably due to a file that is encoded with a different standard. Is there a way around this? My ultimate goal is to have one large Python file with all the cloned .py files, and one .c file with all the cloned C files. Can I skip the files that use a different encoding, or is there another way around this?

You can try specifying the encoding when opening your files by using codecs.open:

import codecs

outfile = codecs.open(filename + '/' + lang + '.data', 'w', encoding='utf8')

and

with codecs.open(infile, encoding='utf8') as datafile:

PS: You may want to read this article about dealing with Unicode: https://docs.python.org/2/howto/unicode.html

PPS: As you are using Python 3, you can simply add an encoding argument to your existing open calls without importing the codecs module:

outfile = open(filename + '/' + lang + '.data', 'w', encoding='utf8')

and

with open(infile, encoding='utf8') as datafile:
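
If some of the cloned files are not valid UTF-8, passing encoding='utf8' will still raise UnicodeDecodeError for them. A minimal sketch of skipping such files follows; the function name concat_utf8_files and its return value are made up for illustration and are not part of the original script.

```python
def concat_utf8_files(paths, out_path):
    """Write the non-blank lines of every valid UTF-8 file in paths to out_path.

    Hypothetical helper: returns the paths that were skipped because they
    could not be decoded as UTF-8.
    """
    skipped = []
    with open(out_path, 'w', encoding='utf8') as outfile:
        for path in paths:
            try:
                # Read the whole file before writing anything, so a file that
                # fails halfway through contributes nothing to the output.
                with open(path, encoding='utf8') as datafile:
                    lines = datafile.readlines()
            except UnicodeDecodeError:
                skipped.append(path)
                continue
            for line in lines:
                line = line.rstrip()  # drop trailing whitespace
                if line:  # skip blank lines
                    outfile.write(line + '\n')
    return skipped
```

In the loop from the question this could be called as concat_utf8_files(files, filename + '/' + lang + '.data'), and the returned list shows which files need a closer look.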

The files probably contain some data that is not valid UTF-8. You should check which encoding they actually use; that will be harder to recover once the files are concatenated.

Otherwise, try adding the parameter errors='surrogateescape' to the open calls, both for reading and writing. This should preserve the byte values of the input, even where they are not valid UTF-8.
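
As a rough sketch of that approach (the helper name copy_preserving_bytes is invented here, and newline='' is an extra assumption added so that line endings are not translated):

```python
def copy_preserving_bytes(in_path, out_path):
    """Copy a text file line by line, keeping undecodable bytes intact.

    Hypothetical helper: bytes that are not valid UTF-8 are turned into lone
    surrogate code points on read and back into the same bytes on write.
    """
    with open(in_path, encoding='utf8', errors='surrogateescape',
              newline='') as src, \
         open(out_path, 'w', encoding='utf8', errors='surrogateescape',
              newline='') as dst:
        for line in src:
            dst.write(line)
```

Note that the output then contains the same invalid bytes as the input, so anything that later reads it as strict UTF-8 will hit the same error.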

