I have a Python script that clones GitHub repositories, then opens every file with a .py extension and writes their contents into a single file, so that I end up with one large file containing all the Python scripts.
languages = ['py', 'c']
for lang in languages:
    files = glob.glob(filename + '/**/*.' + lang, recursive=True)
    outfile = open(filename + '/' + lang + '.data', 'w')
    print('processing {} {} files'.format(len(files), lang))
    for infile in files:
        with open(infile) as datafile:
            for line in datafile:
                line = line.rstrip()
                if line:
                    outfile.write(line + '\n')
The error thrown is:
in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 7227: character maps to <undefined>
This is probably caused by a file that is encoded with a different standard. Is there a way around this? My ultimate goal is to have one large Python file containing all the cloned .py files, and one .c file containing all the cloned C files. Can I skip the files with a different encoding, or is there another way around this?
You can try specifying the encoding when opening your files, using codecs.open:
import codecs
outfile = codecs.open(filename + '/' + lang + '.data', 'w', encoding='utf8')
and
with codecs.open(infile, encoding='utf8') as datafile:
PS You may want to read this article about dealing with Unicode: https://docs.python.org/2/howto/unicode.html
PPS Since you are using Python 3, you can simply add an encoding argument to your existing open calls, without importing the codecs module:
outfile = open(filename + '/' + lang + '.data', 'w', encoding='utf8')
and
with open(infile, encoding='utf8') as datafile:
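Putting this together with the loop from the question, here is a minimal sketch that also covers the "can I avoid the differently encoded ones" part. The function name concat_sources and its parameters are hypothetical, not from the original script. It assumes UTF-8 is the encoding you want, and it skips any file that fails to decode instead of aborting the whole run:

```python
import glob
import os


def concat_sources(root, lang, encoding='utf8'):
    """Concatenate all *.<lang> files under root into <root>/<lang>.data.

    Returns the list of input files that were skipped because they
    could not be decoded with the given encoding.
    """
    files = glob.glob(os.path.join(root, '**', '*.' + lang), recursive=True)
    skipped = []
    out_path = os.path.join(root, lang + '.data')
    with open(out_path, 'w', encoding=encoding) as outfile:
        for infile in files:
            try:
                with open(infile, encoding=encoding) as datafile:
                    # Read the whole file before writing anything, so a
                    # decoding failure never leaves a half-copied file
                    # in the output.
                    lines = [line.rstrip() for line in datafile]
            except UnicodeDecodeError:
                skipped.append(infile)
                continue
            for line in lines:
                if line:  # drop blank lines, as in the question
                    outfile.write(line + '\n')
    return skipped
```

Calling concat_sources(filename, 'py') and concat_sources(filename, 'c') would then replace the body of the original for loop, and the returned list tells you which files to inspect by hand.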
The files probably contain some data that is not valid UTF-8. You should check which encoding they actually use; it will be harder to recover the text once the files are concatenated.
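One rough way to check is to try a short list of candidate encodings on the raw bytes. This is only a sketch: the helper name first_working_encoding and the candidate list are my own assumptions, and a successful decode does not prove the guess is right, only that it is not obviously wrong. (The 'charmap' in your traceback suggests your default encoding is a Windows code page such as cp1252, where byte 0x8f is undefined.)

```python
def first_working_encoding(path, candidates=('utf8', 'cp1252')):
    """Return the first candidate encoding that decodes the file, or None.

    Avoid putting 'latin-1' in the list unless it is last: it accepts
    every possible byte, so it always "works".
    """
    with open(path, 'rb') as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # not this one, try the next candidate
        return enc
    return None
```

Files for which this returns None are the ones worth looking at by hand before they get merged into the big file.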
Otherwise, try adding the parameter errors='surrogateescape' to the open calls, both for reading and writing. This should preserve the byte values of the input, even where it is not valid UTF-8.
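As a small illustration of that round trip (the function name copy_text_lossless is hypothetical), the sketch below copies a file line by line in text mode. Undecodable bytes are turned into lone surrogates on read and back into the same bytes on write; newline='' is an extra assumption on my part, added so that line endings are not translated either:

```python
def copy_text_lossless(src, dst, encoding='utf8'):
    """Copy src to dst in text mode without losing undecodable bytes."""
    with open(src, encoding=encoding, errors='surrogateescape',
              newline='') as fin, \
         open(dst, 'w', encoding=encoding, errors='surrogateescape',
              newline='') as fout:
        for line in fin:
            # Invalid bytes appear here as code points U+DC80..U+DCFF;
            # the writer converts them back to the original bytes.
            fout.write(line)
```

Note that the strings you get this way may contain lone surrogates, so printing them or encoding them without errors='surrogateescape' can still raise an error.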