Python: How can I read a file in x encoding and save it as utf-8
I have a huge dataset (about 8.5M records) in a ".csv" file (it uses pipes instead of commas). I have no idea what its encoding is; since I live in Mexico and the file contains accents (á, é, ...), I assume it's either latin or iso-8859-1.
When I try to import the file into a DataFrame using pandas:
bmc = pd.read_csv('file.csv', sep='|',
                  error_bad_lines=False, encoding='iso-8859-1')
It reads nothing:
ÿþF Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
If I don't specify iso-8859-1 or latin, I get the error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
So, to encode the file to utf-8, I open it in Notepad++, which can read huge files, manually delete the ÿþ at the start of the file, then change the encoding to utf-8 and save it as a new file.
Notepad++ says the file encoding is: UCS-2 LE BOM
The file size goes from 1.8Mb down to about 0.9Mb, and now I can open this file with pandas without problems.
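For reference, the ÿþ pair at the start of the file is the UTF-16 little-endian byte order mark (bytes 0xFF 0xFE), which is exactly what Notepad++ labels "UCS-2 LE BOM". A small sketch for detecting it up front (the function name `sniff_bom` is my own, not a library call):

```python
import codecs

def sniff_bom(path):
    """Guess an encoding from the file's leading bytes (BOM); None if no BOM."""
    with open(path, 'rb') as f:
        head = f.read(4)
    # Check the longer UTF-32 marks first: BOM_UTF32_LE begins with BOM_UTF16_LE.
    if head.startswith(codecs.BOM_UTF32_LE) or head.startswith(codecs.BOM_UTF32_BE):
        return 'utf-32'
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16'      # what Notepad++ reports as "UCS-2 LE BOM"
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    return None
```

Python's `utf-16` codec strips the BOM on decode, so the value returned here can be passed straight to `open()` or `pd.read_csv()`.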
So I think converting to utf-8 should be part of my preprocessing.
I used this solution: How to convert a file to utf-8 in Python? and created a function to convert several files:
import codecs

BLOCKSIZE = 1048576  # or some other desired size in bytes

def convert_utf8(sourceFileName, targetFileName, sourceEncoding='iso-8859-1'):
    with codecs.open(sourceFileName, "r", sourceEncoding) as sourceFile:
        with codecs.open(targetFileName, "w", "utf-8") as targetFile:
            while True:
                contents = sourceFile.read(BLOCKSIZE)
                if not contents:
                    break
                targetFile.write(contents)
Now the problem is that when the file is written, it adds a NULL character after every valid character. Let me show it in the editor:
This file, of course, doesn't work in pandas. So far I have solved my problem using Notepad++, but there must be a better way, one where I don't have to rely on other tools.
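The NULL characters come from decoding with the wrong codec: UTF-16-LE stores each ASCII character as two bytes, the second being 0x00, and iso-8859-1 happily maps every single byte, including 0x00, to a character. A minimal illustration:

```python
# UTF-16-LE encodes each ASCII character as two bytes; the high byte is 0x00.
data = 'abc'.encode('utf-16-le')     # b'a\x00b\x00c\x00'
# iso-8859-1 (latin-1) maps every byte to a character, so the 0x00
# bytes survive as NUL characters instead of being consumed.
wrong = data.decode('iso-8859-1')
print(repr(wrong))  # 'a\x00b\x00c\x00'
```

Decoding the same bytes with `'utf-16-le'` recovers `'abc'` cleanly, which is why the fix is to tell Python (or pandas) the real source encoding rather than post-process the output.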
To convert a file from one encoding to another in Python:
with open('file1.txt', encoding='utf16') as fin:
    with open('file2.txt', 'w', encoding='utf8') as fout:
        fout.write(fin.read())
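`fin.read()` loads the whole file into memory at once; for a very large export the same conversion can be streamed with `shutil.copyfileobj`, which copies in fixed-size chunks. A sketch (the file names and sample content here are hypothetical):

```python
import shutil

# Hypothetical sample input standing in for the real UTF-16 file.
with open('file1.txt', 'w', encoding='utf16') as f:
    f.write('canción|año\n')

# Stream-convert without holding the whole file in memory:
# copyfileobj reads and writes in fixed-size chunks.
with open('file1.txt', encoding='utf16') as fin, \
        open('file2.txt', 'w', encoding='utf8') as fout:
    shutil.copyfileobj(fin, fout)
```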
But in your case, as Mark Ransom pointed out in a comment, just open it with the appropriate encoding:
bmc = pd.read_csv('file.csv', sep='|', error_bad_lines=False, encoding='utf16')
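Note that `error_bad_lines` was deprecated and later removed in favour of `on_bad_lines='skip'` in newer pandas versions. With 8.5M records it may also be worth reading in chunks to bound memory use; a sketch, with a small hypothetical demo file standing in for the real file.csv:

```python
import pandas as pd

# Hypothetical demo file standing in for the real pipe-delimited UTF-16 export.
with open('demo.csv', 'w', encoding='utf-16') as f:
    f.write('F|col1|col2\n1|x|y\n2|z|w\n')

# Stream the file in chunks; pandas handles the UTF-16 BOM itself
# once encoding='utf-16' is given, so no manual cleanup is needed.
chunks = pd.read_csv('demo.csv', sep='|', encoding='utf-16',
                     chunksize=100_000, on_bad_lines='skip')
bmc = pd.concat(chunks, ignore_index=True)
print(bmc.shape)  # (2, 3)
```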