
Python/Pandas: how to read a CSV in cp1252 with a first row to delete?

Solution:

See the answer: the file was not encoded in CP1252 but in UTF-16. The solution code is:

import pandas as pd

df = pd.read_csv('my_file.csv', sep='\t', header=1, encoding='utf-16')

This also works with encoding='utf-16-le'.


Update: the first 3 lines of the file, read as bytes:

In : import itertools 
...:  print(list(itertools.islice(open('file_T.csv', 'rb'), 3)))

Out : [b'\xff\xfe"\x00D\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00 \x00a\x00u\x00 \x00m\x00e\x00r\x00c\x00r\x00e\x00d\x00i\x00 \x000\x005\x00 \x00j\x00u\x00i\x00n\x00 \x002\x000\x001\x009\x00\n', b'\x00"\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\t\x00\n', b'\x00C\x00o\x00d\x00e\x00 \x00M\x00C\x00U\x00\t\x00I\x00m\x00m\x00a\x00t\x00r\x00i\x00c\x00u\x00l\x00a\x00t\x00i\x00o\x00n\x00\t\x00D\x00a\x00t\x00e\x00\t\x00h\x00e\x00u\x00r\x00e\x00\t\x00V\x00i\x00t\x00e\x00s\x00s\x00e\x00\t\x00L\x00a\x00t\x00i\x00t\x00u\x00d\x00e\x00\t\x00L\x00o\x00n\x00g\x00i\x00t\x00u\x00d\x00e\x00\t\x00T\x00y\x00p\x00e\x00\t\x00E\x00n\x00t\x00r\x00\xe9\x00e\x00\t\x00E\x00t\x00a\x00t\x00\n']

I'm working with CSV files whose raw form looks like this:

[Image: screenshot of file_T]

The problem is that the file has two features that cause trouble together:

  • the first row is not the header

  • there is an accent in the header "Entrée", which raises a UnicodeDecodeError if I don't specify the encoding cp1252

I'm using Python 3.X and pandas to deal with these files.

But when I try to read the file with this code:

import pandas as pd 

df_T = pd.read_csv('file_T.csv', header=1, sep=';', encoding = 'cp1252')
print(df_T)

I get the following output (same with header=0):

[Image: read_csv error for file_T]

In order to read the CSV correctly, I need to:

  • get rid of the accent
  • and ignore / delete the first row (which I don't need anyway).

How can I achieve that?

PS: I know I could write a VBA program or something similar for this, but I'd rather not. I'm interested in including it in my Python program, or in knowing for sure that it is not possible.

CP1252 is the plain old Latin codepage, which does support all Western European accents. There wouldn't be any garbled characters if the file had been written in that codepage.

The image of the data you posted is just that: an image. It says nothing about the file's raw format. Is it a UTF8 file? UTF16? It's definitely not CP1252.

Neither UTF8 nor CP1252 would produce NaNs either. A file saved in any single-byte codepage would at least have its numeric digits read correctly, which means this file is saved in a multi-byte encoding.
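To see why a multi-byte file falls apart under a single-byte codepage, here is a minimal sketch. The sample string is made up for illustration and is not taken from the actual file; it only assumes the file is little-endian UTF-16 with a BOM.

```python
import codecs

# Hypothetical sample text, not taken from the real file.
text = "Entrée\t05"

# Little-endian UTF-16 with an explicit BOM: two bytes per character here.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Decoding as CP1252 does not fail, but every character is followed by a
# NUL, and the BOM shows up as two stray characters at the start.
wrong = raw.decode("cp1252")

# Decoding as UTF-16 consumes the BOM and restores the original text.
right = raw.decode("utf-16")

print(repr(wrong))
print(repr(right))
```

The interleaved NUL characters are what keep a parser from recognising the digits and separators.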

The two strange characters at the start look like a Byte Order Mark. If you check Wikipedia's BOM entry you'll see that ÿþ is the BOM for UTF16LE.
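If you'd rather check for a BOM than guess, something along these lines could work. It is only a sketch: sniff_bom and sniff_file are hypothetical helper names, and falling back to cp1252 when no BOM is found is an assumption, not a rule.

```python
import codecs

# Order matters: the UTF-32 LE BOM starts with the same two bytes as the
# UTF-16 LE BOM, so the longer marks are tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(head, default="cp1252"):
    """Return a codec name based on the BOM in `head`, else `default`."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


def sniff_file(path, default="cp1252"):
    """Read the first four bytes of `path` and look for a BOM."""
    with open(path, "rb") as f:
        return sniff_bom(f.read(4), default)


# The first bytes shown in the question's update.
print(sniff_bom(b'\xff\xfe"\x00D\x00u\x00'))
```

A file without a BOM can still be UTF-16, so treat the fallback as a guess.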

Try using utf-16 or utf-16-le instead of cp1252.
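As a rough, self-contained illustration, the sketch below writes a tiny tab-separated file in UTF-16 and reads it back while skipping the title row. The sample content and the read_report helper are made up for the example; only the read_csv arguments mirror the ones discussed here.

```python
import os
import tempfile

import pandas as pd


def read_report(path, encoding="utf-16"):
    """Read a tab-separated file whose real header is on the second row."""
    return pd.read_csv(path, sep="\t", header=1, encoding=encoding)


# Hypothetical sample: a title row, then the header, then one data row.
sample = "Rapport du 05 juin 2019\t\t\nCode MCU\tEntrée\tEtat\nA1\t1\tOK\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    # Python's "utf-16" codec writes a BOM at the start of the file.
    with open(path, "w", encoding="utf-16", newline="") as f:
        f.write(sample)
    df = read_report(path)

print(list(df.columns))
print(df)
```

One difference between the two codecs: utf-16 consumes the BOM, while utf-16-le leaves it in place as a U+FEFF character at the start of the first row. With header=1 that row is discarded anyway, which would explain why both work for this file.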
