pandas read_csv: header/skiprows not working

All-

First time asking a question here, apologies if the format is bad, please let me know how to improve my question.

I am seeking a better understanding of the header and skiprows arguments of the pandas.read_csv() function.

Here is an example of the raw data I am trying to read in python:

MiniSonde 5 43656
"Log File Name : lwrhyp_deploy_20170104"
"Setup Date (MMDDYY) : 010417"
"Setup Time (HHMMSS) : 114539"
"Starting Date (MMDDYY) : 010417"
"Starting Time (HHMMSS) : 140000"
"Stopping Date (MMDDYY) : 123169"
"Stopping Time (HHMMSS) : 235959"
"Interval (HHMMSS) : 010000"
"Sensor warmup (HHMMSS) : 000100"
"Circltr warmup (HHMMSS) : 000030"


"Date","Time","","Temp","","SpCond","","Sal","","Dep25","","TDG","","TDG","","LDO%","","LDO","","IBatt",""
"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","meters","","mmHg","","psia","","Sat","","mg/l","","Volts",""

01/04/17,14:00:00,"",7.97,"",.0691,"",.02,"",.75,"",735,"",14.22,"",52.7,"",6.15,"",11.4,""
01/04/17,15:00:00,"",7.9,"",.0692,"",.02,"",.76,"",736,"",14.23,"",52.8,"",6.17,"",11.4,""
01/04/17,16:00:00,"",7.89,"",.0694,"",.02,"",.77,"",736,"",14.23,"",52.3,"",6.12,"",11.4,""
01/04/17,17:00:00,"",7.88,"",.0699,"",.02,"",.78,"",735,"",14.21,"",51.8,"",6.06,"",11.4,""
01/04/17,18:00:00,"",7.85,"",.0699,"",.02,"",.78,"",733,"",14.18,"",51.3,"",6.01,"",11.4,""
01/04/17,19:00:00,"",7.83,"",.0706,"",.02,"",.78,"",731,"",14.14,"",51.3,"",6.01,"",11.4,""
01/04/17,20:00:00,"",7.81,"",.0706,"",.02,"",.79,"",730,"",14.12,"",51.1,"",5.99,"",11.4,""
01/04/17,21:00:00,"",7.81,"",.0699,"",.02,"",.79,"",730,"",14.11,"",50.8,"",5.95,"",11.4,""
01/04/17,22:00:00,"",7.76,"",.0702,"",.02,"",.8,"",729,"",14.1,"",50.5,"",5.92,"",11.3,""
01/04/17,23:00:00,"",7.76,"",.0704,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.93,"",11.3,""
01/05/17,00:00:00,"",7.76,"",.07,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.92,"",11.3,""

I am trying to use either the row beginning with "Date" or the row beginning with "MMDDYY" as my header row. When I open the raw data in a text editor, the row that corresponds to "Date" is row 14, which would be row 13 in zero-indexed python land.

I used the following code, thinking that it should skip the first 12 rows and begin reading data on row 13:

test = pd.read_csv(filepath, skiprows=12, skip_blank_lines=True)

but that produces the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

After a lot of trial-and-error fiddling, I found that the following code produced the type of result I am after; however, I do not understand why it works:

test = pd.read_csv(filepath, skiprows=[14], header=11, skip_blank_lines=True)

I do not understand how read_csv is counting the number of rows. Am I wrong in thinking that the header row is on line 13 rather than line 11? The code only works if skiprows=[14], why is that?

On a side note, is there a way to prevent the blank columns that are present in the raw data from being read into the dataframe?

First, skiprows isn't doing what you think it is here. When you give it a list as input, it skips exactly those rows (by zero-based line number) when parsing the file. For what you want, just use header instead.
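To make the two forms of skiprows concrete, here is a minimal sketch on a made-up five-line file (the sample text and variable names are illustrative, not from the question). An integer skips that many lines from the top; a list skips only the listed zero-based line numbers.

```python
import io

import pandas as pd

# Made-up sample: two junk lines, a header, two data rows.
text = "junk A\njunk B\nx,y\n1,2\n3,4\n"

# Integer form: skip the first 2 lines; the next line becomes the header.
skip_count = pd.read_csv(io.StringIO(text), skiprows=2)

# List form: skip only the line at zero-based index 3 (the "1,2" row).
# The junk lines are still in the file, so header=2 points at "x,y".
skip_listed = pd.read_csv(io.StringIO(text), skiprows=[3], header=2)

print(skip_count)
print(skip_listed)
```

In the first call both data rows survive; in the second only the "3,4" row remains, because the list removed a data row rather than the leading junk.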

Second, pandas zero-indexes the file rows.

Third, when you have skip_blank_lines=True, it appears to reindex the rows of your file before considering the header value. So in your example, it will not index the blank lines 11 and 12 before your header (or the one after your headers). Remembering that pandas zero-indexes the file rows, we can see how header=11 lines up with the header row:

line : content
0:MiniSonde 5 43656
1:"Log File Name : lwrhyp_deploy_20170104"
2:"Setup Date (MMDDYY) : 010417"
3:"Setup Time (HHMMSS) : 114539"
4:"Starting Date (MMDDYY) : 010417"
5:"Starting Time (HHMMSS) : 140000"
6:"Stopping Date (MMDDYY) : 123169"
7:"Stopping Time (HHMMSS) : 235959"
8:"Interval (HHMMSS) : 010000"
9:"Sensor warmup (HHMMSS) : 000100"
10:"Circltr warmup (HHMMSS) : 000030"


11:"Date","Time","","Temp","","SpCond","","Sal","","Dep25","","TDG","","TDG","","LDO%","","LDO","","IBatt",""
12:"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","meters","","mmHg","","psia","","Sat","","mg/l","","Volts",""

13:01/04/17,14:00:00,"",7.97,"",.0691,"",.02,"",.75,"",735,"",14.22,"",52.7,"",6.15,"",11.4,""
14:01/04/17,15:00:00,"",7.9,"",.0692,"",.02,"",.76,"",736,"",14.23,"",52.8,"",6.17,"",11.4,""
15:01/04/17,16:00:00,"",7.89,"",.0694,"",.02,"",.77,"",736,"",14.23,"",52.3,"",6.12,"",11.4,""
16:01/04/17,17:00:00,"",7.88,"",.0699,"",.02,"",.78,"",735,"",14.21,"",51.8,"",6.06,"",11.4,""
17:01/04/17,18:00:00,"",7.85,"",.0699,"",.02,"",.78,"",733,"",14.18,"",51.3,"",6.01,"",11.4,""
18:01/04/17,19:00:00,"",7.83,"",.0706,"",.02,"",.78,"",731,"",14.14,"",51.3,"",6.01,"",11.4,""
19:01/04/17,20:00:00,"",7.81,"",.0706,"",.02,"",.79,"",730,"",14.12,"",51.1,"",5.99,"",11.4,""
20:01/04/17,21:00:00,"",7.81,"",.0699,"",.02,"",.79,"",730,"",14.11,"",50.8,"",5.95,"",11.4,""
21:01/04/17,22:00:00,"",7.76,"",.0702,"",.02,"",.8,"",729,"",14.1,"",50.5,"",5.92,"",11.3,""
22:01/04/17,23:00:00,"",7.76,"",.0704,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.93,"",11.3,""
23:01/05/17,00:00:00,"",7.76,"",.07,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.92,"",11.3,""
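
The counting can be checked on a shortened stand-in for the logger file. This is a sketch under assumptions: the sample is abbreviated from the data in the question, "degC" replaces the original unit label to keep the text plain ASCII, and the variable names are made up. It suggests that header counts non-blank lines while a skiprows list uses raw zero-based file line numbers, which would explain why skiprows=[14] removed the "MMDDYY" units row in the original call. The last step addresses the side question by dropping columns that are entirely empty.

```python
import io

import pandas as pd

# Raw 0-based line numbers: 0-2 metadata, 3-4 blank, 5 header,
# 6 units row, 7 blank, 8-9 data.
raw = '''MiniSonde 5 43656
"Log File Name : demo"
"Interval (HHMMSS) : 010000"


"Date","Time","","Temp",""
"MMDDYY","HHMMSS","","degC",""

01/04/17,14:00:00,"",7.97,""
01/04/17,15:00:00,"",7.9,""
'''

df = pd.read_csv(
    io.StringIO(raw),
    header=3,        # 4th non-blank line: blank lines are not counted
    skiprows=[6],    # raw file line 6: the units row
    skip_blank_lines=True,
)

# Drop columns with no values at all (the "" fields between readings).
clean = df.dropna(axis="columns", how="all")

print(list(clean.columns))
print(clean)
```

Note that dropna(how="all") removes any column with no values, so a real sensor column that happened to be completely empty would go too. Separately, the UnicodeDecodeError in the question names byte 0xf8, which is the Latin-1 encoding of the "ø" in the units row; if that row needs to be read, passing an explicit encoding such as encoding="latin-1" to read_csv may help, though the file's actual encoding is an assumption here.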
