Pandas read_csv: header/skiprows not working

Question

All-

First time asking a question here, so apologies if the format is bad; please let me know how to improve my question.

I am seeking a better understanding of the header and skiprows arguments of the pandas.read_csv() function.

Here is an example of the raw data I am trying to read in Python:

MiniSonde 5 43656
"Log File Name : lwrhyp_deploy_20170104"
"Setup Date (MMDDYY) : 010417"
"Setup Time (HHMMSS) : 114539"
"Starting Date (MMDDYY) : 010417"
"Starting Time (HHMMSS) : 140000"
"Stopping Date (MMDDYY) : 123169"
"Stopping Time (HHMMSS) : 235959"
"Interval (HHMMSS) : 010000"
"Sensor warmup (HHMMSS) : 000100"
"Circltr warmup (HHMMSS) : 000030"


"Date","Time","","Temp","","SpCond","","Sal","","Dep25","","TDG","","TDG","","LDO%","","LDO","","IBatt",""
"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","meters","","mmHg","","psia","","Sat","","mg/l","","Volts",""

01/04/17,14:00:00,"",7.97,"",.0691,"",.02,"",.75,"",735,"",14.22,"",52.7,"",6.15,"",11.4,""
01/04/17,15:00:00,"",7.9,"",.0692,"",.02,"",.76,"",736,"",14.23,"",52.8,"",6.17,"",11.4,""
01/04/17,16:00:00,"",7.89,"",.0694,"",.02,"",.77,"",736,"",14.23,"",52.3,"",6.12,"",11.4,""
01/04/17,17:00:00,"",7.88,"",.0699,"",.02,"",.78,"",735,"",14.21,"",51.8,"",6.06,"",11.4,""
01/04/17,18:00:00,"",7.85,"",.0699,"",.02,"",.78,"",733,"",14.18,"",51.3,"",6.01,"",11.4,""
01/04/17,19:00:00,"",7.83,"",.0706,"",.02,"",.78,"",731,"",14.14,"",51.3,"",6.01,"",11.4,""
01/04/17,20:00:00,"",7.81,"",.0706,"",.02,"",.79,"",730,"",14.12,"",51.1,"",5.99,"",11.4,""
01/04/17,21:00:00,"",7.81,"",.0699,"",.02,"",.79,"",730,"",14.11,"",50.8,"",5.95,"",11.4,""
01/04/17,22:00:00,"",7.76,"",.0702,"",.02,"",.8,"",729,"",14.1,"",50.5,"",5.92,"",11.3,""
01/04/17,23:00:00,"",7.76,"",.0704,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.93,"",11.3,""
01/05/17,00:00:00,"",7.76,"",.07,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.92,"",11.3,""

I am trying to use either the row beginning with "Date" or the row beginning with "MMDDYY" as my header row. When I open the raw data in a text editor, the row that begins with "Date" is line 14, which would be row 13 in zero-indexed Python land.

I used the following code, thinking that it should skip the first 12 rows and begin reading data on row 13:

test = pd.read_csv(filepath, skiprows=12, skip_blank_lines=True)

but that produces the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte
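A note on the message itself: byte 0xf8 is how Latin-1 encodes "ø", and the units row contains "øC". The sketch below assumes the file was written in Latin-1 or a compatible single-byte encoding (the post does not confirm this) and reproduces the same message from that one field:

```python
# Assumption: the logger wrote the file in Latin-1 (ISO-8859-1); the post
# does not state the encoding.
units_field = "øC".encode("latin-1")
print(units_field)  # the raw bytes of the field as stored on disk

message = ""
try:
    # read_csv decodes as UTF-8 by default, which cannot start a character
    # with the byte 0xf8.
    units_field.decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with the assumed encoding recovers the original text.
print(units_field.decode("latin-1"))
```

If that assumption holds, passing encoding="latin-1" to read_csv would be the usual remedy.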

After a lot of trial and error, I found that the following code produced the type of result I am after; however, I do not understand why it works:

test = pd.read_csv(filepath, skiprows=[14], header=11, skip_blank_lines=True)

I do not understand how read_csv is counting the number of rows. Am I wrong in thinking that the header row is on line 13 rather than on line 11? The code only works if skiprows=[14]; why is that?

On a side note, is there a way to prevent the blank columns that are present in the raw data from being read into the dataframe?

Answer 1

First, skiprows isn't doing what you think it is here. When you give it a list as input, it skips exactly those rows when parsing the file. For what you want, just use header instead.

Second, pandas zero-indexes the file rows.

Third, when you have skip_blank_lines=True, it appears to reindex the rows of your file before considering the header value. So in your example, it will not index the blank lines 11 and 12 before your header (or the one after your headers). Remembering that pandas zero-indexes the file rows, we can see how header=11 lines up with the header row:

line : content
0:MiniSonde 5 43656
1:"Log File Name : lwrhyp_deploy_20170104"
2:"Setup Date (MMDDYY) : 010417"
3:"Setup Time (HHMMSS) : 114539"
4:"Starting Date (MMDDYY) : 010417"
5:"Starting Time (HHMMSS) : 140000"
6:"Stopping Date (MMDDYY) : 123169"
7:"Stopping Time (HHMMSS) : 235959"
8:"Interval (HHMMSS) : 010000"
9:"Sensor warmup (HHMMSS) : 000100"
10:"Circltr warmup (HHMMSS) : 000030"


11:"Date","Time","","Temp","","SpCond","","Sal","","Dep25","","TDG","","TDG","","LDO%","","LDO","","IBatt",""
12:"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","meters","","mmHg","","psia","","Sat","","mg/l","","Volts",""

13:01/04/17,14:00:00,"",7.97,"",.0691,"",.02,"",.75,"",735,"",14.22,"",52.7,"",6.15,"",11.4,""
14:01/04/17,15:00:00,"",7.9,"",.0692,"",.02,"",.76,"",736,"",14.23,"",52.8,"",6.17,"",11.4,""
15:01/04/17,16:00:00,"",7.89,"",.0694,"",.02,"",.77,"",736,"",14.23,"",52.3,"",6.12,"",11.4,""
16:01/04/17,17:00:00,"",7.88,"",.0699,"",.02,"",.78,"",735,"",14.21,"",51.8,"",6.06,"",11.4,""
17:01/04/17,18:00:00,"",7.85,"",.0699,"",.02,"",.78,"",733,"",14.18,"",51.3,"",6.01,"",11.4,""
18:01/04/17,19:00:00,"",7.83,"",.0706,"",.02,"",.78,"",731,"",14.14,"",51.3,"",6.01,"",11.4,""
19:01/04/17,20:00:00,"",7.81,"",.0706,"",.02,"",.79,"",730,"",14.12,"",51.1,"",5.99,"",11.4,""
20:01/04/17,21:00:00,"",7.81,"",.0699,"",.02,"",.79,"",730,"",14.11,"",50.8,"",5.95,"",11.4,""
21:01/04/17,22:00:00,"",7.76,"",.0702,"",.02,"",.8,"",729,"",14.1,"",50.5,"",5.92,"",11.3,""
22:01/04/17,23:00:00,"",7.76,"",.0704,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.93,"",11.3,""
23:01/05/17,00:00:00,"",7.76,"",.07,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.92,"",11.3,""
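The numbering above is the one header uses. As far as I understand the pandas I/O documentation, skiprows instead counts raw file lines with blank lines included, so skiprows=[14] in the question points at the "MMDDYY" units row. That is probably also why the UnicodeDecodeError went away, since the row containing "øC" is never parsed. Here is a minimal sketch on a hypothetical, shortened version of the file (3 metadata lines instead of 11, so the numbers become header=3 and skiprows=[6]); the Latin-1 encoding is an assumption, not something the post confirms:

```python
import io

import pandas as pd

# Hypothetical miniature of the logger file, not the real data.
SAMPLE = (
    'Logger 1\n'                         # raw line 0, non-blank row 0
    '"Setup Date (MMDDYY) : 010417"\n'   # raw line 1, non-blank row 1
    '"Interval (HHMMSS) : 010000"\n'     # raw line 2, non-blank row 2
    '\n'                                 # raw line 3 (blank)
    '\n'                                 # raw line 4 (blank)
    '"Date","Time","","Temp",""\n'       # raw line 5, non-blank row 3
    '"MMDDYY","HHMMSS","","øC",""\n'     # raw line 6, non-blank row 4
    '\n'                                 # raw line 7 (blank)
    '01/04/17,14:00:00,"",7.97,""\n'     # raw line 8
    '01/04/17,15:00:00,"",7.9,""\n'      # raw line 9
)

# Assumption: the file on disk is Latin-1, so "ø" is the single byte 0xf8.
buffer = io.BytesIO(SAMPLE.encode("latin-1"))

df = pd.read_csv(
    buffer,
    header=3,        # counted over non-blank rows only
    skiprows=[6],    # counted over raw file lines, blank ones included
    skip_blank_lines=True,
    encoding="latin-1",
)

# The empty "" fields are read as NaN, so the filler columns are entirely
# missing and can be dropped in one step.
cleaned = df.dropna(axis="columns", how="all")

print(list(df.columns))
print(list(cleaned.columns))
```

The dropna call is one way to handle the side question about blank columns; it only removes columns in which every value is missing, so a real column that happens to have a few gaps is kept.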

0 2017-07-24 22:04:25