pandas read_csv: header/skiprows not working

Question

All-

This is my first time asking a question here, so apologies if the format is bad. Please let me know how to improve my question.

I am seeking a better understanding of the header and skiprows arguments of the pandas.read_csv() function.

Here is an example of the raw data I am trying to read in Python:

MiniSonde 5 43656
"Log File Name : lwrhyp_deploy_20170104"
"Setup Date (MMDDYY) : 010417"
"Setup Time (HHMMSS) : 114539"
"Starting Date (MMDDYY) : 010417"
"Starting Time (HHMMSS) : 140000"
"Stopping Date (MMDDYY) : 123169"
"Stopping Time (HHMMSS) : 235959"
"Interval (HHMMSS) : 010000"
"Sensor warmup (HHMMSS) : 000100"
"Circltr warmup (HHMMSS) : 000030"


"Date","Time","","Temp","","SpCond","","Sal","","Dep25","","TDG","","TDG","","LDO%","","LDO","","IBatt",""
"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","meters","","mmHg","","psia","","Sat","","mg/l","","Volts",""

01/04/17,14:00:00,"",7.97,"",.0691,"",.02,"",.75,"",735,"",14.22,"",52.7,"",6.15,"",11.4,""
01/04/17,15:00:00,"",7.9,"",.0692,"",.02,"",.76,"",736,"",14.23,"",52.8,"",6.17,"",11.4,""
01/04/17,16:00:00,"",7.89,"",.0694,"",.02,"",.77,"",736,"",14.23,"",52.3,"",6.12,"",11.4,""
01/04/17,17:00:00,"",7.88,"",.0699,"",.02,"",.78,"",735,"",14.21,"",51.8,"",6.06,"",11.4,""
01/04/17,18:00:00,"",7.85,"",.0699,"",.02,"",.78,"",733,"",14.18,"",51.3,"",6.01,"",11.4,""
01/04/17,19:00:00,"",7.83,"",.0706,"",.02,"",.78,"",731,"",14.14,"",51.3,"",6.01,"",11.4,""
01/04/17,20:00:00,"",7.81,"",.0706,"",.02,"",.79,"",730,"",14.12,"",51.1,"",5.99,"",11.4,""
01/04/17,21:00:00,"",7.81,"",.0699,"",.02,"",.79,"",730,"",14.11,"",50.8,"",5.95,"",11.4,""
01/04/17,22:00:00,"",7.76,"",.0702,"",.02,"",.8,"",729,"",14.1,"",50.5,"",5.92,"",11.3,""
01/04/17,23:00:00,"",7.76,"",.0704,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.93,"",11.3,""
01/05/17,00:00:00,"",7.76,"",.07,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.92,"",11.3,""

I am trying to use either the row beginning with "Date" or the row beginning with "MMDDYY" as my header row. When I open the raw data in a text editor, the row that corresponds to "Date" is row 14, which would be row 13 in zero-indexed Python land.

I used the following code thinking that it should skip the first 12 rows and begin reading data on row 13:

test = pd.read_csv(filepath, skiprows=12, skip_blank_lines=True)

but that produces the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

After a lot of trial-and-error fiddling, I found that the following code produced the type of result I am after. However, I do not understand why it works:

test = pd.read_csv(filepath, skiprows=[14], header=11, skip_blank_lines=True)

I do not understand how read_csv is counting the rows. Am I wrong in thinking that the header row is on line 13 rather than line 11? The code only works if skiprows=[14]; why is that?

On a side note, is there a way to prevent the blank columns that are present in the raw data from being read into the dataframe?

Answer 1

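First, skiprows isn't doing what you think it is here. When you give it a list as input, it skips exactly those rows when parsing the file, counting every physical line (blank lines included) from zero. That is why skiprows=[14] removes the "MMDDYY" units row: it is line 14 of the raw file. To pick the header row, just use header instead.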

Second, pandas zero-indexes the file rows.

Third, when you have skip_blank_lines=True, it appears to reindex the rows of your file before considering the header value. So in your example, the two blank lines before your header (file lines 11 and 12) and the one after your headers are not counted. Remembering that pandas zero-indexes the file rows, we can see how header=11 lines up on the header:

line : content
0:MiniSonde 5 43656
1:"Log File Name : lwrhyp_deploy_20170104"
2:"Setup Date (MMDDYY) : 010417"
3:"Setup Time (HHMMSS) : 114539"
4:"Starting Date (MMDDYY) : 010417"
5:"Starting Time (HHMMSS) : 140000"
6:"Stopping Date (MMDDYY) : 123169"
7:"Stopping Time (HHMMSS) : 235959"
8:"Interval (HHMMSS) : 010000"
9:"Sensor warmup (HHMMSS) : 000100"
10:"Circltr warmup (HHMMSS) : 000030"


11:"Date","Time","","Temp","","SpCond","","Sal","","Dep25","","TDG","","TDG","","LDO%","","LDO","","IBatt",""
12:"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","meters","","mmHg","","psia","","Sat","","mg/l","","Volts",""

13:01/04/17,14:00:00,"",7.97,"",.0691,"",.02,"",.75,"",735,"",14.22,"",52.7,"",6.15,"",11.4,""
14:01/04/17,15:00:00,"",7.9,"",.0692,"",.02,"",.76,"",736,"",14.23,"",52.8,"",6.17,"",11.4,""
15:01/04/17,16:00:00,"",7.89,"",.0694,"",.02,"",.77,"",736,"",14.23,"",52.3,"",6.12,"",11.4,""
16:01/04/17,17:00:00,"",7.88,"",.0699,"",.02,"",.78,"",735,"",14.21,"",51.8,"",6.06,"",11.4,""
17:01/04/17,18:00:00,"",7.85,"",.0699,"",.02,"",.78,"",733,"",14.18,"",51.3,"",6.01,"",11.4,""
18:01/04/17,19:00:00,"",7.83,"",.0706,"",.02,"",.78,"",731,"",14.14,"",51.3,"",6.01,"",11.4,""
19:01/04/17,20:00:00,"",7.81,"",.0706,"",.02,"",.79,"",730,"",14.12,"",51.1,"",5.99,"",11.4,""
20:01/04/17,21:00:00,"",7.81,"",.0699,"",.02,"",.79,"",730,"",14.11,"",50.8,"",5.95,"",11.4,""
21:01/04/17,22:00:00,"",7.76,"",.0702,"",.02,"",.8,"",729,"",14.1,"",50.5,"",5.92,"",11.3,""
22:01/04/17,23:00:00,"",7.76,"",.0704,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.93,"",11.3,""
23:01/05/17,00:00:00,"",7.76,"",.07,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.92,"",11.3,""
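
The counting described above can be checked on a shortened stand-in for the file. This is only a sketch: the sample text, the variable names, and the use of dropna to discard the empty columns are assumptions for illustration, not something taken from the original data. "degC" replaces the real unit label so that encoding does not get in the way.

```python
import io

import pandas as pd

# Shortened stand-in for the logger file (hypothetical content, same layout):
# physical lines 0-2 are metadata, 3-4 are blank, 5 is the "Date" header,
# 6 is the units row, 7 is blank, 8-9 are data.
sample = (
    'MiniSonde 5 43656\n'
    '"Log File Name : demo"\n'
    '"Interval (HHMMSS) : 010000"\n'
    '\n'
    '\n'
    '"Date","Time","","Temp",""\n'
    '"MMDDYY","HHMMSS","","degC",""\n'
    '\n'
    '01/04/17,14:00:00,"",7.97,""\n'
    '01/04/17,15:00:00,"",7.9,""\n'
)

# skiprows given as a list uses physical line numbers, so [6] drops the
# units row. header counts only the lines left after blank lines are
# skipped, so the "Date" row is header=3 here, not 5.
df = pd.read_csv(
    io.StringIO(sample),
    skiprows=[6],
    header=3,
    skip_blank_lines=True,
)

# The "" fields between the real columns parse as missing values, so
# columns that are entirely empty can be dropped after reading.
clean = df.dropna(axis="columns", how="all")
```

In the real file the same reasoning gives header=11 and skiprows=[14]. Dropping all-empty columns afterwards is one possible answer to the side question about the blank columns; it would also drop any genuine column that happens to contain no data.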
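
The UnicodeDecodeError in the question is a separate issue from the row counting. A likely cause, assuming the logger writes a single-byte encoding such as Latin-1, is the ø in the units row: in Latin-1 it is the single byte 0xf8, which is never valid as the start of a UTF-8 sequence. The sketch below reproduces the message with plain Python; the variable names are illustrative.

```python
# U+00F8 is the o-with-stroke used in the degrees-C unit label.
# Encoded as Latin-1 (ISO-8859-1) the label becomes the bytes b"\xf8C".
units_field = "\u00f8C".encode("latin-1")

try:
    units_field.decode("utf-8")
    message = ""
except UnicodeDecodeError as exc:
    # Same complaint as in the question: byte 0xf8, invalid start byte.
    message = str(exc)

# Decoding with the encoding the bytes were written in recovers the text.
recovered = units_field.decode("latin-1")
```

If that assumption about the file holds, passing encoding="latin-1" to read_csv should let the units row be decoded. It would also explain why skipping the units row with skiprows=[14] made the error disappear in the question: the offending byte was never parsed.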
