
pandas read_csv: header/skiprows not working

All-

First time asking a question here, apologies if format is bad, please let me know how to improve my question.

I am seeking a better understanding of the header and skiprows arguments of the pandas.read_csv() function.

Here is an example of the raw data I am trying to read in python:

MiniSonde 5 43656
"Log File Name : lwrhyp_deploy_20170104"
"Setup Date (MMDDYY) : 010417"
"Setup Time (HHMMSS) : 114539"
"Starting Date (MMDDYY) : 010417"
"Starting Time (HHMMSS) : 140000"
"Stopping Date (MMDDYY) : 123169"
"Stopping Time (HHMMSS) : 235959"
"Interval (HHMMSS) : 010000"
"Sensor warmup (HHMMSS) : 000100"
"Circltr warmup (HHMMSS) : 000030"


"Date","Time","","Temp","","SpCond","","Sal","","Dep25","","TDG","","TDG","","LDO%","","LDO","","IBatt",""
"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","meters","","mmHg","","psia","","Sat","","mg/l","","Volts",""

01/04/17,14:00:00,"",7.97,"",.0691,"",.02,"",.75,"",735,"",14.22,"",52.7,"",6.15,"",11.4,""
01/04/17,15:00:00,"",7.9,"",.0692,"",.02,"",.76,"",736,"",14.23,"",52.8,"",6.17,"",11.4,""
01/04/17,16:00:00,"",7.89,"",.0694,"",.02,"",.77,"",736,"",14.23,"",52.3,"",6.12,"",11.4,""
01/04/17,17:00:00,"",7.88,"",.0699,"",.02,"",.78,"",735,"",14.21,"",51.8,"",6.06,"",11.4,""
01/04/17,18:00:00,"",7.85,"",.0699,"",.02,"",.78,"",733,"",14.18,"",51.3,"",6.01,"",11.4,""
01/04/17,19:00:00,"",7.83,"",.0706,"",.02,"",.78,"",731,"",14.14,"",51.3,"",6.01,"",11.4,""
01/04/17,20:00:00,"",7.81,"",.0706,"",.02,"",.79,"",730,"",14.12,"",51.1,"",5.99,"",11.4,""
01/04/17,21:00:00,"",7.81,"",.0699,"",.02,"",.79,"",730,"",14.11,"",50.8,"",5.95,"",11.4,""
01/04/17,22:00:00,"",7.76,"",.0702,"",.02,"",.8,"",729,"",14.1,"",50.5,"",5.92,"",11.3,""
01/04/17,23:00:00,"",7.76,"",.0704,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.93,"",11.3,""
01/05/17,00:00:00,"",7.76,"",.07,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.92,"",11.3,""

I am trying to use either the row beginning with "Date" or the row beginning with "MMDDYY" as my header row. When I open the raw data in a text editor the row that corresponds to "Date" is row 14 which would be row 13 in zero-indexed python land.

I used the following code thinking that it should skip the first 12 rows and begin reading data on row 13:

test = pd.read_csv(filepath, skiprows=12, skip_blank_lines=True)

but that produces the error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

After a lot of fiddling around, trial and error style, I found that the following code produced the type of result I am after, however I do not understand why it works:

test = pd.read_csv(filepath, skiprows=[14], header=11, skip_blank_lines=True)

I do not understand how read_csv is counting the number of rows. Am I incorrect in that the header row is not on line 11 but rather is on line 13? The code only works if skiprows=[14], why is that?

On a side note, is there a way to prevent the blank columns that are present in the raw data from being read into the dataframe?

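First, skiprows isn't doing what you think it is here. When you give it a list, it skips exactly those zero-indexed lines of the file when parsing. So skiprows=[14] drops only the "MMDDYY" units row, which appears to be the row holding the 0xf8 byte ("ø") from your error message. To choose the header row, just use header instead.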
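To make the difference between an integer and a list concrete, here is a minimal sketch on a made-up four-line CSV (the sample text and variable names are illustrative assumptions, not taken from the logger file):

```python
import io
import pandas as pd

# Hypothetical miniature file: a names row, a units row, then two data rows.
raw = "a,b\nx,y\n1,2\n3,4\n"

# List form: skip only zero-indexed line 1 (the units row); line 0 stays the header.
df_list = pd.read_csv(io.StringIO(raw), skiprows=[1])

# Integer form: skip the first 1 line, so the units row becomes the header.
df_int = pd.read_csv(io.StringIO(raw), skiprows=1)

print(list(df_list.columns))  # names row used as the header
print(list(df_int.columns))   # units row used as the header
```

Both frames hold the same two data rows; only the line picked as the header differs.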

Second, pandas zero-indexes the file rows.

Third, when you have skip_blank_lines=True, it appears to reindex the rows of your file before considering the header value. So in your example, the two blank lines before your header rows (and the one after them) are not counted. Remembering that pandas zero-indexes the file rows, we can see how header=11 lines up with the header:

line : content
0:MiniSonde 5 43656
1:"Log File Name : lwrhyp_deploy_20170104"
2:"Setup Date (MMDDYY) : 010417"
3:"Setup Time (HHMMSS) : 114539"
4:"Starting Date (MMDDYY) : 010417"
5:"Starting Time (HHMMSS) : 140000"
6:"Stopping Date (MMDDYY) : 123169"
7:"Stopping Time (HHMMSS) : 235959"
8:"Interval (HHMMSS) : 010000"
9:"Sensor warmup (HHMMSS) : 000100"
10:"Circltr warmup (HHMMSS) : 000030"


11:"Date","Time","","Temp","","SpCond","","Sal","","Dep25","","TDG","","TDG","","LDO%","","LDO","","IBatt",""
12:"MMDDYY","HHMMSS","","øC","","mS/cm","","ppt","","meters","","mmHg","","psia","","Sat","","mg/l","","Volts",""

13:01/04/17,14:00:00,"",7.97,"",.0691,"",.02,"",.75,"",735,"",14.22,"",52.7,"",6.15,"",11.4,""
14:01/04/17,15:00:00,"",7.9,"",.0692,"",.02,"",.76,"",736,"",14.23,"",52.8,"",6.17,"",11.4,""
15:01/04/17,16:00:00,"",7.89,"",.0694,"",.02,"",.77,"",736,"",14.23,"",52.3,"",6.12,"",11.4,""
16:01/04/17,17:00:00,"",7.88,"",.0699,"",.02,"",.78,"",735,"",14.21,"",51.8,"",6.06,"",11.4,""
17:01/04/17,18:00:00,"",7.85,"",.0699,"",.02,"",.78,"",733,"",14.18,"",51.3,"",6.01,"",11.4,""
18:01/04/17,19:00:00,"",7.83,"",.0706,"",.02,"",.78,"",731,"",14.14,"",51.3,"",6.01,"",11.4,""
19:01/04/17,20:00:00,"",7.81,"",.0706,"",.02,"",.79,"",730,"",14.12,"",51.1,"",5.99,"",11.4,""
20:01/04/17,21:00:00,"",7.81,"",.0699,"",.02,"",.79,"",730,"",14.11,"",50.8,"",5.95,"",11.4,""
21:01/04/17,22:00:00,"",7.76,"",.0702,"",.02,"",.8,"",729,"",14.1,"",50.5,"",5.92,"",11.3,""
22:01/04/17,23:00:00,"",7.76,"",.0704,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.93,"",11.3,""
23:01/05/17,00:00:00,"",7.76,"",.07,"",.02,"",.8,"",729,"",14.09,"",50.5,"",5.92,"",11.3,""
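The counting above can be checked with a rough sketch on an abbreviated stand-in for your file. The metadata text and the reduced set of columns are placeholders. encoding="latin-1" is an assumption about how the logger wrote the file, chosen because latin-1 can decode any byte, including 0xf8. The final dropna call is one possible answer to the side question about blank columns:

```python
import io
import pandas as pd

# Abbreviated stand-in for the logger file (placeholder metadata, fewer columns):
# 11 metadata lines, 2 blank lines, names row, units row, 1 blank line, data.
meta = ["MiniSonde 5 43656"] + [f'"Metadata line {i} : value"' for i in range(1, 11)]
body = [
    "",
    "",
    '"Date","Time","","Temp","","IBatt",""',       # raw line 13, non-blank line 11
    '"MMDDYY","HHMMSS","","\xf8C","","Volts",""',  # raw line 14, holds byte 0xf8
    "",
    '01/04/17,14:00:00,"",7.97,"",11.4,""',
    '01/04/17,15:00:00,"",7.9,"",11.4,""',
]
data = "\n".join(meta + body + [""]).encode("latin-1")

# header counts non-blank lines; skiprows=[14] names a raw zero-indexed line.
df = pd.read_csv(io.BytesIO(data), encoding="latin-1", header=11, skiprows=[14])

# The empty "" fields are read as NaN, so the blank columns are entirely NaN
# and can be dropped after loading.
clean = df.dropna(axis=1, how="all")
print(clean)
```

With the encoding given explicitly, the units row can also be kept and handled after loading instead of skipped, if you want the units.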


 