简体   繁体   English

如何使用Pandas / Python处理Excel文件头

[英]How to process excel file headers using pandas/python

I am trying to read https://www.whatdotheyknow.com/request/193811/response/480664/attach/3/GCSE%20IGCSE%20results%20v3.xlsx using pandas. 我正在尝试使用熊猫阅读https://www.whatdotheyknow.com/request/193811/response/480664/attach/3/GCSE%20IGCSE%20results%20v3.xlsx

Having saved it my script is 保存后,我的脚本是

import sys
import pandas as pd
inputfile = sys.argv[1]
xl = pd.ExcelFile(inputfile)
#    print xl.sheet_names
df = xl.parse(xl.sheet_names[0])
print df.head()

However this does not seem to process the headers properly as it gives 但是,这似乎不能正确处理标题

  GCSE and IGCSE1 results2,3 in selected subjects4 of pupils at the end of key stage 4 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0                              Year: 2010/11 (Final)                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN
1                                  Coverage: England                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN
2                                                NaN                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN
3  1. Includes International GCSE, Cambridge Inte...                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN
4  2. Includes attempts and achievements by these...                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN

All of this should be treated as comments. 所有这些都应视为注释。

If you load the spreadsheet into libreoffice, for example, you can see that the column headings are correctly parsed and appear in row 15 with drop down menus to let you select the items you want. 例如,如果将电子表格加载到libreoffice,则可以看到列标题已正确解析,并显示在第15行中,并带有下拉菜单,可让您选择所需的项目。

How can you get pandas to automatically detect where the column headers are just as libreoffice does? 像libreoffice一样,如何让熊猫自动检测列标题的位置?

pandas is (are?) processing the file correctly, and exactly the way you asked it (them?) to. pandas正在(正在?)正确地处理文件,并且完全按照您要求(他们?)的方式来处理文件。 You didn't specify a header value, which means that it defaults to picking up the column names from the 0th row. 您没有指定header值,这意味着它默认为从第0行开始选择列名。 The first few rows of cells aren't comments in some fundamental way, they're just not cells you're interested in. 单元格的前几行不是从根本上进行注释,它们不是您感兴趣的单元格。

Simply tell parse you want to skip some rows: 只需告诉parse您想跳过一些行即可:

>>> xl = pd.ExcelFile("GCSE IGCSE results v3.xlsx")
>>> df = xl.parse(xl.sheet_names[0], skiprows=14)
>>> df.columns
Index([u'Local Authority Number', u'Local Authority Name', u'Local Authority Establishment Number', u'Unique Reference Number', u'School Name', u'Town', u'Number of pupils at the end of key stage 4', u'Number of pupils attempting a GCSE or an IGCSE', u'Number of students achieving 8 or more GCSE or IGCSE passes at A*-G', u'Number of students achieving 8 or more GCSE or IGCSE passes at A*-A', u'Number of students achieving 5 A*-A grades or more at GCSE or IGCSE'], dtype='object')
>>> df.head()
   Local Authority Number Local Authority Name  \
0                     201       City of london   
1                     201       City of london   
2                     202               Camden   
3                     202               Camden   
4                     202               Camden   

   Local Authority Establishment Number  Unique Reference Number  \
0                               2016005                   100001   
1                               2016007                   100003   
2                               2024104                   100049   
3                               2024166                   100050   
4                               2024196                   100051   

                       School Name    Town  \
0  City of London School for Girls  London   
1            City of London School  London   
2                Haverstock School  London   
3           Parliament Hill School  London   
4               Regent High School  London   

  Number of pupils at the end of key stage 4  \
0                                        105   
1                                        140   
2                                        200   
3                                        172   
4                                        174   

  Number of pupils attempting a GCSE or an IGCSE  \
0                                            104   
1                                            140   
2                                            194   
3                                            169   
4                                            171   

  Number of students achieving 8 or more GCSE or IGCSE passes at A*-G  \
0                                                100                    
1                                                108                    
2                                               SUPP                    
3                                                 22                    
4                                                  0                    

  Number of students achieving 8 or more GCSE or IGCSE passes at A*-A  \
0                                                 87                    
1                                                 75                    
2                                                  0                    
3                                                  7                    
4                                                  0                    

  Number of students achieving 5 A*-A grades or more at GCSE or IGCSE  
0                                                100                   
1                                                123                   
2                                                  0                   
3                                                 34                   
4                                               SUPP                    

[5 rows x 11 columns]

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM