简体   繁体   中英

How to detect the number of the rows to skip in reading excel file on pandas

I'd like to read a.xlsx using python pandas. The problem is the at the beginning of the excel file, it has some additional data like title or description of the table and tables contents starts. That introduce the unnamed columns because pandas DataReader takes it as the columns. But tables contents starts after few lines later.

A                              B                     C
this is description
last updated: Mar 18th,2014
                               Table content
Country                        Year                 Product_output
Canada                         2017                 3002
Bulgaria                       2016                 2201
...

The table content starts in line 4. And columns must be "Country", "year", "proudct_output" instead "this is description", "unnamed", "unnamed". For this specific case, setting skiprows parameter to 3 solved the issue(from Mikhail Venkov). But I have to deal with many excel files and I don't know how many rows to skip in advance. I think there might be a solution since each table column header has a filter.

If you know that a specific text (like country) must be in the first column, you can do the following:

import xlrd
xl_work = xlrd.open_workbook("Classeur1.xlsx")
mySheet = xl_work.sheet_by_index(0)

nl = 0
while mySheet.cell_value(nl,0) != "Country" :
    nl += 1

line_with_headers = nl

Then use skiprows with nl instead of 3.

I am also looking for the same solution as you, however, I can make your code shorter and more efficient:

import pandas as pd


file = pd.read_excel("Classeur1.xlsx", header = 10)
file.head()

This way you can read can skip all the rows from 0 ==> 9 and start reading from the 10th row.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM