
Reading Excel file dynamically in Python

I am trying to read an Excel file that has some blank rows as well as blank columns. The process is more complicated because there are also some junk values before the header.


Currently, I am hardcoding a column name to locate the table. This has two drawbacks: what if that column is not present in the table, and what if the column name also appears as a value in one of the cells? Is there a way to write a program that automatically detects the table header and reads the table?

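Snippet of the code:

import pandas as pd

# Read the sheet without a header so the junk rows stay in the data
raw_data = pd.read_excel('test_data1.xlsx', sheet_name='Sheet8', header=None)

data_duplicate = pd.DataFrame()

# Look for the hardcoded header cell 'Currency' and stop at the first match
found = False
for row in range(raw_data.shape[0]):
    for col in range(raw_data.shape[1]):
        if raw_data.iloc[row, col] == 'Currency':
            data_duplicate = raw_data.iloc[(row+1):].reset_index(drop=True)
            data_duplicate.columns = list(raw_data.iloc[row])
            found = True
            break
    if found:
        break
# Drop the columns that are completely empty
data_duplicate = data_duplicate.dropna(axis=1, how='all')
data_duplicate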


Also, the number of blank rows and garbage rows before the header is not fixed.

Here's my approach: you can drop every row and every column that contains only NaN values.

import pandas as pd

data = pd.read_excel('test.xlsx')
data = data.dropna(how='all', axis=1)  # drop columns that are entirely NaN
data = data.dropna(how='all', axis=0)  # drop rows that are entirely NaN
data = data.reset_index(drop=True)
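
To see what these calls do without needing a workbook, here is a small sketch on an in-memory DataFrame. The values are made up for illustration and stand in for a sheet read with header=None. Note that how='all' only removes rows and columns that are entirely NaN, so a junk row holding even one value is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for a sheet read with header=None
raw = pd.DataFrame([
    [np.nan, np.nan, np.nan, np.nan],        # blank row
    [np.nan, "junk", np.nan, np.nan],        # junk row before the header
    [np.nan, np.nan, np.nan, np.nan],        # blank row
    [np.nan, "Currency", np.nan, "Amount"],  # header row
    [np.nan, "USD", np.nan, 10],
    [np.nan, "EUR", np.nan, 20],
])

# Drop all-NaN columns, then all-NaN rows, then renumber the rows
cleaned = (
    raw.dropna(how="all", axis=1)
       .dropna(how="all", axis=0)
       .reset_index(drop=True)
)
print(cleaned)
```

The blank rows and the two empty columns are gone, but the junk row is still the first row of the result, so the header still has to be located separately.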

It is better to put this into a function if you need to clean multiple DataFrames in the same code:

data = pd.read_excel('test.xlsx')

def remove_nans(df):
    # Drop all-NaN columns, then all-NaN rows, then renumber the rows
    x = df.dropna(how='all', axis=1)
    x = x.dropna(how='all', axis=0)
    x = x.reset_index(drop=True)
    return x

df = remove_nans(data)
print(df)
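
Dropping empty rows and columns does not by itself find the header when junk rows come first. One possible sketch, not a definitive solution: after the cleanup, treat the first row in which every remaining cell has a value as the header. This assumes the junk rows fill only some of the table's columns. The function name extract_table and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def extract_table(raw):
    """Hypothetical helper: use the first fully populated row as the header."""
    # Remove rows and columns that are entirely empty
    df = raw.dropna(how="all", axis=1).dropna(how="all", axis=0)
    # Labels of the rows in which every remaining cell has a value
    full_rows = df.index[df.notna().all(axis=1)]
    if len(full_rows) == 0:
        raise ValueError("no fully populated row found to use as the header")
    header_label = full_rows[0]
    # Position of the header row within the cleaned frame
    pos = df.index.get_loc(header_label)
    # Everything below the header is the table body
    table = df.iloc[pos + 1:].reset_index(drop=True)
    table.columns = df.loc[header_label].tolist()
    return table

# Made-up data standing in for pd.read_excel(..., header=None)
raw = pd.DataFrame([
    [np.nan, np.nan, np.nan, np.nan],
    [np.nan, "Report 2020", np.nan, np.nan],  # junk row: only one cell filled
    [np.nan, np.nan, np.nan, np.nan],
    [np.nan, "Currency", np.nan, "Amount"],   # header row
    [np.nan, "USD", np.nan, 10],
    [np.nan, np.nan, np.nan, np.nan],
    [np.nan, "EUR", np.nan, 20],
])

table = extract_table(raw)
print(table)
```

With a real workbook you would pass in the result of pd.read_excel('test_data1.xlsx', sheet_name='Sheet8', header=None). The heuristic fails if a junk row happens to fill every column of the table, or if the header itself has an empty cell, so check it against your own sheets.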
