
Reading an Excel file dynamically in Python

I am trying to read an Excel file that has some blank rows as well as blank columns. The process is more complicated because there are also some junk values before the header.


Currently, I am hardcoding a column name to extract the table. This has two drawbacks: what if that column is not present in the table, and what if the column name also appears as a value inside the column? Is there a way to write a program that automatically detects the table header and reads the table?

Snippet of the code:

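import pandas as pd

# Read the sheet without treating any row as the header
raw_data = pd.read_excel('test_data1.xlsx', sheet_name='Sheet8', header=None)

data_duplicate = pd.DataFrame()

# Use the first row that contains the hardcoded header cell 'Currency'
for row in range(raw_data.shape[0]):
    if (raw_data.iloc[row] == 'Currency').any():
        data_duplicate = raw_data.iloc[(row+1):].reset_index(drop=True)
        data_duplicate.columns = list(raw_data.iloc[row])
        break  # stop at the first match so a later 'Currency' value cannot override it
# Drop columns that are entirely empty
data_duplicate.dropna(axis=1, how='all', inplace=True)
data_duplicate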


Also, the number of blank rows plus garbage rows before the header is not fixed.

Here's my way: you can drop every row and every column that consists entirely of NaN:

import pandas as pd

data = pd.read_excel('test.xlsx')
# Drop columns, then rows, that are entirely NaN
data = data.dropna(how='all', axis=1)
data = data.dropna(how='all', axis=0)
data = data.reset_index(drop=True)
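
To see what `how='all'` does without an Excel file at hand, here is a minimal sketch on a small in-memory DataFrame. The column names and values are made up for illustration only. Note that a row or column is removed only when every cell in it is NaN; partially filled rows are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a sheet with one empty row and one empty column
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, np.nan],   # entirely empty column
    "c": ["x", np.nan, np.nan],
})

cleaned = (
    df.dropna(how="all", axis=1)     # removes column "b"
      .dropna(how="all", axis=0)     # removes the middle row
      .reset_index(drop=True)
)
print(cleaned)
```

The last row survives even though its "c" cell is empty, because `how="all"` only targets fully empty rows and columns.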

It is better to put this cleanup into a function if you need to process multiple DataFrames in the same code:

data = pd.read_excel('test.xlsx')

def remove_nans(df):
    x = df.dropna(how = 'all', axis = 1)
    x = x.dropna(how = 'all', axis = 0)
    x = x.reset_index(drop = True)
    return x

df = remove_nans(data)
print(df)
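
Dropping empty rows and columns does not by itself skip junk values above the header. One possible way to avoid a hardcoded column name is sketched below. It is a heuristic, not a general solution: it assumes the junk rows above the table have fewer filled cells than the header row, and it takes the first "widest" row as the header. The function name `extract_table` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def extract_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Heuristic sketch: treat the first row with the most filled cells as the header."""
    # Drop rows and columns that are entirely empty
    trimmed = raw.dropna(how="all", axis=0).dropna(how="all", axis=1)
    # Count filled cells per row; junk rows are ASSUMED to be narrower than the header
    filled = trimmed.notna().sum(axis=1)
    header_pos = trimmed.index.get_loc(filled.idxmax())  # idxmax returns the first maximum
    header = trimmed.iloc[header_pos]
    keep = header.notna().to_numpy()  # ignore columns that have no header cell
    table = trimmed.iloc[header_pos + 1:].loc[:, keep].reset_index(drop=True)
    table.columns = header[keep].tolist()
    return table

# Hypothetical sheet contents, as read with header=None
raw = pd.DataFrame([
    ["junk", np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan, np.nan],
    [np.nan, "Currency", "Amount", np.nan],
    [np.nan, "USD", 10, np.nan],
    [np.nan, "EUR", 20, np.nan],
])

table = extract_table(raw)
print(table)
```

With a real file, `raw` would come from `pd.read_excel(..., header=None)` so that no row is consumed as the header before detection. If a junk row can be as wide as the header, this heuristic picks the wrong row and a stricter rule is needed.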
