Skipping one specific Excel tab among multiple tabs in multiple Excel files (Pandas Python)

Question

I have a routine that converts multiple Excel files, each with multiple tabs and multiple columns, into a dictionary of dictionaries. Some tabs are present in only some of the files, but the column structure inside each tab is the same across all files. I'm running into a problem when I try to skip one specific tab. I know that the sheet_name parameter of the pandas read_excel function specifies which sheets to include in the data structure. The problem is that I want to skip one specific tab (Sheet1) in every Excel file, and the other tab names I list in sheet_name are not present in every file. Please let me know if there are any workarounds. Thank you!

import os

import pandas as pd

# Assigning the path to the folder variable
folder = r'specified_path'

# Changing the directory to the database directory
os.chdir(folder)

# Getting the list of files from the assigned path
files = os.listdir(folder)

# Joining each file name to the assigned path
for archivedlist in files:
    local_path = os.path.join(folder, archivedlist)
    print("Joined Path: ", local_path)

# Reading the data from the files into a dictionary
def readdataframe(files):
    df_dict = {}
    for element in files:
        # This is where the issue appears: not every file contains every listed sheet
        df_dict[element] = pd.read_excel(element, sheet_name=["Sheet2", "Sheet3", "Sheet4",
                                                              "Sheet5", "Sheet6", "Sheet7",
                                                              "Sheet8"])
        print(df_dict[element].keys())
    return df_dict

print(readdataframe(files))

I want to skip Sheet1 in every Excel file where it is present, and extract Sheet2 through Sheet8 from every file where they are present. As a side note, I could extract all the data from all the Excel files when I used sheet_name = None, but that is not the expected result.

Lastly, every tab extracted from the Excel files should be a pandas data frame.

Answer 1

I was able to resolve this by creating two functions. The first takes as input the name of the sheet I want to skip/delete and the master dictionary (df_dict). Below is the code for the function:

def delete_key(rm_key, df_dict):
    '''This routine is used to delete any tab from a nested dictionary'''

    # If the tab name is present at this level of the dictionary, delete it directly
    if rm_key in df_dict:
        del df_dict[rm_key]

    # Loop over the remaining values and recurse into any nested dictionaries
    for val in df_dict.values():
        if isinstance(val, dict):
            # Do not reassign df_dict here, or the function would return the
            # last nested dictionary instead of the master dictionary
            delete_key(rm_key, val)

    return df_dict
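
If changing the master dictionary in place is not desirable, the same result can be sketched with a dictionary comprehension that builds a new nested dictionary. The name drop_sheet is hypothetical and not part of the original answer; it assumes the two-level layout {file name: {tab name: data frame}} used here, and plain strings stand in for data frames to keep the example self-contained.

```python
def drop_sheet(rm_key, df_dict):
    """Return a copy of a {file: {tab: frame}} dictionary without the tab rm_key.

    Hypothetical helper: unlike delete_key, it leaves the input untouched.
    """
    return {
        filename: {tab: frame for tab, frame in sheets.items() if tab != rm_key}
        for filename, sheets in df_dict.items()
    }


# Plain strings stand in for the pandas data frames
example = {
    "1.xlsx": {"Sheet1": "frame a", "Sheet2": "frame b"},
    "2.xlsx": {"Sheet2": "frame c"},  # Sheet1 is absent here, which is fine
}
cleaned = drop_sheet("Sheet1", example)
print(cleaned)
```

Because the input is not modified, the full dictionary remains available if Sheet1 is needed later.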

We need to call this function once we get our data structure from the routine mentioned in the question. The changes in that routine are as follows:

import os

import pandas as pd

folder = r'specified_path'

files = os.listdir(folder)

def readdataframe(files):
    '''This routine is used to read multiple excel files into a nested
    dictionary of data frames'''
    df_dict = {}

    for element in files:
        # sheet_name=None reads every tab into a dict of {tab name: DataFrame}
        df_dict[element] = pd.read_excel(os.path.join(folder, element), sheet_name=None)

        # Each value is already a DataFrame, so no further conversion is needed
        for num in df_dict[element]:
            print("Filename: ", element, "Tab Name: ", num, "Type: ", type(df_dict[element][num]))
    return df_dict
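
Note that os.listdir returns every entry in the folder, so anything that is not a workbook (or an Excel lock file such as ~$1.xlsx, which exists while a workbook is open) would also be passed to read_excel. A minimal sketch of narrowing the list first, assuming the workbooks use the .xlsx extension; list_workbooks is a hypothetical helper name, and the temporary folder exists only to make the example runnable:

```python
import tempfile
from pathlib import Path


def list_workbooks(folder):
    """Return the sorted .xlsx file names in folder, skipping Excel lock files."""
    return sorted(
        path.name
        for path in Path(folder).glob("*.xlsx")
        if path.is_file() and not path.name.startswith("~$")
    )


# Demonstration in a throwaway folder with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1.xlsx", "2.xlsx", "~$1.xlsx", "notes.txt"):
        (Path(tmp) / name).touch()
    workbooks = list_workbooks(tmp)

print(workbooks)
```

The returned names could be passed to readdataframe in place of the raw os.listdir result.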

When we execute both of these functions, we get a dictionary of data frames that does not contain the sheet we want to skip.

Please follow these routines, and they will work. Let me know if you face any issues.

For simplicity, I have created three Excel files with the same tabs inside them (Sheet1, Sheet2, Sheet3). The columns inside the tabs are also the same. Please check the output below. We get this output by running the readdataframe(files) function.

Output:

Joined Path:  specified_path\1.xlsx
Joined Path:  specified_path\2.xlsx
Joined Path:  specified_path\3.xlsx
Filename:  1.xlsx Tab Name:  Sheet1 Type:  <class 'pandas.core.frame.DataFrame'>
Filename:  1.xlsx Tab Name:  Sheet2 Type:  <class 'pandas.core.frame.DataFrame'>
Filename:  1.xlsx Tab Name:  Sheet3 Type:  <class 'pandas.core.frame.DataFrame'>
Filename:  2.xlsx Tab Name:  Sheet1 Type:  <class 'pandas.core.frame.DataFrame'>
Filename:  2.xlsx Tab Name:  Sheet2 Type:  <class 'pandas.core.frame.DataFrame'>
Filename:  2.xlsx Tab Name:  Sheet3 Type:  <class 'pandas.core.frame.DataFrame'>
Filename:  3.xlsx Tab Name:  Sheet1 Type:  <class 'pandas.core.frame.DataFrame'>
Filename:  3.xlsx Tab Name:  Sheet2 Type:  <class 'pandas.core.frame.DataFrame'>
Filename:  3.xlsx Tab Name:  Sheet3 Type:  <class 'pandas.core.frame.DataFrame'>
{'1.xlsx': {'Sheet1':    A  B  C  D
0  1  1  1  2
1  2  2  4  2
2  3  3  2  4
3  4  1  3  3, 'Sheet2':    A
0  1
1  2
2  3
3  4, 'Sheet3':    B
0  3
1  4
2  5
3  6}, '2.xlsx': {'Sheet1':    A  B  C  D
0  1  1  1  2
1  2  2  4  2
2  3  3  2  4
3  4  1  3  3, 'Sheet2':    A
0  1
1  2
2  3
3  4, 'Sheet3':    B
0  3
1  4
2  5
3  6}, '3.xlsx': {'Sheet1':    A  B  C  D
0  1  1  1  2
1  2  2  4  2
2  3  3  2  4
3  4  1  3  3, 'Sheet2':    A
0  1
1  2
2  3
3  4, 'Sheet3':    B
0  3
1  4
2  5
3  6}}

Once we get this output, we can delete Sheet1 by calling delete_key('Sheet1', df_dict). The output after running this function is as follows:

Output:

{'1.xlsx': {'Sheet2':    A
0  1
1  2
2  3
3  4,
'Sheet3':    B
0  3
1  4
2  5
3  6},
'2.xlsx': {'Sheet2':    A
0  1
1  2
2  3
3  4,
'Sheet3':    B
0  3
1  4
2  5
3  6},
'3.xlsx': {'Sheet2':    A
0  1
1  2
2  3
3  4,
'Sheet3':    B
0  3
1  4
2  5
3  6}}

This shows that Sheet1 was removed from all the Excel files.
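
An alternative that avoids reading Sheet1 at all is to ask each workbook which tabs it contains and request only the wanted ones that exist. The sketch below is not part of the original answer; it relies on the pandas ExcelFile class and its sheet_names attribute, the helper names pick_sheets and read_wanted_sheets are hypothetical, and the WANTED list mirrors the one in the question.

```python
WANTED = ["Sheet2", "Sheet3", "Sheet4", "Sheet5", "Sheet6", "Sheet7", "Sheet8"]


def pick_sheets(available, wanted=WANTED):
    """Return the wanted tab names that are actually present, in wanted order."""
    present = set(available)
    return [name for name in wanted if name in present]


def read_wanted_sheets(path, wanted=WANTED):
    """Read only the wanted tabs that exist in the workbook at path.

    Returns {tab name: DataFrame}; Sheet1 is never read because it is not in wanted.
    """
    import pandas as pd  # imported here so pick_sheets works without pandas installed

    with pd.ExcelFile(path) as workbook:
        names = pick_sheets(workbook.sheet_names, wanted)
        if not names:
            return {}  # nothing to read from this workbook
        return pd.read_excel(workbook, sheet_name=names)


# pick_sheets on its own, with the tab names from the example files above
selected = pick_sheets(["Sheet1", "Sheet2", "Sheet3"])
print(selected)
```

Calling read_wanted_sheets(os.path.join(folder, element)) for each file, in place of pd.read_excel with sheet_name = None, would make the delete_key step unnecessary for this case.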

Solution 1
0 Accepted 2022-09-19 22:47:09
