Python Pandas - Loop through a folder of .xlsx files, only add data from Excel tabs with xx.xx in the name using regex

Question

I have a dataset of nearly 30 Excel files. I've been able to loop through the folder and append them all into one dataset. My problem is that some of these Excel files have more than one tab's worth of data I need. All of the tabs I need have the same date pattern in the tab name (e.g., 01.21). I know the regex pattern I need; what I don't know is how to use Pandas to loop through each Excel file, check the tab names with regex, and only add data from tabs that have xx.xx in the name. For example, if I opened an Excel file and there were 3 tabs: "data 01.22", "financials", and "data 03.23", I would only want it to add data from "data 01.22" and "data 03.23".

The regex pattern I need to identify the name pattern in these tabs is [0-9][0-9]+\.[0-9][0-9]. I know I'm close, but I am missing something key, and any help is appreciated.

import pandas as pd
import os
import re

# filenames
files = os.listdir()    
excel_names = list(filter(lambda f: f.endswith('.xlsx'), files))

# read them in
excels = [pd.ExcelFile(name, engine='openpyxl') for name in excel_names]

# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]

#These are the tabs 
sh = [x.sheet_names for x in excels]

# I know I need to use this regex below, but where is the question:

#sheet_match = re.findall("[0-9][0-9]+\.[0-9][0-9]", s)

# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]

# concatenate 
combined = pd.concat(frames)

# export 
combined.to_excel("combinedfiles.xlsx", header=False, index=False)

Answer 1

You're really close; you just have to filter the sheet names with re.search. (re.match only matches at the start of the string, so it would miss names like "data 01.22", where the date comes after other text.) Loop through each Excel file; for each file, open it, get the list of tab names (excel_file.sheet_names), and use re.search with the expression you already defined to keep only the tabs that match the desired pattern. Read the content of these sheets (sheet_name=valid_sheets), adjusting headers and index as needed for your particular case, then add the extracted content of each Excel file to a list. Concatenate the list with pd.concat and generate the new Excel file.

import os
import re

import pandas as pd

# filenames: every .xlsx file in the current working directory
excel_names = [f for f in os.listdir() if f.endswith('.xlsx')]

# two or more digits, a literal dot, then two digits (e.g. 01.21)
regex = r'[0-9][0-9]+\.[0-9][0-9]'

frame_list = []
# loop through each Excel file
for name in excel_names:
    # open one Excel file
    excel_file = pd.ExcelFile(name, engine='openpyxl')
    # get the list of tabs that have xx.xx anywhere in the name
    # (re.search scans the whole name; re.match only tests the start)
    valid_sheets = [tab for tab in excel_file.sheet_names if re.search(regex, tab)]
    # skip files that have no matching tabs
    if not valid_sheets:
        continue
    # read the content from that tab list; passing a list of names
    # returns a dict of {sheet name: DataFrame}
    d = excel_file.parse(sheet_name=valid_sheets, header=0)
    # add the content to the frame list
    frame_list += list(d.values())

combined = pd.concat(frame_list)
# header=False leaves the column names out of the output; drop it to keep them
combined.to_excel("combinedfiles.xlsx", header=False, index=False)
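
To check the tab filtering on its own, without opening any Excel files, here is a minimal sketch that runs the pattern from the question against the three example tab names. The names DATE_TAB and matching_tabs are made up for this illustration and are not part of pandas. It also shows why the choice between re.search and re.match matters for names where the date is not at the start.

```python
import re

# Pattern from the question: two or more digits, a literal dot, two digits.
DATE_TAB = re.compile(r"[0-9][0-9]+\.[0-9][0-9]")


def matching_tabs(sheet_names):
    """Return the tab names that contain an xx.xx date anywhere in the name."""
    return [tab for tab in sheet_names if DATE_TAB.search(tab)]


# Example tab names taken from the question.
tabs = ["data 01.22", "financials", "data 03.23"]

# search() scans the whole name, so both dated tabs are kept.
print(matching_tabs(tabs))  # ['data 01.22', 'data 03.23']

# match() only tests the start of the name, so nothing is kept here.
print([tab for tab in tabs if DATE_TAB.match(tab)])  # []
```

If your tab names always start with the date (for example "01.22"), re.match and re.search give the same result; re.search is simply the safer default when the date can appear anywhere in the name.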

Solution 1
1 Accepted 2021-04-05 01:40:15
