
Python Pandas - loop through folder of .xlsx files, only add data from Excel tabs with xx.xx in the name using regex

I have a dataset of nearly 30 Excel files. I've been able to loop through the folder and append them all into one dataset. My problem is that some of these Excel files have more than one tab's worth of data I need. All of the tabs I need have the same date pattern in the tab name (e.g., 01.21). Regex is clearly what I need, and I know the pattern; what I don't know is how to use Pandas to loop through each Excel file, check the tab names with the regex, and only add data from tabs that have xx.xx in the name. For example, if I opened an Excel file and there were 3 tabs: "data 01.22", "financials", and "data 03.23", I would only want it to add data from "data 01.22" and "data 03.23".

The regex pattern I need to identify the name pattern in these tabs is [0-9][0-9]+\.[0-9][0-9]. I know I'm close, but I am missing something key, and any help is appreciated.

import pandas as pd
import os
import re

# filenames
files = os.listdir()    
excel_names = list(filter(lambda f: f.endswith('.xlsx'), files))

# read them in
excels = [pd.ExcelFile(name, engine='openpyxl') for name in excel_names]

# turn them into dataframes
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]

#These are the tabs 
sh = [x.sheet_names for x in excels]

# I know I need to use this regex below, but where is the question:

#sheet_match = re.findall("[0-9][0-9]+\.[0-9][0-9]", s)

# delete the first row for all frames except the first
# i.e. remove the header row -- assumes it's the first
frames[1:] = [df[1:] for df in frames[1:]]

# concatenate 
combined = pd.concat(frames)

# export 
combined.to_excel("combinedfiles.xlsx", header=False, index=False)


You're very close; you just need to filter the sheet names with the regex. Loop through the Excel files and, for each one, open it, get the list of tab names (excel_file.sheet_names), and keep only the tabs whose names contain the pattern you already defined. Use re.search for this check rather than re.match: re.match only matches at the start of the string, so a name like "data 01.22" would be skipped. Read the content of the matching sheets (sheet_name=valid_sheets), adjusting the header and index options as needed for your particular case, then add the extracted content of each Excel file to a list. Finally, concatenate the list with pd.concat and write the new Excel file.

import pandas as pd
import os
import re

# filenames
files = os.listdir()
excel_names = list(filter(lambda f: f.endswith('.xlsx'), files))

regex = r'[0-9][0-9]+\.[0-9][0-9]'

frame_list = []
# loop through each Excel file
for name in excel_names:
    # open one excel file
    excel_file = pd.ExcelFile(name, engine='openpyxl')
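    # get the list of tabs that have xx.xx anywhere in the name
    # (re.search scans the whole name; re.match only tests the start,
    # so a tab such as "data 01.22" would be skipped)
    valid_sheets = [tab for tab in excel_file.sheet_names if re.search(regex, tab)]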
    # read the content of those tabs; given a list of sheet names,
    # parse returns a dict of {sheet name: DataFrame}
    d = excel_file.parse(sheet_name=valid_sheets, header=0)
    # add the content to the frame list
    frame_list += list(d.values())

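# pd.concat raises a ValueError if no tab matched in any file (empty list)
combined = pd.concat(frame_list)
# header=False leaves the column names out of the output file;
# drop that argument if you want a header row
combined.to_excel("combinedfiles.xlsx", header=False, index=False)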
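
To check the tab filter on its own, without opening any workbooks, here is a minimal sketch that uses only the re module. The helper name select_date_tabs and the DATE_TAB constant are made up for this illustration (they are not part of pandas), and the tab names are the ones from the example in the question.

```python
import re

# Same pattern as above: two or more digits, a literal dot, two digits.
DATE_TAB = re.compile(r"[0-9][0-9]+\.[0-9][0-9]")


def select_date_tabs(sheet_names):
    """Return the sheet names that contain an xx.xx date anywhere in the name."""
    return [name for name in sheet_names if DATE_TAB.search(name)]


# Tab names taken from the example in the question.
tabs = ["data 01.22", "financials", "data 03.23"]

# match only tests the start of the string, so names with a prefix are missed.
matched_at_start = [name for name in tabs if DATE_TAB.match(name)]
print(matched_at_start)        # []

# search scans the whole name and finds the date wherever it sits.
print(select_date_tabs(tabs))  # ['data 01.22', 'data 03.23']
```

If the tab names always start with the date (for example "01.21"), both functions give the same result; the difference only shows up when there is text before the date.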


