
Pandas: Reading files with regex

I am trying to read multiple Excel files using wildcards and put each one in a separate dataframe using pandas.

I have read the base path and will use it as below to access subdirectories:

>>>inputs_path
'C:/Users/ABC/Downloads/Input'
>>>path1 = os.chdir(inputs_path + "/path1")
>>>fls=glob.glob("*.*")
>>>fls

['Zambia_W4.xlsm',
 'Australia_W4.xlsx',
 'France_W4.xlsx',
 'Japan_W3.xlsm',
 'India_W3.xlsx',
 'Italy_W3.xlsx',
 'MEA_W5.xlsx',
 'NE_W5.xlsm',
 'Russia_W5.xlsx',
 'Spain_W2.xlsx']
>>>path2 = os.chdir(inputs_path + "/path2")
>>>fls=glob.glob("*.*")
>>>fls

['Today.xlsm',
 'Yesterday.xlsx',
 'Tomorrow.xlsx']

Right now I am reading them as follows:

>>>df_italy = pd.read_excel("Italy_W3.xlsx",sheet_name='Sheet1')
>>>df_russia = pd.read_excel("Russia_W5.xlsx",sheet_name='Sheet3')
>>>df_france_1 = pd.read_excel("France_W4.xlsx",sheet_name='Sheet1', usecols = 'M, Q', skiprows=4)
>>>df_spain = pd.read_excel("Spain_W2.xlsx",sheet_name='Sheet2',usecols = 'T:U', skiprows=30 )
>>>df_ne = pd.read_excel("NE_W5.xlsm",sheet_name='Sheet2',usecols = 'N,P', skiprows=4 )
>>>df_ne_c = pd.read_excel("NE_W5.xlsm",sheet_name='Sheet1',usecols = 'H:J', skiprows=141 )

Since I have the filenames in the list fls, is there a way to use that list to read the files without hard-coding the actual filenames, since they change with the week number? It is also mandatory to keep the dataframe names shown above when reading the Excel files.

I am looking to read the file as

>>>df_italy = pd.read_excel("Italy*.xlsx",sheet_name='Sheet1')

Is there any way to do this?

If your file names always contain a _ to split on, you could create a dictionary with the split value as the key and the file path as the value.

Let's use pathlib, which was added in Python 3.4, as it makes working with file systems easier.
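A minimal sketch of such a dictionary, assuming every workbook name starts with the country followed by an underscore. The helper name index_by_prefix and its patterns parameter are illustrative only and are not part of pandas or pathlib.

```python
from pathlib import Path


def index_by_prefix(folder, patterns=("*.xlsx", "*.xlsm")):
    """Map the text before the first "_" in each file name to its Path."""
    index = {}
    for pattern in patterns:
        for file in Path(folder).glob(pattern):
            # 'Italy_W3.xlsx' -> stem 'Italy_W3' -> key 'italy'
            key = file.stem.split("_")[0].lower()
            index[key] = file
    return index
```

With a dictionary like this, index_by_prefix(folder)["italy"] would give the path of the Italy workbook whatever its week number is. If a folder holds several weeks for one country, the later file found simply overwrites the earlier one, so this sketch only suits folders with one file per country.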

Regex matching on the file name

Assuming the dictionary maps file names to paths, as described above, we could do this. You'll need to extend the function to deal with multiple file matches.

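import re
from pathlib import Path

import pandas as pd

# location is assumed to be the folder holding the workbooks;
# inputs_path is the base path from the question.
location = Path(inputs_path) / 'path1'

# '*.xls[xm]' picks up both .xlsx and .xlsm files
file_dict = {file.stem: file for file in location.glob('*.xls[xm]')}

# for the demonstration below, assume the numbers are paths.
files = {'Zambia_W4.xlsm': 2,
 'Australia_W4.xlsx': 5,
 'France_W4.xlsx': 0,
 'Japan_W3.xlsm': 7,
 'India_W3.xlsx': 2,
 'Italy_W3.xlsx': 6,
 'MEA_W5.xlsx': 7,
 'NE_W5.xlsm': 4,
 'Russia_W5.xlsx': 3,
 'Spain_W2.xlsx': 5}

def file_name_match(file_dict, pattern):
    # return the value of the first entry whose name matches the pattern
    for name, source in file_dict.items():
        if re.search(pattern, name, flags=re.IGNORECASE):
            return source

file_name_match(files, 'italy')
# output: 6

df = pd.read_excel(file_name_match(file_dict, 'italy'), sheet_name=...)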
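One possible way to extend the function to multiple matches is to collect every matching entry instead of returning the first. This is only a sketch: file_name_matches is a hypothetical name, and 'NE_W4.xlsm' is an invented extra file added to show two hits for one country.

```python
import re


def file_name_matches(file_dict, pattern):
    """Return every name/path pair whose name matches the pattern."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    return {name: source for name, source in file_dict.items()
            if regex.search(name)}


# As before, the numbers stand in for paths.
files = {'NE_W5.xlsm': 4, 'NE_W4.xlsm': 9, 'Italy_W3.xlsx': 6}

# Anchoring the pattern to the start of the name avoids accidental
# matches in the middle of another country's name.
matches = file_name_matches(files, r'^ne_')
```

The caller can then decide what to do with several hits, for example pick the highest week number or raise an error when more than one file is found.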

It might be feasible to simply populate a dictionary of dataframes like this:

import pandas

my_dfs = {}
for f in fls:
    # key on the file name without its extension, e.g. 'Italy_W3'
    my_dfs[f.split(".")[0]] = pandas.read_excel(f, ...)

You can also use a for loop to run the job you need for each file, which shouldn't require knowing the file name. It's also possible to read all the spreadsheets into one df and ensure there is an additional column holding the corresponding file name for each row.
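The single-dataframe idea could look roughly like the sketch below. The function name combine_with_source and the column name source_file are made up for illustration, and the small in-memory frames stand in for what pd.read_excel would return for each file.

```python
import pandas as pd


def combine_with_source(frames_by_file):
    """Stack DataFrames and record which file each row came from."""
    tagged = [df.assign(source_file=name)
              for name, df in frames_by_file.items()]
    return pd.concat(tagged, ignore_index=True)


# Stand-in data; in practice each value would come from
# pd.read_excel(name, ...) with the options that file needs.
frames = {
    'Italy_W3.xlsx': pd.DataFrame({'value': [1, 2]}),
    'Spain_W2.xlsx': pd.DataFrame({'value': [3]}),
}
combined = combine_with_source(frames)
```

This only works cleanly when the sheets share the same columns; sheets with different layouts, like the ones in the question, are probably better kept in separate dataframes.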

The code below assumes you have several files for each country and need to sort them to find the latest week.

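import glob
import os
import re

import pandas as pd

def find_country_file(country_name):
  # inputs_path is the folder holding the files (defined in the question)
  pattern = os.path.join(inputs_path, '{0}_W*.*'.format(country_name))
  all_country_files = glob.glob(pattern)
  # pull the week number out of each file name, skipping names without one
  week_numbers = [re.search('_W([0-9]+)', os.path.basename(x)) for x in all_country_files]
  week_numbers = [int(x.group(1)) for x in week_numbers if x is not None]
  latest_week_number = sorted(week_numbers, reverse=True)[0]
  latest_country_file = [x for x in all_country_files if '_W{0}.'.format(latest_week_number) in os.path.basename(x)]
  # return the full path so read_excel does not depend on the current directory
  return latest_country_file[0]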


df_italy = pd.read_excel(find_country_file('Italy') , sheet_name='Sheet1')
df_russia = pd.read_excel(find_country_file('Russia'), sheet_name='Sheet3')
df_france_1 = pd.read_excel(find_country_file('France'),sheet_name='Sheet1', usecols = 'M, Q', skiprows=4)
df_spain = pd.read_excel(find_country_file('Spain'),sheet_name='Sheet2',usecols = 'T:U', skiprows=30 )
df_ne = pd.read_excel(find_country_file('NE'),sheet_name='Sheet2',usecols = 'N,P', skiprows=4 )
df_ne_c = pd.read_excel(find_country_file('NE'),sheet_name='Sheet1',usecols = 'H:J', skiprows=141)

The function find_country_file searches the path for all files with the country name, uses a regex to pull out the week numbers, sorts them to find the highest, and then returns the file path, from the glob of all country files, that matches the latest week found.
