I have this code:
df1 = pd.read_excel('DIRECTORY\\file.xlsm', sheet_name='Resume', header=1, usecols='A:I')
#some operations
bf1 = pd.read_excel('DIRECTORY\\file.xlsm', sheet_name='Resume', header=1, usecols='K:P')
#some operations
Final_file = pd.concat([df1,bf1], ignore_index=True)
Note that df1 and bf1 read the same file; the only difference is the columns being read.
I have a lot of files.
Is it possible to go through folders and subfolders, search for a filename pattern, and build a list of dataframes to read, instead of writing out each path by hand?
Here is a code snippet that might help:
import os
import pandas as pd

source = r'C:\Mypath\SubFolder'
frames = []
for root, dirs, files in os.walk(source):
    for name in files:
        if name.endswith((".xls", ".xlsx", ".xlsm")):
            filetoprocess = os.path.join(root, name)
            frames.append(pd.read_excel(filetoprocess, sheet_name='Resume',
                                        header=1, usecols='A:I'))
Hope that helps.
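To mirror the question's two column ranges, each discovered file could first go through a small helper (read_resume is a hypothetical name, not part of the original answer) before being collected:

```python
import pandas as pd

def read_resume(path):
    # Read both column ranges from the 'Resume' sheet and stack them,
    # exactly as the question does for a single file
    df1 = pd.read_excel(path, sheet_name='Resume', header=1, usecols='A:I')
    bf1 = pd.read_excel(path, sheet_name='Resume', header=1, usecols='K:P')
    return pd.concat([df1, bf1], ignore_index=True)
```

Calling frames.append(read_resume(filetoprocess)) inside the walk loop then yields one combined frame per file.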
You can use the glob library to do this:
from glob import glob
filenames = glob('./Folder/pattern*.xlsx') #pattern is the common pattern in filenames
dataframes = [pd.read_excel(f) for f in filenames] #sequentially read all the files and create a dataframe for each file
master_df = pd.concat(dataframes) #master dataframe after concatenating all the dataframes
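One caveat: a plain pattern like './Folder/pattern*.xlsx' does not descend into subfolders, and the question asks for those too. glob supports a recursive '**' form (Python 3.5+) that might cover this:

```python
from glob import glob
import pandas as pd

# '**' with recursive=True matches zero or more intermediate folders,
# so both './Folder/x.xlsx' and './Folder/a/b/x.xlsx' are found
filenames = glob('./Folder/**/pattern*.xlsx', recursive=True)
dataframes = [pd.read_excel(f) for f in filenames]
```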
You can search recursively with pathlib's glob-style matching. Note that parent_path should be the top-level folder you want to search.
from pathlib import Path
files = [file for file in Path(parent_path).rglob('*filename*.xls')]
This will return a list of files that match your condition. You can then read them in a list comprehension and concat the results:
dfs = [pd.read_excel(file, sheet_name='Resume', header=1, usecols='A:I') for file in files]
df1 = pd.concat(dfs)
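One thing to watch: the pattern '*filename*.xls' matches only the .xls extension, while the question's files are .xlsm. A hedged sketch (find_excel_files is my own helper name) that accepts all three extensions:

```python
from pathlib import Path

def find_excel_files(parent_path, stem_pattern="filename"):
    # Keep any file whose name contains stem_pattern and whose
    # extension is one of the three Excel formats in the question
    exts = {".xls", ".xlsx", ".xlsm"}
    return [p for p in Path(parent_path).rglob(f"*{stem_pattern}*")
            if p.is_file() and p.suffix.lower() in exts]
```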
The following function takes in a path and returns the most recently modified file from each group of versioned files. We build the groups by splitting each file's stem on a delimiter, so sales_v1, sales_v2 and sales_v3 all map to the key sales; we then keep the latest modified file in each group.
import pandas as pd
from pathlib import Path

def get_latest_files(path):
    files = {
        f: pd.Timestamp(f.stat().st_mtime, unit="s") for f in Path(path).rglob("*.csv")
    }
    df = (
        pd.DataFrame.from_dict(files, orient="index")
        .reset_index()
        .rename(columns={"index": "path", 0: "seconds"})
    )
    df["dupe_files"] = df["path"].apply(lambda x: x.stem).str.split("_", expand=True)[0]
    max_files = (
        df.groupby(["dupe_files", "path"])
        .max()
        .groupby(level=0)["seconds"]
        .nlargest(1)
        .to_frame()
        .reset_index(-1)["path"]
        .tolist()
    )
    return max_files
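For illustration, the stem-splitting step maps versioned names onto a single grouping key (the file names below are hypothetical):

```python
from pathlib import Path

# Splitting a file's stem on "_" yields the grouping key used above
keys = [Path(name).stem.split("_")[0]
        for name in ["sales_v1.csv", "sales_v2.csv", "inventory_v1.csv"]]
# keys == ["sales", "sales", "inventory"]
```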