
Search (in folders and subfolders) and read files into a list of dataframes, using Python

I have this code:

import pandas as pd

df1 = pd.read_excel('DIRECTORY\\file.xlsm', sheet_name='Resume', header=1, usecols='A:I')
# some operations
bf1 = pd.read_excel('DIRECTORY\\file.xlsm', sheet_name='Resume', header=1, usecols='K:P')
# some operations

Final_file = pd.concat([df1,bf1], ignore_index=True)

Note that df1 and bf1 read the same file; the difference is the columns being read.

I have a lot of files.

Is it possible to go through folders and subfolders, search for a filename pattern and create a list of dataframes to read, instead of writing each path I have?

Here is a code snippet that might help:

import os
import pandas as pd

source = r'C:\Mypath\SubFolder'
dfs = []
for root, dirs, files in os.walk(source):
    for name in files:
        if name.endswith((".xls", ".xlsx", ".xlsm")):
            filetoprocess = os.path.join(root, name)
            dfs.append(pd.read_excel(filetoprocess, sheet_name='Resume', header=1, usecols='A:I'))

Hope that helps.
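To match a filename pattern (not just an extension), os.walk can be combined with fnmatch to collect all matching paths first. A minimal sketch; the function name and default pattern are just illustrative:

```python
import os
from fnmatch import fnmatch

def find_files(source, pattern="*.xls*"):
    """Recursively collect paths under source whose filename matches pattern."""
    matches = []
    for root, _dirs, files in os.walk(source):
        for name in files:
            if fnmatch(name, pattern):
                matches.append(os.path.join(root, name))
    return matches
```

The returned list can then be fed straight into a list comprehension of pd.read_excel calls.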

You can use the glob module to do this:

from glob import glob
import pandas as pd

filenames = glob('./Folder/pattern*.xlsx')  # 'pattern' is the common part of the filenames
dataframes = [pd.read_excel(f) for f in filenames]  # read each file into its own dataframe
master_df = pd.concat(dataframes)  # master dataframe after concatenating all the dataframes
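Note that a plain glob pattern only searches the one folder it names; to descend into subfolders as well, use '**' together with recursive=True (Python 3.5+). A small sketch, with the folder and pattern as placeholders:

```python
import os
from glob import glob

def find_matching(folder, pattern):
    # With recursive=True, '**' matches any number of intermediate
    # directories (including none), so subfolders are searched too.
    return glob(os.path.join(folder, '**', pattern), recursive=True)
```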


You can use pathlib's recursive glob method, rglob.

Note that parent_path should be the top-level folder you want to search.

from pathlib import Path

files = list(Path(parent_path).rglob('*filename*.xls*'))

This returns a list of the files that match your pattern. You can then build the dataframes in a list comprehension and concat them:

dfs = [pd.read_excel(file, sheet_name='Resume', header=1, usecols='A:I') for file in files]

df1 = pd.concat(dfs)
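Once the per-file dataframes are built, it can be handy to remember which workbook each row came from before concatenating. A sketch using the keys argument of pd.concat; concat_with_source and frames_by_name are hypothetical names, with the dict mapping each filename to its DataFrame:

```python
import pandas as pd

def concat_with_source(frames_by_name):
    # Tag each row with the file it came from by turning the dict keys
    # into an extra index level, then move that level out into a column.
    combined = pd.concat(
        list(frames_by_name.values()),
        keys=list(frames_by_name.keys()),
        names=["source_file", None],
    )
    return combined.reset_index(level="source_file")
```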

Edit: Latest File by Modified Time

We can use the following function to take in a path and return a list of paths to the most recently modified file in each group. Files are grouped by splitting the stem on a delimiter, so sales_v1, sales_v2 and sales_v3 all become sales; we then keep the latest modified file from each group.

import pandas as pd
from pathlib import Path

def get_latest_files(path):
    # Map each CSV under path to its last-modified time.
    files = {
        f: pd.Timestamp(f.stat().st_mtime, unit="s") for f in Path(path).rglob("*.csv")
    }

    df = (
        pd.DataFrame.from_dict(files, orient="index")
        .reset_index()
        .rename(columns={"index": "path", 0: "seconds"})
    )

    # Group variants of the same file by the stem before the first underscore,
    # so sales_v1, sales_v2 and sales_v3 all fall under "sales".
    df["dupe_files"] = df["path"].apply(lambda x: x.stem).str.split("_", expand=True)[0]

    # Keep only the most recently modified file in each group.
    max_files = (
        df.groupby(["dupe_files", "path"])
        .max()
        .groupby(level=0)["seconds"]
        .nlargest(1)
        .to_frame()
        .reset_index(-1)["path"]
        .tolist()
    )
    return max_files
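The same grouping idea can also be expressed with just the standard library, which may be easier to follow: group files by the stem before the first underscore, then keep the newest file in each group. A sketch under those assumptions; latest_per_prefix is a hypothetical name, not a drop-in replacement:

```python
from collections import defaultdict
from pathlib import Path

def latest_per_prefix(path, pattern="*.csv"):
    # Group files by the part of the stem before the first underscore,
    # so sales_v1, sales_v2 and sales_v3 all land in the "sales" group.
    groups = defaultdict(list)
    for f in Path(path).rglob(pattern):
        groups[f.stem.split("_")[0]].append(f)
    # Keep only the most recently modified file from each group.
    return [max(files, key=lambda f: f.stat().st_mtime) for files in groups.values()]
```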
