
How to merge 2000 CSV files saved in different subfolders within the same main folder

Hi everyone, I would like to merge 2000 CSV files into one. They sit in 2000 subfolders; each subfolder contains three CSV files with different names, so I need to select only one CSV from each folder.

I know how to merge a bunch of CSV files when they are all in the same folder:

import pandas as pd
import glob

path = r'Total_csvs' 
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
frame.to_csv('Total.csv',index=False)

But my case with 2000 CSV files is structured quite differently.

The folder structure is: one main folder containing 2000 subfolders. Each subfolder holds multiple CSV files, and I need to select only one of them. Finally, I want to concatenate all 2000 selected CSV files.

As for naming conventions: all the subfolders have different names, but the CSV file I need has the same name as the subfolder it is in.

Any suggestions or sample code (how to read 2000 CSV files from subfolders) would be helpful.

Thanks in advance

We can iterate over every subfolder, build expected_csv_path, and check whether it exists. If it does, we read it and append the resulting DataFrame to the list li.

Try the following:

import pandas as pd
import os

path = r'Total_csvs'
li = []
for f in os.listdir(path):
    expected_csv_path = os.path.join(path, f, f + '.csv')
    csv_exists = os.path.isfile(expected_csv_path)
    if csv_exists:
        df = pd.read_csv(expected_csv_path, index_col=None, header=0)
        li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True, sort=False)
frame.to_csv('Total.csv',index=False)
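
With 2000 subfolders it can help to know which ones were silently skipped because the expected file was missing. The following is a minimal sketch of that idea using only the standard library; the function name collect_expected_csvs is hypothetical, and it assumes, as the question states, that the wanted CSV carries the same name as its subfolder.

```python
import os

def collect_expected_csvs(root):
    """Return (found, missing): CSV paths named after their subfolder,
    and the names of subfolders that lack such a file."""
    found, missing = [], []
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if not os.path.isdir(folder):
            continue  # ignore stray files sitting directly in the main folder
        expected = os.path.join(folder, name + ".csv")
        if os.path.isfile(expected):
            found.append(expected)
        else:
            missing.append(name)
    return found, missing
```

Calling collect_expected_csvs('Total_csvs') would give a list of paths that can be passed to pd.read_csv one by one, plus a list of subfolder names worth inspecting by hand.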

You can loop through all the subfolders using os.listdir.

Since the CSV filename is the same as the subfolder name, simply use the subfolder name to construct the full path name.

import os
import pandas as pd

path = "Total_csvs"
folders = os.listdir(path)

li = []

for folder in folders:
    # The CSV we want has the same name as its subfolder
    selected_csv = folder
    # Include the main folder, otherwise the path is relative to the wrong place
    filename = os.path.join(path, folder, selected_csv + ".csv")

    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
frame.to_csv('Total.csv',index=False)

You can do it without joining paths:

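import pathlib
import pandas as pd

li = []
lastparent = None
for ff in pathlib.Path("Total_csvs").rglob("*.csv"):  # recursive glob
    print(ff)
    if ff.parent != lastparent:  # process only the first file seen in each directory
        lastparent = ff.parent
        # same read_csv arguments as in the question
        df = pd.read_csv(ff, index_col=None, header=0)
        li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)
frame.to_csv('Total.csv', index=False)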
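
Note that the loop above takes whichever CSV the recursive glob happens to yield first in each folder, and that order depends on the file system. If "one file per folder" is all that matters, a deterministic variant could sort the paths first. This is only a sketch under that assumption; the function name first_csv_per_folder is made up for illustration, and it does not check that the chosen file is named after its folder.

```python
from pathlib import Path

def first_csv_per_folder(root):
    """Map each folder under root to its alphabetically first CSV file."""
    chosen = {}
    for p in sorted(Path(root).rglob("*.csv")):
        # setdefault keeps only the first path seen for each parent folder
        chosen.setdefault(p.parent, p)
    return chosen
```

The values of the returned dictionary are the paths to read, for example with pd.read_csv.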

If you are using Python 3.5 or newer, you can use glob.glob recursively, like this:

import glob
path = r'Total_csvs'
all_csv = glob.glob(path+"/**/*.csv",recursive=True)

Now all_csv is a list of relative paths to every *.csv file inside Total_csvs, its subdirectories, their subdirectories, and so on. For the sake of example, let's assume that all_csv is now:

all_csv = ['Total_csvs/abc/abc.csv','Total_csvs/abc/another.csv']

So we need to keep only the files whose names correspond to the directory they reside in. This can be done as follows:

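import os
def check(x):
    # keep a path only if the file name matches its parent directory name;
    # os.path functions handle both "/" and the platform's own separator
    directory = os.path.basename(os.path.dirname(x))
    filename = os.path.basename(x)
    return directory+'.csv'==filename
all_csv = [i for i in all_csv if check(i)]
print(all_csv) #prints ['Total_csvs/abc/abc.csv']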

Now all_csv is a list of paths to all the .csv files you are looking for, and you can use it the same way as all_files in the "flat" (non-recursive) case.
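
The same "file name equals folder name" filter can probably be written more compactly with pathlib, which compares path components directly instead of splitting strings. A minimal sketch, where the function name csvs_named_like_parent is an assumption made for this example:

```python
from pathlib import Path

def csvs_named_like_parent(root):
    """Return sorted paths of CSVs under root whose stem equals their folder's name."""
    return sorted(
        p for p in Path(root).rglob("*.csv")
        if p.stem == p.parent.name  # e.g. abc/abc.csv matches, abc/another.csv does not
    )
```

The resulting Path objects can be passed straight to pd.read_csv.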
