
Merge multiple CSV files with the same name from 10 different subdirectories

I have 10 different subdirectories that contain files with the same names (20 files per directory). Column 0 is the index column in each file.

For example:

    **DIRECTORY  A**
    - data_20170101_k.csv
    - data_20170102_k.csv
    - data_20170103_k.csv
    - data_20170104_k.csv
    - data_20170105_k.csv
    .....
    .....
    - data_20170120_k.csv  



    **DIRECTORY  B**
    - data_20170101_k.csv
    - data_20170102_k.csv
    - data_20170103_k.csv
    - data_20170104_k.csv
    - data_20170105_k.csv
    .....
    .....
    - data_20170120_k.csv                




    **DIRECTORY  C**
    - data_20170101_k.csv
    - data_20170102_k.csv
    - data_20170103_k.csv
    - data_20170104_k.csv
    - data_20170105_k.csv
    .....
    .....
    - data_20170120_k.csv                


   Each of the above files contains 6 columns, with column 0 as the index
   (index_col = 0) and NO column headers.

   **DIRECTORY  FILES_MERGED**
   - data_20170101_k.csv
   - data_20170102_k.csv
   - data_20170103_k.csv
   - data_20170104_k.csv
   - data_20170105_k.csv
   .....
   .....
   - data_20170120_k.csv

I want to merge all the files with the SAME NAME from EACH subdirectory into one file with that same name, and save the new file in a NEW subdirectory, e.g. DIRECTORY FILES_MERGED, using column 0 as the index. Each merged file should have a single index column, followed by columns 1, 2, 3, 4, 5 from every same-named file in each directory.

I have read a CSV file into a pandas DataFrame:

   df= pd.read_csv(filename, sep=",", header = None, usecols=[0, 1, 2, 3, 4, 5])

Here is the format of my original DataFrame:

             0       1        2        3        4     5
   0  1451606820  1.0862  1.08630  1.08578  1.08578  25
   1  1451608800  1.0862  1.08630  1.08578  1.08610  10
   2  1451608860  1.0862  1.08620  1.08578  1.08578  16
   3  1451610180  1.0862  1.08630  1.08578  1.08578  27
   4  1451610480  1.0858  1.08590  1.08560  1.08578  21
   5  1451610540  1.0857  1.08578  1.08570  1.08578   2
   6  1451610600  1.0857  1.08578  1.08570  1.08578   2
   7  1451610720  1.0857  1.08578  1.08570  1.08578   2
   8  1451610780  1.0857  1.08578  1.08570  1.08578   2

   Column '0' = Datetime in Epoch time 
   Columns 1,2,3,4,5 are values 
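
As a small sketch of the layout described above, the snippet below reads headerless CSV text with column 0 as the index and converts the epoch values to timestamps. The helper name `read_indexed_csv` is hypothetical, and treating the epoch values as seconds is an assumption based on the sample rows.

```python
import io

import pandas as pd


def read_indexed_csv(source):
    """Read a headerless CSV and use column 0 as the index."""
    df = pd.read_csv(source, header=None, index_col=0)
    # Assumption: the index holds Unix epoch seconds.
    df.index = pd.to_datetime(df.index, unit="s")
    return df


# Two rows copied from the sample DataFrame above.
sample = io.StringIO(
    "1451606820,1.0862,1.08630,1.08578,1.08578,25\n"
    "1451608800,1.0862,1.08630,1.08578,1.08610,10\n"
)
df = read_indexed_csv(sample)
print(df)
```

With `index_col=0`, the remaining columns keep their integer labels 1 to 5, so they can be selected as `df[1]`, `df[2]`, and so on.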

There are many ways to do this. Staying in pandas, I did the following.

With the file structure

root/  
├── dir1/  
│   ├── data_20170101_k.csv  
│   ├── data_20170102_k.csv  
│   └── ...  
├── dir2/  
│   ├── data_20170101_k.csv  
│   ├── data_20170102_k.csv  
│   └── ...  
└── ...

This code will work. It is a little verbose for the sake of explanation, but you can shorten it in your own implementation.

import glob
import pandas as pd

CONCAT_DIR = "/FILES_CONCAT/"

# Use the glob module to return all CSV files one level below the root directory. Create a DF from this.
files = pd.DataFrame(glob.glob("root/*/*.csv"), columns=["fullpath"])

#    fullpath
# 0  root\dir1\data_20170101_k.csv
# 1  root\dir1\data_20170102_k.csv
# 2  root\dir2\data_20170101_k.csv
# 3  root\dir2\data_20170102_k.csv

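# Split the full path into directory and filename.
# This assumes Windows-style "\" separators, as in the paths shown above.
# n is passed by keyword because recent pandas versions no longer accept it positionally.
files_split = files['fullpath'].str.rsplit("\\", n=1, expand=True).rename(columns={0: 'path', 1: 'filename'})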

#    path       filename
# 0  root\dir1  data_20170101_k.csv
# 1  root\dir1  data_20170102_k.csv
# 2  root\dir2  data_20170101_k.csv
# 3  root\dir2  data_20170102_k.csv

# Join these into one DataFrame
files = files.join(files_split)

#    fullpath                       path        filename
# 0  root\dir1\data_20170101_k.csv  root\dir1   data_20170101_k.csv
# 1  root\dir1\data_20170102_k.csv  root\dir1   data_20170102_k.csv
# 2  root\dir2\data_20170101_k.csv  root\dir2   data_20170101_k.csv
# 3  root\dir2\data_20170102_k.csv  root\dir2   data_20170102_k.csv

# Iterate over unique filenames; read CSVs, concat DFs, save file
for f in files['filename'].unique():
    paths = files[files['filename'] == f]['fullpath']  # Full paths of every file that shares this filename
    dfs = [pd.read_csv(path, header=None) for path in paths]  # One dataframe per CSV file path
    concat_df = pd.concat(dfs)  # Concat dataframes into one (this stacks the rows)
    # Save the dataframe; CONCAT_DIR must already exist.
    # The inputs have no header row, so none is written, and the generated row index is dropped.
    concat_df.to_csv(CONCAT_DIR + f, header=False, index=False)
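
The loop above stacks the rows of same-named files. The question asks for one shared index column with columns 1 to 5 from each directory placed side by side, which could be sketched as follows. This is a variation under stated assumptions rather than the code from the answer: the function name `merge_same_named_csvs` is hypothetical, every subdirectory is assumed to sit one level below the root, and the index values within each file are assumed to be unique.

```python
from pathlib import Path

import pandas as pd


def merge_same_named_csvs(root, out_dir):
    """Merge same-named CSVs from each subdirectory of `root` column-wise."""
    root, out_dir = Path(root), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group the file paths by file name, skipping the output directory
    # in case it lives under `root` and holds results from an earlier run.
    groups = {}
    for path in sorted(root.glob("*/*.csv")):
        if path.parent.resolve() == out_dir.resolve():
            continue
        groups.setdefault(path.name, []).append(path)

    for name, paths in groups.items():
        # Column 0 becomes the index; the files have no header row.
        frames = [pd.read_csv(p, header=None, index_col=0) for p in paths]
        # axis=1 aligns rows on the index and places the columns side by side;
        # index values missing from a file produce empty cells.
        merged = pd.concat(frames, axis=1).sort_index()
        merged.to_csv(out_dir / name, header=False)
    return sorted(groups)
```

A call such as `merge_same_named_csvs("root", "root/FILES_MERGED")` would write one merged file per file name. Because no header is written, the columns can only be told apart by position, which follows the sorted order of the subdirectory paths.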

If you only need to concatenate every CSV file into a single file, this can be done much more simply in the shell:

find . -name "*.csv" | xargs cat > mergedCSV

(Note: don't give the output file a .csv extension while the command runs, because find would then match the output file itself. Once the command has finished, the file can be renamed to .csv.)
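
The pipeline above puts every CSV into one file. If one output file per file name is wanted, the same plain concatenation can be sketched in Python without pandas. The function name `cat_by_name` is hypothetical, and the subdirectories are assumed to sit one level below the root. As with `cat`, the rows of adjacent files run together if an input file does not end with a newline.

```python
import shutil
from pathlib import Path


def cat_by_name(root, out_dir):
    """Concatenate same-named CSVs from each subdirectory, byte for byte."""
    root, out_dir = Path(root), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Collect the input paths first so the output files are never re-read.
    groups = {}
    for path in sorted(root.glob("*/*.csv")):
        if path.parent.resolve() != out_dir.resolve():
            groups.setdefault(path.name, []).append(path)

    for name, paths in groups.items():
        with open(out_dir / name, "wb") as dst:
            for path in paths:
                with open(path, "rb") as src:
                    shutil.copyfileobj(src, dst)  # append this file's bytes
    return sorted(groups)
```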
