Consolidating column data from a number of CSV files into a single CSV file

Question

I am new to Python, especially data handling. This is what I am trying to achieve:

I run a CIS test on several servers and produce a CSV file for each server (the file name is the same as the server name). The output files from all servers are copied to a central server. The output looks like this (truncated):

File1: dc1pp1v01.co.uk.csv
Description,Outcome,Result
1.1 Database Placement,/var/lib/mysql,PASSED
1.2 Use dedicated least privilaged account,mysql,PASSED
1.3 Diable MySQL history,Not Found,PASSED

File2: dc1pp2v01.co.uk.csv
Description,Outcome,Result
1.1 Database Placement,/var/lib/mysql,PASSED
1.2 Use dedicated least privilaged account,mysql,PASSED
1.3 Diable MySQL history,Not Found,PASSED

File..n: dc1pp1v02.co.uk.csv
Description,Outcome,Result
1.1 Database Placement,/var/lib/mysql,PASSED
1.2 Use dedicated least privilaged account,mysql,PASSED
1.3 Diable MySQL history,Found,FAILED

I want the output to look like this:

                                  Description dc1pp1v01 dc1pp2v01 dc1pp1v02
0                      1.1 Database Placement    PASSED    PASSED    PASSED
1  1.2 Use dedicated least privilaged account    PASSED    PASSED    PASSED
2                    1.3 Diable MySQL history    PASSED    PASSED    FAILED

To merge these files, I created another file that contains only the Description values under a two-column header, as below:

file: cis_report.csv
Description,Result
1.1 Database Placement,
1.2 Use dedicated least privilaged account,
1.3 Diable MySQL history,

I have written the code below to do a column-based merge:

import glob
import os
import pandas as pd 

col_list = ["Description","Result"]
path = "/Users/Python/Data"
all_files = glob.glob(os.path.join(path, "dc*.csv"))

cis_df = pd.read_csv("/Users/Python/Data/cis_report.csv")

for fl in all_files:
   d = pd.read_csv(fl, usecols=col_list)
   f = cis_df.merge(d, on='Description')
   cis_df = f.copy()
   
print(cis_df.head())

The output I am getting is:

                                  Description Result_x Result_y Result_x Result_y
0                      1.1 Database Placement      NaN   PASSED   PASSED   PASSED
1  1.2 Use dedicated least privilaged account      NaN   PASSED   PASSED   PASSED
2                    1.3 Diable MySQL history      NaN   PASSED   PASSED   FAILED

In my code, I am not sure how to use the file name as the header of each result column and how to get rid of the NaN column.

Also, is there a better way of achieving this output without using the dummy file (cis_report.csv)? Your help is much appreciated.

Answer 1

You need the DataFrame.pivot() function. The code below is commented and is a fully working example. Adapt it as needed.

import os
import pandas as pd

# Get all file names in a directory
# Use . to use current working directory or replace it with
# e.g. r'C:\Users\Dames\Desktop\csv_files'
file_names = os.listdir('.')

# Filter out all non .csv files
# You can skip this if you know that only .csv files will be in that folder
csv_file_names = [fn for fn in file_names if fn[-4:] == '.csv']

# This Loads a csv file into a dataframe and sets the Server column
def load_csv(file_name):
    df = pd.read_csv(file_name)
    df['Server'] = file_name.split('.')[0]
    return df

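# Concatenate all the CSV files after they are processed by load_csv
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
df = pd.concat([load_csv(fn) for fn in csv_file_names], ignore_index=True)

# Turn DataFrame into Pivot Table
# (pivot takes keyword arguments only in pandas 2.0 and later)
df = df.pivot(index='Description', columns='Server', values='Result')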

# Save DataFrame into CSV File
# If this script runs multiple times make sure that the final.csv is saved elsewhere!
# Or it will be read by the code above as an input file
df.to_csv('final.csv')

The final DataFrame looks like this

Server                                     dc1pp1v01 dc1pp1v02 dc1pp2v01
Description
1.1 Database Placement                        PASSED    PASSED    PASSED
1.2 Use dedicated least privilaged account    PASSED    PASSED    PASSED
1.3 Diable MySQL history                      PASSED    FAILED    PASSED

And the CSV file like this

Description,dc1pp1v01,dc1pp1v02,dc1pp2v01
1.1 Database Placement,PASSED,PASSED,PASSED
1.2 Use dedicated least privilaged account,PASSED,PASSED,PASSED
1.3 Diable MySQL history,PASSED,FAILED,PASSED
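
The warning above about final.csv being picked up as an input can be sidestepped by matching only the server files. The following is a minimal sketch, not part of the original answer: the function name build_report and the default pattern dc*.csv are assumptions for illustration, and the function expects at least one matching file (pd.concat raises an error on an empty list).

```python
import glob
import os

import pandas as pd


def build_report(directory, pattern="dc*.csv"):
    """Pivot the Result column of every matching CSV into one column per server."""
    frames = []
    for file_path in sorted(glob.glob(os.path.join(directory, pattern))):
        frame = pd.read_csv(file_path, usecols=["Description", "Result"])
        # Server name = file name up to the first dot, e.g. dc1pp1v01
        frame["Server"] = os.path.basename(file_path).split(".", 1)[0]
        frames.append(frame)
    combined = pd.concat(frames, ignore_index=True)
    return combined.pivot(index="Description", columns="Server", values="Result")
```

Calling build_report('/Users/Python/Data').to_csv('final.csv') would then write the same kind of file as above, while files such as final.csv or cis_report.csv in the same folder are ignored because they do not match the pattern.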

Answer 2

Use -

import glob
import os
import pandas as pd 

col_list = ["Description","Result"]
path = "/Users/Python/Data"
all_files = glob.glob(os.path.join(path, "dc*.csv"))

cis_df = pd.read_csv("/Users/Python/Data/cis_report.csv")
from functools import reduce

# Name each Result column after its server before merging, so merge() adds no _x/_y suffixes
frames = [pd.read_csv(i, usecols=col_list).rename(columns={'Result': os.path.basename(i).replace('.co.uk.csv', '')}) for i in all_files]
# Start from the Description column only, so the empty Result column of cis_report.csv is left out
df_final = reduce(lambda left, right: pd.merge(left, right, on='Description'), [cis_df[['Description']]] + frames)
print(df_final)

Output

    Description dc1pp1v01   dc1pp2v01   dc1pp1v02
0   1.1 Database Placement  PASSED  PASSED  PASSED
1   1.2 Use dedicated least privilaged account  PASSED  PASSED  PASSED
2   1.3 Diable MySQL history    PASSED  PASSED  FAILED

Answer 3

Finally, I managed to do it on my own. The solution below works for me, but I am sure there are more concise ways of doing it:

import glob
import os
import pandas as pd 
from functools import reduce

col_list = ["Description","Result"]
path = "/Users/Python/Data"
all_files = glob.glob(os.path.join(path, "dc*.csv"))

final_cols = ['Description']
for j in all_files:
    final_cols.append(os.path.basename(j).split('.',1)[0]) 

cis_df = pd.read_csv("/Users/Python/Data/cis_report.csv")

df_final = reduce(lambda left,right: pd.merge(left,right,on='Description'), [cis_df]+[pd.read_csv(i, usecols=col_list) for i in all_files])
df_final.rename(columns=dict(zip(df_final.columns,final_cols)),inplace=True)

print(df_final.head())

I made a small change to the file that holds the descriptions: I removed the Result field and the ',' at the end of each line.

file: cis_report.csv

Description
1.1 Database Placement
1.2 Use dedicated least privilaged account
1.3 Diable MySQL history

The output I get is:

                                  Description     dc1pp1v01     dc1pp2v01     dc2pp1v01
0                      1.1 Database Placement        PASSED        PASSED        PASSED
1  1.2 Use dedicated least privilaged account        PASSED        PASSED        PASSED
2                    1.3 Diable MySQL history        PASSED        PASSED        FAILED
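
A possible variation, given here only as a sketch: renaming each Result column to the server name before merging means pandas never has to invent _x/_y suffixes, which matters once more than a few files are merged, and the separate cis_report.csv is no longer needed. The function name merge_results is hypothetical, and the function expects a non-empty list of paths (reduce raises an error on an empty list).

```python
import os
from functools import reduce

import pandas as pd


def merge_results(file_paths):
    """Merge the Result column of each CSV on Description, one column per server."""
    frames = []
    for file_path in file_paths:
        # Server name = file name up to the first dot, e.g. dc1pp1v01
        server = os.path.basename(file_path).split(".", 1)[0]
        frame = pd.read_csv(file_path, usecols=["Description", "Result"])
        # Rename before merging so every frame contributes a uniquely named column
        frames.append(frame.rename(columns={"Result": server}))
    return reduce(lambda left, right: pd.merge(left, right, on="Description"), frames)
```

With the variables from the code above, merge_results(all_files) would return a frame whose columns are Description followed by one column per server. The default inner merge keeps only the descriptions that appear in every file.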

Answer 4

You already have a winner, nevertheless:

import csv
from pathlib import Path

path = Path('/Users/Python/Data')

# Read the reports and store the results in a 2-dim list
results = []
for file in path.glob('dc*.co.uk.csv'):
    with open(file, 'r', newline='') as fin:
        results += [[file.name.split('.')[0]]
                    + [row[2] for row in csv.reader(fin)][1:]]

# Read the row labels
with open(path / 'cis_report.csv', 'r', newline='') as fin:
    labels = [row[0] for row in csv.reader(fin)]

# Prepare the output
output = [[label] + [result[i] for result in results]
          for i, label in enumerate(labels)]

# Write the output
with open(path / 'cis_reports_merged.csv', 'w', newline='') as fout:
    csv.writer(fout, delimiter='\t').writerows(output)
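
The code above relies on every report listing its rows in the same order as cis_report.csv. If that cannot be guaranteed, the results can be keyed by description instead. The following is a rough sketch under that assumption: the function names collect_results and write_report are hypothetical, and it writes a comma-separated file rather than a tab-separated one.

```python
import csv
from pathlib import Path


def collect_results(directory, pattern="dc*.co.uk.csv"):
    """Return (servers, table) where table[description][server] is the Result value."""
    servers = []
    table = {}
    for file in sorted(Path(directory).glob(pattern)):
        server = file.name.split(".")[0]
        servers.append(server)
        with open(file, newline="") as fin:
            for row in csv.DictReader(fin):
                table.setdefault(row["Description"], {})[server] = row["Result"]
    return servers, table


def write_report(directory, out_name="cis_reports_merged.csv"):
    """Write one row per description and one column per server."""
    servers, table = collect_results(directory)
    with open(Path(directory) / out_name, "w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(["Description"] + servers)
        for description, results in table.items():
            # Leave the cell empty when a server has no row for this check
            writer.writerow([description] + [results.get(s, "") for s in servers])
```

Because the row labels come from the reports themselves, this version does not need cis_report.csv. Rows are written in the order in which each description is first seen.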

4 answers

solution1
3 ACCEPTED 2020-09-21 10:02:11

solution2
0 2020-09-17 18:05:52

solution3
0 2020-09-21 13:41:35

solution4
0 2020-09-25 22:57:41
