
Join multiple CSV files using Python pandas

I am trying to create a CSV file from multiple CSV files using Python pandas.

accreditation.csv :-

"pid","accreditation_body","score"
"25799","TAAC","4.5"
"25796","TAAC","5.6"
"25798","DAAC","5.7"

ref_university.csv :-

"id","pid","survery_year","end_year"
"1","25799","2018","2018"
"2","25797","2016","2018"

I want to create a new table by reading the instructions from table_structure.csv: join the two tables and rewrite accreditation.csv . The entry REFERENCES ref_university(id, survey_year) connects to ref_university.csv and inserts the id and survey_year column values by matching on the pid column.

table_structure.csv :-

table_name,attribute_name,attribute_type,Description
,,,
accreditation,accreditation_body,varchar,
,grading,varchar,
,pid,int4, "REFERENCES ref_university(id, survey_year)"
,score,float8,

The modified CSV file should look like this:

New accreditation.csv :-

"accreditation_body","grading","pid","id","survery_year","score"
"TAAC","","25799","1","2018","2018","4.5"
"TAAC","","25797","2","2016","2018","5.6"
"DAAC","","25798","","","","5.7"

I can read the CSV in pandas:

df = pd.read_csv("accreditation.csv")

But what is the recommended way to read the REFERENCES instruction and pick up the column values? If there is no matching value, the column should be blank. We cannot hardcode pid in the pandas call; we have to read table_structure.csv and, whenever there is a REFERENCES entry, pull in the mentioned columns. The tables should not be fully merged, just the specific columns added.
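For these two specific files, the non-dynamic baseline is a plain left merge on pid that keeps only the referenced columns. A minimal sketch (the sample data is inlined here so the snippet is self-contained):

```python
import pandas as pd
from io import StringIO

# inline copies of the two sample files from the question
accreditation = pd.read_csv(StringIO(
    'pid,accreditation_body,score\n'
    '25799,TAAC,4.5\n25796,TAAC,5.6\n25798,DAAC,5.7\n'))
ref_university = pd.read_csv(StringIO(
    'id,pid,survey_year,end_year\n'
    '1,25799,2018,2018\n2,25797,2016,2018\n'))

# left merge keeps every accreditation row; unmatched pids get NaN
out = accreditation.merge(ref_university[['pid', 'id', 'survey_year']],
                          on='pid', how='left')
print(out)
```

The question, of course, is how to derive the column list `['id', 'survey_year']` and the key `pid` from table_structure.csv instead of hardcoding them.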

A dynamic solution is possible, but not so easy:

df = pd.read_csv("table_structure.csv")

#remove rows that are entirely NaN
df = df.dropna(how='all')
#replace NaNs in table_name by forward filling
df['table_name'] = df['table_name'].ffill()

#keep one row per table_name (the row with a Description) and attach the list of all its columns
df = (df.dropna(subset=['Description'])
       .join(df.groupby('table_name')['attribute_name'].apply(list)
              .rename('cols'), 'table_name'))

#extract the referenced table name and the referenced column names
df['df1'] = df['Description'].str.extract(r'REFERENCES\s*(.*)\s*\(')
df['new_cols'] = df['Description'].str.extract(r'\(\s*(.*)\s*\)')
df['new_cols'] = df['new_cols'].str.split(', ')
#remove unnecessary columns
df = df.drop(['attribute_type','Description'], axis=1).set_index('table_name')
print (df)
              attribute_name                                       cols
table_name
accreditation            pid  [accreditation_body, grading, pid, score]   

                          df1           new_cols  
table_name                                        
accreditation  ref_university  [id, survey_year]  

#for selecting by name, create a dictionary of DataFrames
data = {'accreditation' : pd.read_csv("accreditation.csv"), 
        'ref_university': pd.read_csv("ref_university.csv")}

#select by index
v = df.loc['accreditation']
print (v)
attribute_name                                          pid
cols              [accreditation_body, grading, pid, score]
df1                                          ref_university
new_cols                                  [id, survey_year]
Name: accreditation, dtype: object

Select from the dictionary using the Series v:

df = pd.merge(data[v.name], 
               data[v['df1']][v['new_cols'] + [v['attribute_name']]], 
               on=v['attribute_name'], 
               how='left')

is converted to:

df = pd.merge(data['accreditation'], 
               data['ref_university'][['id', 'survey_year'] + ['pid']], 
               on='pid', 
               how='left')

and returns:

print (df)
     pid accreditation_body  score   id  survey_year
0  25799               TAAC    4.5  1.0       2018.0
1  25796               TAAC    5.6  NaN          NaN
2  25798               DAAC    5.7  NaN          NaN

Last, add the missing columns by Index.union and reindex:

df = df.reindex(columns=df.columns.union(v['cols']))
print (df)
  accreditation_body  grading   id    pid  score  survey_year
0               TAAC      NaN  1.0  25799    4.5       2018.0
1               TAAC      NaN  NaN  25796    5.6          NaN
2               DAAC      NaN  NaN  25798    5.7          NaN
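To finish the task as stated in the question, the result can be put into the desired column order and written back to accreditation.csv . A minimal sketch (the merged frame is rebuilt inline so the snippet is self-contained, and the column order is hardcoded here, though in the dynamic solution it can be derived from v):

```python
import pandas as pd

# merged result from the steps above, rebuilt inline for a self-contained demo
df = pd.DataFrame({
    'accreditation_body': ['TAAC', 'TAAC', 'DAAC'],
    'grading': [float('nan')] * 3,
    'id': [1.0, float('nan'), float('nan')],
    'pid': [25799, 25796, 25798],
    'score': [4.5, 5.6, 5.7],
    'survey_year': [2018.0, float('nan'), float('nan')],
})

# desired output order: table columns first, referenced columns after pid
order = ['accreditation_body', 'grading', 'pid', 'id', 'survey_year', 'score']
df = df[order]
# to_csv writes NaN as empty fields by default, giving the blank columns
df.to_csv('accreditation.csv', index=False)
```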

Here is working code for combining all the CSV files in a folder; try it. When the files are huge, set low_memory=False in pd.read_csv().

import pandas as pd
import glob

# path to the folder containing the data files
path = r"C:\Users\data_folder"
# read all files with the .csv extension
filenames = glob.glob(path + r"\*.csv")
print('File names:', filenames)

df = pd.DataFrame()
# iterate over the files and concat them
for file in filenames:
    temp = pd.read_csv(file, low_memory=False)
    df = pd.concat([df, temp], axis=1)  # set axis=0 if you want to join rows
df.to_csv('output.csv')
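The loop above grows df inside the loop and joins column-wise. For the more common case of stacking same-schema CSVs row-wise, it is faster to collect the frames in a list and concatenate once. A sketch (concat_csv_folder is a helper name of my own, not from the answer above):

```python
import glob
import os
import pandas as pd

def concat_csv_folder(path, out='output.csv'):
    """Stack every *.csv in `path` row-wise and write the result to `out`."""
    filenames = sorted(glob.glob(os.path.join(path, '*.csv')))
    # read all files first, then concatenate once
    # (avoids repeated copying from calling pd.concat inside the loop)
    frames = [pd.read_csv(f, low_memory=False) for f in filenames]
    if frames:
        pd.concat(frames, ignore_index=True).to_csv(out, index=False)
    return len(filenames)
```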
