简体   繁体   中英

Elegant way to read multiple csv files and convert them to series objects

I have 5 csv files to read but the column names and operations to do using that csv files remain the same.

For example, in my code I do the 3 steps below (but repeat it for all 5 files)

a) I read csv and store them in a dataframe

b) convert the dataframe contents to upper case

c) drop duplicates using subset = sourceName column

d) create a series object

test_map = pd.read_csv('test_map.csv')
test_map = test_map .apply(lambda x: x.astype(str).str.upper())
test_map = test_map .drop_duplicates(subset=['sourceName'])
test_Id_map = test_map.set_index(['sourceName'])['Id']

val_op_map = pd.read_csv('op_val_map.csv')
val_op_map = val_op_map.apply(lambda x: x.astype(str).str.upper())
val_op_map = val_op_map.drop_duplicates(subset=['sourceName'])
val_op_id_map = val_op_map.set_index(['sourceName'])['Id']

test2_map = pd.read_csv('test2_map.csv')
test2_map = test2_map.apply(lambda x: x.astype(str).str.upper())
test2_map = test2_map.drop_duplicates(subset=['sourceName'])
test2_id_map = test2_map.set_index(['sourceName'])['Id']

visit_map = pd.read_excel('visit_map.xlsx')
visit_map = visit_map.apply(lambda x: x.astype(str).str.upper())
visit_map = visit_map.drop_duplicates(subset=['sourceName'])
visit_Id_map = visit_map.set_index(['sourceName'])['Id']

demo_map = pd.read_excel('demo_map.xlsx')
demo_map = demo_map.apply(lambda x: x.astype(str).str.upper())
demo_map = demo_map.drop_duplicates(subset=['sourceName'])
demo_id_map = demo_map.set_index(['sourceName'])['Id']

I feel repeating the same line of code multiple times for different files is neither efficient nor elegant. Is there any other better approach to write this?

I finally would like to have 5 series objects as output which are test_Id_map , demo_id_map , visit_id_map , test2_id_map , val_op_id_map

assuming you want a list of processed dataframes along with their id maps, you could do this:

filenames = ['test_map.csv','op_val_map.csv', 'test2_map.csv','visit_map.xlsx','demo_map.xlsx']
def preprocessing(fname):
  df= pd.read_csv(fname)
  df= df.apply(lambda x: x.astype(str).str.upper())
  df= df.drop_duplicates(subset=['sourceName'])
  df_Id_map = df.set_index(['sourceName'])['Id']
  return df, df_Id_map

processed = [prerprocessing(fname) for fname in filenames]

Edit: if you want to handle different file types for input you could add it as if condition to your preprocessing function, for example:

def preprocessing(fname):
  if fname.endswith('.csv'):
    df= pd.read_csv(fname)
    df= df.apply(lambda x: x.astype(str).str.upper())
    df= df.drop_duplicates(subset=['sourceName'])
    df_Id_map = df.set_index(['sourceName'])['Id']
  elif fname.endswith('.xlsx'):
    #do your things
  return df, df_Id_map

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM