Extract the date from a CSV file name and load it into a column of a Hive table (Python, pandas, Spark)

I need help with a requirement: extract the date from each CSV file name and load it into a column.

Input files: ABC_XYZ_EXPORT-20170101.csv, ABC_XYZ_EXPORT-20170102.csv

I am able to read both files in a loop, but the date is extracted just once and is the same for all records from the two files. I am not sure, but this may well be because my loop is incorrect. Please help. Thanks in advance.

for input_file in allFiles:
    exc_date = input_file
    exc_date = re.sub('ABC_XYZ_EXPORT-+([0-9]+)[.]csv$', r'\1', exc_date)
    #print(exc_date)
    #PD pandas dataframe
    for d in exc_date:
        csv_input = pd.concat((pd.read_csv(f) for f in allFiles))
        csv_input['Load_date'] = exc_date
        csv_input.to_csv('outputpd.csv')
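
The extraction step can be checked on its own before any file is read. The sketch below is only an illustration: `extract_date` is a hypothetical helper name, and it assumes the date always appears as eight digits directly before `.csv`. It uses `re.search` rather than `re.sub`, so a file name that does not match is reported instead of being passed through unchanged.

```python
import re
from datetime import datetime

# Assumed pattern: "<anything>-YYYYMMDD.csv"; adjust if the real names differ.
DATE_PATTERN = re.compile(r"-(\d{8})\.csv$")


def extract_date(file_name):
    """Return the YYYYMMDD part of file_name as a string (hypothetical helper)."""
    match = DATE_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"no date found in file name: {file_name!r}")
    # strptime raises ValueError if the digits are not a real calendar date.
    datetime.strptime(match.group(1), "%Y%m%d")
    return match.group(1)


if __name__ == "__main__":
    for name in ["ABC_XYZ_EXPORT-20170101.csv", "ABC_XYZ_EXPORT-20170102.csv"]:
        print(name, "->", extract_date(name))
```

Each file name yields its own date here, which suggests the regular expression is not what makes the date static; the way the loop applies the date to the data is the more likely cause.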

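IIUC, you need to read the data from multiple files and add a Load_date column to each file's rows, holding the date taken from that file's name. In your loop, every pass re-reads all the files with pd.concat and sets Load_date on the whole combined frame, so the output that survives carries only the last file's date. Read each file on its own, tag it with its own date, and combine the results afterwards: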

import re
import pandas as pd

allFiles = ['ABC_XYZ_EXPORT-20170101.csv', 'ABC_XYZ_EXPORT-20170102.csv']

frames = []  # one DataFrame per input file

for input_file in allFiles:
    # Pull the date out of this file's name
    exc_date = re.sub('ABC_XYZ_EXPORT-+([0-9]+)[.]csv$', r'\1', input_file)
    df = pd.read_csv(input_file)
    df['Load_date'] = exc_date  # add the date for this file alone
    frames.append(df)  # keep it; DataFrame.append returned a copy and was removed in pandas 2.0

# Combine all files and write a single output file
csv_input = pd.concat(frames, ignore_index=True)
csv_input.to_csv('outputpd.csv')
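
The per-file approach can also be wrapped in a reusable function that fails loudly on unexpected file names and stores the date as a real datetime. This is a minimal sketch, not the answerer's code: `load_with_dates` is a hypothetical name, the pattern assumes the `-YYYYMMDD.csv` suffix shown in the question, and converting `Load_date` with `pd.to_datetime` is an optional extra.

```python
import re

import pandas as pd

# Assumed pattern: "<anything>-YYYYMMDD.csv", as in the question's file names.
DATE_PATTERN = re.compile(r"-(\d{8})\.csv$")


def load_with_dates(paths):
    """Read each CSV, tag its rows with the date from its own name, and combine.

    Hypothetical helper; `paths` must contain at least one file.
    """
    frames = []
    for path in paths:
        match = DATE_PATTERN.search(str(path))
        if match is None:
            raise ValueError(f"no date found in file name: {path!r}")
        frame = pd.read_csv(path)
        # The scalar is broadcast to every row of this file only.
        frame["Load_date"] = pd.to_datetime(match.group(1), format="%Y%m%d")
        frames.append(frame)
    # ignore_index gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)
```

Calling `load_with_dates(allFiles).to_csv('outputpd.csv', index=False)` would then write the combined data without the extra index column.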
