Optimizing Python script for 200K row iteration over pandas DataFrame

I have written the code below to iterate over a DataFrame (df_final_exheader), get the file paths, and read metadata from the corresponding images. At this point it feels slower than it needs to be, but I am having a hard time finding a solution (I have a non-programming background and might lack some basic knowledge).

If I am correct, one of the problems is the loop over the DataFrame using .iterrows. From the guides I have read, I think I need to implement vectorization, but I have no clue how to approach this. Could someone give me a nudge in the right direction?

import os

import pydicom as dicom  # assumed import; the question does not show it


def read_dicom_header(input_dir_im, df_final_exheader):
    i = 0
    # iterrows() yields (index, row) pairs, in that order
    for index, row in df_final_exheader.iterrows():
        ds = dicom.dcmread(os.path.join(input_dir_im, row['path']))

        # (group, element) tags of the header fields to extract
        header_dict = {'kVp': (0x18, 0x60), 'uAs': (0x18, 0x1153), 'EI': (0x18, 0x1411),
                       'DAP': (0x18, 0x115E), 'REX': (0x18, 0x1405)}

        for key, val in header_dict.items():
            try:
                df_final_exheader.loc[index, key] = ds[val].value
            except KeyError:
                # tag not present in this file
                df_final_exheader.loc[index, key] = 'na'

        # progress indicator every 500 rows
        i += 1
        if i % 500 == 0:
            print(index)

    return df_final_exheader

It appears that I have misinterpreted the relevance of vectorization in this case. The current version is running; when it has finished I'll tweak it and see how much improvement could have been made.
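Since each iteration opens a file, the loop is likely dominated by disk reads rather than by pandas itself, which is why vectorization has little to offer here. A minimal sketch of the cheaper adjustments: build the tag dict once, skip the pixel data, and collect results in plain Python objects instead of writing single cells with .loc. The names HEADER_TAGS, read_with_pydicom and collect_headers are hypothetical, and the default reader assumes pydicom 1.0 or later, where dcmread accepts stop_before_pixels.

```python
import os

# Tags from the question as (group, element) pairs, built once outside the loop.
HEADER_TAGS = {
    "kVp": (0x0018, 0x0060),
    "uAs": (0x0018, 0x1153),
    "EI": (0x0018, 0x1411),
    "DAP": (0x0018, 0x115E),
    "REX": (0x0018, 0x1405),
}


def read_with_pydicom(path):
    """Default reader (assumes pydicom >= 1.0 is installed)."""
    import pydicom  # imported lazily so the rest of the sketch has no hard dependency

    # stop_before_pixels=True skips the image data; only the header is parsed.
    return pydicom.dcmread(path, stop_before_pixels=True)


def collect_headers(input_dir_im, paths, reader=read_with_pydicom):
    """Return one dict of header values per path, in the order of `paths`."""
    records = []
    for rel_path in paths:
        ds = reader(os.path.join(input_dir_im, rel_path))
        record = {}
        for key, tag in HEADER_TAGS.items():
            try:
                record[key] = ds[tag].value
            except KeyError:
                record[key] = "na"  # tag not present in this file
        records.append(record)
    return records
```

The records could then be attached to the DataFrame in one step, for example by building pd.DataFrame(records, index=df_final_exheader.index) from collect_headers(input_dir_im, df_final_exheader['path']) and joining it to the original frame. The timings reported in the follow-up suggest the gain from this kind of change is modest, because reading the files remains the bottleneck.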

Follow-up: I finally had some time to run a few variants under timeit:

  • Original code: 2.654 s.

  • Dict moved as suggested by @roganjosh: 2.618 s.

  • Excluding pixel data as suggested by @Cammeron Riddle: 2.340 s.

To conclude: the original script ran for 48 hours; with these modifications, that could have been reduced to roughly 42 hours.
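The 42-hour estimate follows from scaling the 48-hour runtime by the ratio of the benchmark timings. A quick check, assuming the full run scales linearly with the timeit results (an assumption, since the benchmark sample is not described); the helper name projected_hours is hypothetical:

```python
# timeit results reported in the follow-up, in seconds
timings = {
    "original": 2.654,
    "dict moved": 2.618,
    "pixel data excluded": 2.340,
}
full_run_hours = 48  # runtime of the original script


def projected_hours(variant):
    """Scale the full runtime by the variant's timing relative to the original."""
    return full_run_hours * timings[variant] / timings["original"]


for name in timings:
    print(f"{name}: {projected_hours(name):.1f} h")
```

Under that assumption, moving the dict alone would save well under an hour (about 47.3 h), while excluding the pixel data brings the estimate to about 42.3 h.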

Again, thanks for the input, everyone! :)
