
Pandas merge is very slow on large data set

I have a pipeline where I receive about 4,000 HL7 files that I have to convert to CSV. Each file contains many HL7 segments, and each OBX segment carries one column (COL1, COL2, ..., COL100), its value, and a timestamp. Each file may have 50 to 100 OBX segments, and therefore 50 to 100 columns. I loop through each segment, create a pandas DataFrame for it, and merge it in: if the segment's time already exists in the DataFrame, the new column is appended to that row; if the time is not there yet, a new row is created. Finally, I merge the DataFrames of all the files together. This takes a lot of time, and I observed that the final merge (in the function process_hl7msg) is the slowest part.

def parse_segments(segments):
    df_num = pd.DataFrame()
    for segment in segments:
        obx_timestamp = get_obx_timestamp(segment)          # hypothetical helper
        obs_identifier = get_obs_identifier(segment)        # hypothetical helper
        observation_value = get_observation_value(segment)  # hypothetical helper
        device = get_device_info(segment)                   # hypothetical helper
        # one-row frame for this segment
        df = pd.DataFrame([{"Time": obx_timestamp, obs_identifier: observation_value, "device": device}])
        if df_num.empty:
            df_num = df
        else:
            # outer merge, so a new Time/device pair adds a row instead of being dropped
            df_num = pd.merge(df_num, df, on=["Time", "device"], how="outer")
    return df_num


def process_hl7msg():
    df_list = []
    for file_name in file_list:
        segments = get_segments(file_name)  # hypothetical helper
        df_list.append(parse_segments(segments))

    df = pd.DataFrame()
    for df1 in df_list:
        if df.empty:
            df = df1
        else:
            df = pd.merge(df, df1, on=["Time", "device"], how='outer')
    return df

Below is an example of the parsed HL7 files and the expected output.

File 1  
Time                       EVENT device  COL1  COL2   
20200420232613.6200+0530   start device1 1.0   2.3  
20200420232614.6200+0530         device1 4.4   1.7  

File 2   
Time                      EVENT  device  COL3   COL4  COL5   
20200420232613.6200+0530         device1  44     66    7
20200420232614.6200+0530         device2  1.0    2.3    0.5   
20200420232615.6200+0530  pause  device3  4.4    1.7    0.9

File 3
Time                       device   COL1  COL2
20200420232613.6200+0530   device2  1.0   2.3
...
File 4000



**Expected Output:**    
Time                      EVENT device   COL1  COL2  COL3   COL4  COL5   
20200420232613.6200+0530  start  device1  1.0   2.3    44     66    7
20200420232613.6200+0530         device2  1.0   2.3  
20200420232614.6200+0530  end    device1         4.4   1.7  
20200420232615.6200+0530  pause  device2               1.0    2.3    0.5   
20200420232616.6200+0530         device3               4.4    1.7    0.9

Any suggestion to optimize this would be appreciated.

UPDATE1:

obx_timestamp = 20200420232616.6200+0530
obs_identifier = any one or more values from the list (COL1, COL2, ..., COL10)
observation_value = any numeric value
device = any one from the list (device1, device2, device3, device4, device5)

UPDATE2:
Added an EVENT column:

t3 = [{'Time': 100, 'device': 'device1', 'EVENT': 'event', 'obx_idx': 'MDC1', 'value': 1.2},
      {'Time': 100, 'device': 'device1', 'obx_idx': 'COL2', 'value': 4.5},
      {'Time': 100, 'device': 'device1', 'obx_idx': 'COL4', 'value': 4.5},
      {'Time': 200, 'device': 'device3', 'obx_idx': 'COL2', 'value': 2.5},
      {'Time': 200, 'device': 'device3', 'obx_idx': 'COL3', 'value': 2.5}]
df = pd.DataFrame.from_records(t3, index=['Time', 'device', 'EVENT', 'obx_idx'])['value'].unstack()
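As a sanity check, here is a minimal self-contained sketch of that unstack (the sample values are made up). One caveat: records missing the EVENT key get NaN in that index level, so they land on a different row than records with the same Time and device but an explicit EVENT; the sketch below therefore gives every record an explicit value:

import pandas as pd

t3 = [{'Time': 100, 'device': 'device1', 'EVENT': 'start', 'obx_idx': 'COL1', 'value': 1.2},
      {'Time': 100, 'device': 'device1', 'EVENT': 'start', 'obx_idx': 'COL2', 'value': 4.5},
      {'Time': 200, 'device': 'device3', 'EVENT': '',      'obx_idx': 'COL2', 'value': 2.5}]

df = pd.DataFrame.from_records(t3, index=['Time', 'device', 'EVENT', 'obx_idx'])['value'].unstack()

# flatten the (Time, device, EVENT) MultiIndex back into ordinary columns
print(df.reset_index())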

Try setting the index on both dataframes and doing a join:

df.set_index(["Time", "device"], inplace=True)
df1.set_index(["Time", "device"], inplace=True)
df.join(df1, how='outer')
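As a minimal, self-contained sketch (the toy frames below are made up, standing in for two parsed files):

import pandas as pd

# made-up stand-ins for two parsed files that share some (Time, device) pairs
df = pd.DataFrame({"Time": [100, 200], "device": ["device1", "device2"],
                   "COL1": [1.0, 4.4], "COL2": [2.3, 1.7]})
df1 = pd.DataFrame({"Time": [100, 300], "device": ["device1", "device3"],
                    "COL3": [44.0, 7.0]})

df.set_index(["Time", "device"], inplace=True)
df1.set_index(["Time", "device"], inplace=True)

# outer join keeps rows from either frame; missing cells become NaN
print(df.join(df1, how='outer'))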

However, based on the expected output, you can also try doing a concat on axis=1:

df.set_index(["Time", "device"], inplace=True)
df1.set_index(["Time", "device"], inplace=True)
df_f = pd.concat([df, df1], axis=1)
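For frames with unique (Time, device) indexes, pd.concat with axis=1 defaults to an outer join on the index, so it gives the same result as the chained joins while combining the whole list in one call; just note that the alignment can fail if the index is not unique within each frame.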

Here is how you can change your functions. The idea is to not create DataFrames on every loop iteration in parse_segments, but only once at the end, using from_records with the index levels specified so that you can call unstack right after, and then to use pd.concat with axis=1 in process_hl7msg. Try:

def parse_segments(segments):
    l_seg = []
    for segment in segments:
        obx_timestamp = get_obx_timestamp(segment)          # hypothetical helper
        obs_identifier = get_obs_identifier(segment)        # hypothetical helper
        observation_value = get_observation_value(segment)  # hypothetical helper
        device = get_device_info(segment)                   # hypothetical helper
        # append a plain dictionary to a list; no DataFrame created here
        l_seg.append({'time': obx_timestamp, 'device': device,
                      'obs_idx': obs_identifier, 'value': observation_value})
    # create the dataframe once with from_records, specify the index levels,
    # then unstack obs_idx into columns
    return pd.DataFrame.from_records(l_seg, index=['time', 'device', 'obs_idx'])['value']\
                       .unstack()

def process_hl7msg():
    df_list = []
    for file_name in file_list:
        segments = get_segments(file_name)  # hypothetical helper
        df_list.append(parse_segments(segments))
    # use concat to align all frames on the (time, device) index at once
    return pd.concat(df_list, axis=1).reset_index()
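To make the flow concrete, here is a self-contained sketch with made-up (timestamp, device, column, value) tuples standing in for parsed HL7 segments:

import pandas as pd

file1 = [(100, 'device1', 'COL1', 1.0), (100, 'device1', 'COL2', 2.3),
         (200, 'device1', 'COL1', 4.4), (200, 'device1', 'COL2', 1.7)]
file2 = [(100, 'device1', 'COL3', 44.0), (200, 'device2', 'COL3', 1.0)]

def parse_segments(segments):
    l_seg = [{'time': t, 'device': d, 'obs_idx': c, 'value': v}
             for t, d, c, v in segments]
    return pd.DataFrame.from_records(l_seg, index=['time', 'device', 'obs_idx'])['value'].unstack()

# concat aligns both frames on the shared (time, device) index
print(pd.concat([parse_segments(f) for f in (file1, file2)], axis=1).reset_index())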

If the data is not too big (I am not sure about this data source), you could even do it all at once:

def process_hl7msg():
    l_values = []
    for file_name in file_list:
        segments = get_segments(file_name)  # hypothetical helper
        # process segments
        for segment in segments:
            obx_timestamp = get_obx_timestamp(segment)          # hypothetical helper
            obs_identifier = get_obs_identifier(segment)        # hypothetical helper
            observation_value = get_observation_value(segment)  # hypothetical helper
            device = get_device_info(segment)                   # hypothetical helper
            # append a plain dictionary to a list
            l_values.append({'time': obx_timestamp, 'device': device,
                             'obs_idx': obs_identifier, 'value': observation_value})
    # build one dataframe for all files at once, then unstack obs_idx into columns
    return pd.DataFrame.from_records(l_values, index=['time', 'device', 'obs_idx'])['value']\
                       .unstack()
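One caveat: unstack raises ValueError ("Index contains duplicate entries, cannot reshape") if the same (time, device, obs_idx) triple occurs more than once, for example when two files report the same column at the same timestamp. If that can happen in your data, a pivot_table with an explicit aggregation is a safer variant (the aggfunc choice here is an assumption; pick whatever suits your data):

import pandas as pd

def process_hl7msg_safe(l_values):
    df = pd.DataFrame.from_records(l_values)
    # 'first' keeps the first observation when a (time, device, obs_idx)
    # triple is duplicated; swap in 'mean', 'last', etc. as appropriate
    return df.pivot_table(index=['time', 'device'], columns='obs_idx',
                          values='value', aggfunc='first').reset_index()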
