I have a pipeline that receives about 4,000 HL7 files, which I have to convert to CSV. Each file contains many HL7 segments, and each OBX segment carries one column (COL1, COL2, ... COL100) together with its value and a timestamp. Each file may have 50 to 100 OBX segments, and therefore as many columns. I loop through each segment, create a pandas DataFrame per observation, and append it: if the observation's time already exists in the DataFrame, the column is appended to that row; if the time is not there, a new row is created. Finally, I merge the DataFrames of all the files. This takes a lot of time, and I observed that the final merge (in the function process_hl7msg) is the main bottleneck.
```
def parse_segments(segments):
    df_num = pd.DataFrame()
    for segment in segments:
        obx_timestamp = get obx_timestamp from segment
        obs_identifier = get obs_identifier from segment
        observation_value = get observation_value from segment
        device = get device info from segment
        df = pd.DataFrame()
        df = df.append({"Time": obx_timestamp, obs_identifier: observation_value, "device": device},
                       ignore_index=True)
        if df_num.empty:
            df_num = df
        else:
            df_num = pd.merge(df_num, df, on=["Time", "device"])
    return df_num
```
```
def process_hl7msg(file_list):
    df_list = []
    for file_name in file_list:
        segments = get segments from file_name
        df_list.append(parse_segments(segments))
    df = pd.DataFrame()
    for df1 in df_list:
        if df.empty:
            df = df1
        else:
            df = pd.merge(df, df1, on=["Time", "device"], how='outer')
    return df
```
Below is an example of each parsed HL7 file and the expected output.
```
File 1
Time                      EVENT  device   COL1  COL2
20200420232613.6200+0530  start  device1  1.0   2.3
20200420232614.6200+0530         device1  4.4   1.7

File 2
Time                      EVENT  device   COL3  COL4  COL5
20200420232613.6200+0530         device1  44    66    7
20200420232614.6200+0530         device2  1.0   2.3   0.5
20200420232615.6200+0530  pause  device3  4.4   1.7   0.9

File 3
20200420232613.6200+0530         device2  1.0   2.3
...
File 4000
```
**Expected Output:**

```
Time                      EVENT  device   COL1  COL2  COL3  COL4  COL5
20200420232613.6200+0530  start  device1  1.0   2.3   44    66    7
20200420232613.6200+0530         device2  1.0   2.3
20200420232614.6200+0530  end    device1  4.4   1.7
20200420232615.6200+0530  pause  device2               1.0   2.3   0.5
20200420232616.6200+0530         device3               4.4   1.7   0.9
```
Any suggestion to optimize this would be appreciated.
UPDATE 1:

```
obx_timestamp = 20200420232616.6200+0530
obs_identifier = any one or more values from the list (COL1, COL2, ... COL10)
observation_value = any numeric value
device = any one value from the list (device1, device2, device3, device4, device5)
```
UPDATE 2: Added EVENT column.

```
t3 = [{'Time': 100, 'device': 'device1', 'EVENT': 'event', 'obx_idx': 'MDC1', 'value': 1.2},
      {'Time': 100, 'device': 'device1', 'obx_idx': 'COL2', 'value': 4.5},
      {'Time': 100, 'device': 'device1', 'obx_idx': 'COL4', 'value': 4.5},
      {'Time': 200, 'device': 'device3', 'obx_idx': 'COL2', 'value': 2.5},
      {'Time': 200, 'device': 'device3', 'obx_idx': 'COL3', 'value': 2.5}]
df = pd.DataFrame.from_records(t3, index=['Time', 'device', 'EVENT', 'obx_idx'])['value'].unstack()
```
Try setting the index on both dataframes and doing a join:

```
df.set_index(["Time", "device"], inplace=True)
df1.set_index(["Time", "device"], inplace=True)
df.join(df1, how='outer')
```
However, based on the expected output, you can also try doing a concat on axis=1:

```
df.set_index(["Time", "device"], inplace=True)
df1.set_index(["Time", "device"], inplace=True)
df_f = pd.concat([df, df1], axis=1)
```
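A minimal runnable sketch of the concat approach, using made-up values in place of the parsed HL7 data; missing cells become NaN exactly as in the expected output:

```python
import pandas as pd

# Two toy per-file frames sharing ("Time", "device") keys
df = pd.DataFrame({"Time": [100, 200], "device": ["device1", "device1"],
                   "COL1": [1.0, 4.4], "COL2": [2.3, 1.7]})
df1 = pd.DataFrame({"Time": [100, 300], "device": ["device1", "device2"],
                    "COL3": [44.0, 7.0]})

df.set_index(["Time", "device"], inplace=True)
df1.set_index(["Time", "device"], inplace=True)

# axis=1 concat aligns on the shared index; unmatched cells become NaN
df_f = pd.concat([df, df1], axis=1)
print(df_f)
```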
Here is how you can change your functions. The idea is to not create DataFrames at each loop iteration in parse_segments, but only once at the end, using from_records with the index levels specified so that you can unstack just after, and to use pd.concat with axis=1 in process_hl7msg:
```
def parse_segments(segments):
    l_seg = []
    for segment in segments:
        obx_timestamp = get obx_timestamp from segment
        obs_identifier = get obs_identifier from segment
        observation_value = get observation_value from segment
        device = get device info from segment
        # append a dictionary to a list
        l_seg.append({'time': obx_timestamp, 'device': device,
                      'obs_idx': obs_identifier, 'value': observation_value})
    # create the dataframe with from_records and specify the index
    return pd.DataFrame.from_records(l_seg, index=['time', 'device', 'obs_idx'])['value']\
             .unstack()
```
```
def process_hl7msg(file_list):
    df_list = []
    for file_name in file_list:
        segments = get segments from file_name
        df_list.append(parse_segments(segments))
    # use concat to align everything on (time, device) in one pass
    return pd.concat(df_list, axis=1).reset_index()
```
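To make the refactor concrete, here is a runnable sketch of both functions; the segment tuples and the `files` list are hypothetical stand-ins for the real HL7 parsing:

```python
import pandas as pd

def parse_segments(segments):
    # one dict per OBX instead of one DataFrame per OBX
    l_seg = [{"time": t, "device": dev, "obs_idx": idx, "value": val}
             for (t, dev, idx, val) in segments]
    # build a single frame per file and pivot obs_idx into columns
    return (pd.DataFrame.from_records(l_seg, index=["time", "device", "obs_idx"])
              ["value"].unstack())

def process_hl7msg(files):
    df_list = [parse_segments(segments) for segments in files]
    # align all files on (time, device) in one concat
    return pd.concat(df_list, axis=1).reset_index()

# toy stand-in for 4,000 parsed files
files = [
    [(100, "device1", "COL1", 1.0), (100, "device1", "COL2", 2.3)],
    [(100, "device1", "COL3", 44.0), (200, "device2", "COL3", 7.0)],
]
out = process_hl7msg(files)
print(out)
```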
If the data is not too big (not sure about this data source), you could even do it all at once:
```
def process_hl7msg(file_list):
    l_values = []
    for file_name in file_list:
        segments = get segments from file_name
        # process segments
        for segment in segments:
            obx_timestamp = get obx_timestamp from segment
            obs_identifier = get obs_identifier from segment
            observation_value = get observation_value from segment
            device = get device info from segment
            # append a dictionary to a list
            l_values.append({'time': obx_timestamp, 'device': device,
                             'obs_idx': obs_identifier, 'value': observation_value})
    # return all at once
    return pd.DataFrame.from_records(l_values, index=['time', 'device', 'obs_idx'])['value']\
             .unstack()
```
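The single-pass idea can be demonstrated with a few toy records (hypothetical values, not the real OBX data): accumulate one dict per OBX across all files, then build and pivot one DataFrame at the very end.

```python
import pandas as pd

# toy accumulated records, as they would come out of the loop above
l_values = [
    {"time": 100, "device": "device1", "obs_idx": "COL1", "value": 1.0},
    {"time": 100, "device": "device1", "obs_idx": "COL2", "value": 2.3},
    {"time": 200, "device": "device2", "obs_idx": "COL1", "value": 4.4},
]

# one from_records call, then unstack pivots obs_idx into columns
wide = (pd.DataFrame.from_records(l_values, index=["time", "device", "obs_idx"])
          ["value"].unstack())
print(wide)
```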