Pandas - How to speed up a process on a huge dataframe
I'm working on a script that calculates the unavailability time of some equipment I maintain.
As input I have a CSV file from our supervision tool (around 2M lines) containing the alarms for one month.
The problem is that it takes a very long time to process!
Once loaded into a pandas DataFrame, I have a df with these columns:
['date','alarm_key','pcause_id_hex','activity','model_name']
The alarm_key is the same when the alarm is generated and when it is cleared.
As output I want a new dataframe containing:
['station','name','start_date','end_date','duration']
Below is my code:
import re
from calendar import monthrange
from datetime import datetime

import pandas as pd

# Note: _get_unavailability_time() is my own helper; its code is not shown here.

df = pd.DataFrame([
    ['01/03/2022 00:01','5693392','CONNECTION KO','Generated','Equip1_Station1'],
    ['01/03/2022 00:02','5693334','CONNECTION KO','Cleared','Equip2_Station2'],
    ['01/03/2022 00:02','5693352','CONNECTION KO','Generated','Equip3_Station3'],
    ['01/03/2022 02:02','5693392','CONNECTION KO','Cleared','Equip1_Station1']
    ],
    columns=['date','alarm_key','pcause_id_hex','activity','model_name']
)
list_alarms = [{}]
for i, row in df.iterrows():
    # Process row information
    row_info = {
        'date': row['date'],
        'alarm_key': row['alarm_key'],
        'pcause_id_hex': row['pcause_id_hex'],
        'activity': row['activity'],
        'model_name': row['model_name'],
    }
    # Check if it's a generated alarm
    if row_info['activity'] == 'Generated':
        alarm_info = {
            'station': '',
            'name': '',
            'start_date': '',
            'end_date': '',
            'duration': 0
        }
        # Fill name / station info
        if re.search('_', row_info['model_name']):
            alarm_info['name'] = row_info['model_name'].split('_', 1)[0]
            alarm_info['station'] = row_info['model_name'].split('_', 1)[1]
        else:
            alarm_info['name'] = ''
            alarm_info['station'] = row_info['model_name']
        # Fill start date
        alarm_info['start_date'] = row_info['date']
        start_datetime = datetime.strptime(
            row_info['date'], '%d/%m/%Y %H:%M')
        # Search for next iteration of the alarm key
        row_cleared = df.loc[(df['alarm_key'] == row_info['alarm_key']) & (
            df['date'] > row_info['date'])]
        if not row_cleared.empty:
            # If found, get end date
            end_date = row_cleared.iloc[0, 0]
            alarm_info['end_date'] = end_date
            end_datetime = datetime.strptime(
                end_date, '%d/%m/%Y %H:%M')
        else:
            # If not found, set end date to last day of the month
            end_datetime = start_datetime.replace(day=monthrange(
                start_datetime.year, start_datetime.month)[1])
            alarm_info['end_date'] = end_datetime.strftime(
                '%d/%m/%Y %H:%M')
        # Calculate duration of the alarm
        alarm_info['duration'] = _get_unavailability_time(
            start_datetime, end_datetime)
        list_alarms.append(alarm_info)
list_alarms.pop(0)
df_output = pd.DataFrame(list_alarms)
For the example in the code above, I would like a result like this:
    station    name        start_date          end_date    duration
0  Station1  Equip1  01/03/2022 00:01  01/03/2022 02:02    0.983333
1  Station3  Equip3  01/03/2022 00:02  31/03/2022 00:02  600.000000
I iterate through the dataframe, reading each row. If the row is a Generated alarm, I look for the next occurrence of the same alarm key with a Cleared activity. Once found, I store its date as the end date in a list holding the information for that alarm. (If an alarm is never cleared, I set the end date to the last day of the month.)
I don't know how to speed it up any further (as you can see, I'm far from an expert in this).
If you have any suggestions to improve the process, please let me know!
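A side note on the loop above, offered as a hedged observation rather than part of the original question: the `date` column holds strings in `dd/mm/YYYY HH:MM` format, so `df['date'] > row_info['date']` compares text, not time. Within a single month that happens to give the right order, because the zero-padded day comes first, but it breaks as soon as two months are mixed. A minimal sketch of the difference, using made-up timestamps in the same format:

```python
import pandas as pd

# Two illustrative timestamps in the question's dd/mm/YYYY HH:MM format.
earlier = '02/01/2022 00:00'   # 2 January 2022
later = '01/03/2022 00:00'     # 1 March 2022

# String comparison goes character by character, so '02...' sorts after '01...'.
string_says_earlier_is_greater = earlier > later

# Parsing with an explicit format gives a chronological comparison.
fmt = '%d/%m/%Y %H:%M'
parsed_earlier = pd.to_datetime(earlier, format=fmt)
parsed_later = pd.to_datetime(later, format=fmt)
datetime_says_earlier_is_greater = parsed_earlier > parsed_later

print(string_says_earlier_is_greater)    # True, although January precedes March
print(datetime_says_earlier_is_greater)  # False
```

Parsing the whole column once with `pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M')` also avoids calling `datetime.strptime` on every row.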
IIUC, you want to split the dataset into a "generated" part and a "cleared" part, then join the two on alarm_key:
import pandas as pd

df = pd.DataFrame([
    ['01/03/2022 00:01','5693392','CONNECTION KO','Generated','Equip1_Station1'],
    ['01/03/2022 00:02','5693334','CONNECTION KO','Cleared','Equip2_Station2'],
    ['01/03/2022 00:02','5693334','CONNECTION KO','Generated','Equip2_Station2'],
    ['01/03/2022 02:02','5693392','CONNECTION KO','Cleared','Equip1_Station1']
    ],
    columns=['date','alarm_key','pcause_id_hex','activity','model_name']
)
# Split the rows by activity
df_gen = df[df['activity'] == 'Generated']
df_clr = df[df['activity'] == 'Cleared']
# Attach the Cleared date to each Generated row (date_x = start, date_y = end)
df_gen = df_gen.merge(df_clr[['date', 'alarm_key']], on=['alarm_key'], how='inner')
# Split on the first underscore only, as in the question
df_gen[['equipment', 'station']] = df_gen['model_name'].str.split('_', n=1, expand=True)
Output
# df_gen
             date_x  alarm_key  pcause_id_hex   activity       model_name            date_y  equipment   station
0  01/03/2022 00:01    5693392  CONNECTION KO  Generated  Equip1_Station1  01/03/2022 02:02     Equip1  Station1
1  01/03/2022 00:02    5693334  CONNECTION KO  Generated  Equip2_Station2  01/03/2022 00:02     Equip2  Station2
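The inner merge above drops alarms that are never cleared, and if an alarm_key occurs several times in the month it pairs every Generated row with every Cleared row for that key. One possible way to cover both cases, sketched here under assumptions rather than taken from the answer, is `pd.merge_asof` with `direction='forward'`, which matches each Generated row to the next Cleared row with the same key. The `duration` below is plain elapsed hours and only a stand-in: the question's `_get_unavailability_time` is not shown, so the numbers will not match the expected output in the question.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame(
    [
        ['01/03/2022 00:01', '5693392', 'CONNECTION KO', 'Generated', 'Equip1_Station1'],
        ['01/03/2022 00:02', '5693334', 'CONNECTION KO', 'Cleared', 'Equip2_Station2'],
        ['01/03/2022 00:02', '5693352', 'CONNECTION KO', 'Generated', 'Equip3_Station3'],
        ['01/03/2022 02:02', '5693392', 'CONNECTION KO', 'Cleared', 'Equip1_Station1'],
    ],
    columns=['date', 'alarm_key', 'pcause_id_hex', 'activity', 'model_name'],
)

# Parse the dates once; merge_asof needs both sides sorted on the time key.
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M')
df = df.sort_values('date', kind='mergesort')

gen = df[df['activity'] == 'Generated'].rename(columns={'date': 'start_date'})
clr = (df.loc[df['activity'] == 'Cleared', ['date', 'alarm_key']]
         .rename(columns={'date': 'end_date'}))

# For each Generated row, take the first Cleared row with the same alarm_key
# strictly after start_date (mirrors the `>` used in the question's loop).
out = pd.merge_asof(gen, clr, left_on='start_date', right_on='end_date',
                    by='alarm_key', direction='forward',
                    allow_exact_matches=False)

# Alarms never cleared: end on the last day of the start month, same time of day.
start = out['start_date']
month_end = start + pd.to_timedelta(start.dt.days_in_month - start.dt.day, unit='D')
out['end_date'] = out['end_date'].fillna(month_end)

# Stand-in duration (assumption): elapsed hours between start and end.
out['duration'] = (out['end_date'] - out['start_date']).dt.total_seconds() / 3600

# Split on the first underscore; with no underscore, the whole value is the station.
parts = out['model_name'].str.partition('_')
has_sep = parts[1] == '_'
out['name'] = parts[0].where(has_sep, '')
out['station'] = parts[2].where(has_sep, parts[0])

result = out[['station', 'name', 'start_date', 'end_date', 'duration']]
print(result)
```

On the sample data this yields two rows: Equip1 at Station1 ending at 02:02 on 1 March, and Equip3 at Station3 ending at 00:02 on 31 March because it is never cleared. Everything runs as column operations, so no row is visited in a Python loop.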