Pandas - How to speed up processing on a huge DataFrame

Question

I'm working on a script that calculates the unavailability time of some equipment I maintain.

My input is a CSV file from our supervision tool (around 2M lines), containing the alarms for one month.

The problem is that it takes a huge amount of time to process!

Once it is converted to a pandas DataFrame, I have a df with these columns:

['date','alarm_key','pcause_id_hex','activity','model_name']

Date : timestamp of the alarm
Alarm_key : id of the alarm
Pcause_id_hex : description of the alarm
Activity : Generated / Cleared (Generated means the alarm started, and Cleared means it ended)
Model_name : name of the equipment

The alarm_key is the same when the alarm is generated and when it is cleared.

As output, I want a new dataframe that contains:

['station','name','start_date','end_date','duration']

Station and Name : I get them from the model_name
Start_date : date of the "Generated" alarm
End_date : date of the "Cleared" alarm
Duration : I have a function that calculates it

Below is my code:

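import re
from calendar import monthrange
from datetime import datetime

import pandas as pd

# _get_unavailability_time() is my own function (not shown here)

df = pd.DataFrame([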
        ['01/03/2022 00:01','5693392','CONNECTION KO','Generated','Equip1_Station1'],
        ['01/03/2022 00:02','5693334','CONNECTION KO','Cleared','Equip2_Station2'],
        ['01/03/2022 00:02','5693352','CONNECTION KO','Generated','Equip3_Station3'],
        ['01/03/2022 02:02','5693392','CONNECTION KO','Cleared','Equip1_Station1']
    ],
        columns=['date','alarm_key','pcause_id_hex','activity','model_name']
    )

list_alarms = [{}]

for i, row in df.iterrows():
    # Process row information
    row_info = {
        'date': row['date'],
        'alarm_key': row['alarm_key'],
        'pcause_id_hex': row['pcause_id_hex'],
        'activity': row['activity'],
        'model_name': row['model_name'],
    }
    # Check if it's a generated alarm
    if row_info['activity'] == 'Generated':
        alarm_info = {
            'station': '',
            'name': '',
            'start_date': '',
            'end_date': '',
            'duration': 0
        }
        # Fill name / station info
        if re.search('_', row_info['model_name']):
            alarm_info['name'] = row_info['model_name'].split('_', 1)[
                0]
            alarm_info['station'] = row_info['model_name'].split('_', 1)[
                1]
        else:
            alarm_info['name'] = ''
            alarm_info['station'] = row_info['model_name']

        # Fill start date
        alarm_info['start_date'] = row_info['date']
        start_datetime = datetime.strptime(
            row_info['date'], '%d/%m/%Y %H:%M')

        # Search for next iteration of the alarm key
        row_cleared = df.loc[(df['alarm_key'] == row_info['alarm_key']) & (
            df['date'] > row_info['date'])]
        if not row_cleared.empty:
            # If found, get end date
            end_date = row_cleared.iloc[0, 0]
            alarm_info['end_date'] = end_date
            end_datetime = datetime.strptime(
                end_date, '%d/%m/%Y %H:%M')
        else:
            # If not found, set end date to last day of the month
            end_datetime = start_datetime.replace(day=monthrange(
                start_datetime.year, start_datetime.month)[1])
            alarm_info['end_date'] = end_datetime.strftime(
                '%d/%m/%Y %H:%M')
        # Calculate duration of the alarm
        alarm_info['duration'] = _get_unavailability_time(
            start_datetime, end_datetime)
        list_alarms.append(alarm_info)

list_alarms.pop(0)
df_output = pd.DataFrame(list_alarms)

For the example data in the code above, I would like a result like this:

    station    name        start_date          end_date    duration
0  Station1  Equip1  01/03/2022 00:01  01/03/2022 02:02    0.983333
1  Station3  Equip3  01/03/2022 00:02  31/03/2022 00:02  600.000000

I iterate through the dataframe, getting each row's info. If the row is a Generated one, I look for the next occurrence of the same alarm key with a Cleared activity. Once found, I store the end date in a list containing the information related to the alarm. (If an alarm isn't cleared, I set the end date to the last day of the month.)

I don't know how to speed it up any further (as you can see, I'm far from an expert in this).

If you have any suggestions to improve the process, please let me know!

Answer 1

IIUC, you want to split the dataset into a "generated" part and a "cleared" part:

df = pd.DataFrame([
        ['01/03/2022 00:01','5693392','CONNECTION KO','Generated','Equip1_Station1'],
        ['01/03/2022 00:02','5693334','CONNECTION KO','Cleared','Equip2_Station2'],
        ['01/03/2022 00:02','5693334','CONNECTION KO','Generated','Equip2_Station2'],
        ['01/03/2022 02:02','5693392','CONNECTION KO','Cleared','Equip1_Station1']
    ],
        columns=['date','alarm_key','pcause_id_hex','activity','model_name']
    )

df_gen = df[df['activity'] == 'Generated']
df_clr = df[df['activity'] == 'Cleared']
df_gen = df_gen.merge(df_clr[['date', 'alarm_key']], on=['alarm_key'], how='inner')
df_gen[['equipment', 'station']] = df_gen['model_name'].str.split('_', expand=True)

Output

# df_gen
             date_x alarm_key  pcause_id_hex   activity       model_name            date_y equipment   station
0  01/03/2022 00:01   5693392  CONNECTION KO  Generated  Equip1_Station1  01/03/2022 02:02    Equip1  Station1
1  01/03/2022 00:02   5693334  CONNECTION KO  Generated  Equip2_Station2  01/03/2022 00:02    Equip2  Station2
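
Building on the split above, here is a minimal sketch of how the full output asked for in the question could be produced without a Python loop. It is one possible approach, not the only one. It assumes that each "Generated" row should be paired with the first strictly later "Cleared" row of the same alarm_key (done here with pd.merge_asof), and it uses plain elapsed hours as a stand-in for the asker's `_get_unavailability_time`, which is not shown. The function name `pair_alarms` is hypothetical.

```python
import pandas as pd


def pair_alarms(df):
    """Pair each Generated alarm with its next Cleared event (sketch)."""
    # Parse the dates once so that comparisons are chronological.
    df = df.assign(date=pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M'))

    gen = df.loc[df['activity'] == 'Generated',
                 ['date', 'alarm_key', 'model_name']].sort_values('date')
    clr = df.loc[df['activity'] == 'Cleared',
                 ['date', 'alarm_key']].sort_values('date')
    # Keep a copy of the clear time; the 'date' key itself comes from the left frame.
    clr = clr.assign(end_date=clr['date'])

    # For every Generated row, take the first strictly later Cleared row
    # with the same alarm_key. Unmatched rows get NaT in end_date.
    out = pd.merge_asof(gen, clr, on='date', by='alarm_key',
                        direction='forward', allow_exact_matches=False)
    out = out.rename(columns={'date': 'start_date'})

    # Uncleared alarms: same time of day on the last day of the start month.
    start = out['start_date']
    month_end = start + pd.to_timedelta(
        start.dt.days_in_month - start.dt.day, unit='D')
    out['end_date'] = out['end_date'].fillna(month_end)

    # Assumption: elapsed hours, NOT the asker's _get_unavailability_time.
    out['duration'] = (out['end_date'] - start).dt.total_seconds() / 3600

    # 'Equip1_Station1' -> name 'Equip1', station 'Station1';
    # no underscore -> empty name, whole string as station.
    parts = out['model_name'].str.partition('_')
    has_sep = parts[1] == '_'
    out['name'] = parts[0].where(has_sep, '')
    out['station'] = parts[2].where(has_sep, parts[0])

    return out[['station', 'name', 'start_date', 'end_date', 'duration']]


# Sample data from the question.
df = pd.DataFrame(
    [
        ['01/03/2022 00:01', '5693392', 'CONNECTION KO', 'Generated', 'Equip1_Station1'],
        ['01/03/2022 00:02', '5693334', 'CONNECTION KO', 'Cleared', 'Equip2_Station2'],
        ['01/03/2022 00:02', '5693352', 'CONNECTION KO', 'Generated', 'Equip3_Station3'],
        ['01/03/2022 02:02', '5693392', 'CONNECTION KO', 'Cleared', 'Equip1_Station1'],
    ],
    columns=['date', 'alarm_key', 'pcause_id_hex', 'activity', 'model_name'],
)
out = pair_alarms(df)
print(out)
```

On the question's sample data, the Equip1 alarm is paired with its 02:02 clear, and the uncleared Equip3 alarm gets an end date of 31/03/2022 00:02. The duration values will differ from the question's expected output, since that comes from a function that is not shown. Parsing the dates up front also matters for correctness: 'dd/mm/yyyy' strings compared as text do not sort chronologically.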

Answer 1
0 Accepted 2022-06-22 17:37:08
