How to run a function on every row of a pandas DataFrame

I have a DataFrame, dataframe_1, that looks like this:

Index   Time          Label
0       0.000 ns      Segment 1
1       2.749 sec     baseline
2       3.459 min     begin test
3       7.009 min     end of test

I would like to add multiple new rows between each pair of consecutive rows of dataframe_1, where each new row's Time is one minute later than the previous one, until the next original row's time (and its corresponding Label) is reached. For example, the table above should ultimately look like this:

Index     Time               Label
0         0.000 ns           Segment 1
1         2.749 sec          baseline
2         00:01:02.749000    baseline + 1min
3         00:02:02.749000    baseline + 2min
4         00:03:02.749000    baseline + 3min
5         3.459 min          begin test
6         00:04:27.540000    begin test + 1min
7         00:05:27.540000    begin test + 2min
8         00:06:27.540000    begin test + 3min
9         7.009 min          end of test

Using the Timedelta type via pd.to_timedelta() is perfectly fine.

I thought the best way to do this would be to break each row of dataframe_1 into its own DataFrame, add a row for each added minute, and then concat the DataFrames back together. However, I am unsure how to accomplish this.

Should I use a nested for-loop to first iterate over the rows of dataframe_1 and then iterate over a counter so I can create new rows with added minutes?

Previously I was not splitting the individual rows into new DataFrames, and I was doing the second iteration like this:

    baseline_row = df_legend[df_legend['Label'] == 'baseline']
    [baseline_index] = baseline_row.index
    baseline_time = baseline_row['Time']

    interval_mins = 1
    new_time = baseline_time + pd.Timedelta(minutes=interval_mins)

    cutoff_time_np = df_legend.iloc[baseline_row.index + 1]['Time']
    cutoff_time = pd.to_timedelta(cutoff_time_np)
    
    while new_time.reset_index(drop=True).get(0) < cutoff_time.reset_index(drop=True).get(0):

        new_row = baseline_row.copy()
        new_row['Label'] = f'minute {interval_mins}'
        new_row['Time'] = baseline_time + pd.Timedelta(minutes=interval_mins)
        new_row.index = [baseline_index + interval_mins - 0.5]

        df_legend = df_legend.append(new_row, ignore_index=False)
        df_legend = df_legend.sort_index().reset_index(drop=True)
        pdb.set_trace()

        interval_mins += 1
        new_time = baseline_time + pd.Timedelta(minutes=interval_mins)

But since I want to do this for each row of the original dataframe_1, I was thinking of splitting it into separate DataFrames and putting them back together. I'm just not sure what the best way to do that is, especially since pandas is apparently very slow when iterating over rows.

I would really appreciate some guidance.

This might be faster than your solution.

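import numpy as np
import pandas as pd

# Parse Time, then count the whole minutes between each row and the next one
df.Time = pd.to_timedelta(df.Time)
df['counts'] = df.Time.diff().apply(lambda x: x.total_seconds()) / 60
df['counts'] = np.floor(df.counts.shift(-1)).fillna(0).astype(int)
df.drop(columns='Index', inplace=True)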

df

             Time        Label  counts
0        00:00:00    Segment 1       0
1 00:00:02.749000     baseline       3
2 00:03:27.540000   begin test       3
3 00:07:00.540000  end of test       0

Then use iterrows to get your desired output.

new_df = []
for _, row in df.iterrows():
    val = row.counts
    if val == 0:
        new_df.append(row)
    else:
        new_df.append(row)
        new_row = row.copy()
        label = row.Label
        for i in range(val):
            new_row = new_row.copy()
            new_row.Time += pd.Timedelta('1 min')
            new_row.Label = f'{label} + {i+1}min'
            new_df.append(new_row)

new_df = pd.DataFrame(new_df)
new_df

             Time              Label  counts
0        00:00:00          Segment 1       0
1 00:00:02.749000           baseline       3
1 00:01:02.749000    baseline + 1min       3
1 00:02:02.749000    baseline + 2min       3
1 00:03:02.749000    baseline + 3min       3
2 00:03:27.540000         begin test       3
2 00:04:27.540000  begin test + 1min       3
2 00:05:27.540000  begin test + 2min       3
2 00:06:27.540000  begin test + 3min       3
3 00:07:00.540000        end of test       0
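
The same expansion can also be written without an explicit Python loop. The following is only a sketch of an alternative technique (repeating rows with Index.repeat and numbering the copies with groupby(...).cumcount()), not the method used in this answer. The sample data is typed in by hand as already-parsed durations, and the names gap_minutes, expanded and step are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Hand-typed sample data, with Time already in a form pd.to_timedelta accepts.
df = pd.DataFrame({
    "Time": pd.to_timedelta(["00:00:00", "00:00:02.749", "00:03:27.540", "00:07:00.540"]),
    "Label": ["Segment 1", "baseline", "begin test", "end of test"],
})

# Whole minutes that fit before the next row's time (0 for the last row).
gap_minutes = df["Time"].diff().shift(-1).dt.total_seconds() / 60
counts = np.floor(gap_minutes).fillna(0).astype(int)

# Repeat each row (1 + counts) times, then number the copies 0, 1, 2, ...
expanded = df.loc[df.index.repeat((counts + 1).to_numpy())]
step = expanded.groupby(level=0).cumcount().to_numpy()
expanded = expanded.reset_index(drop=True)

# Copy k of a row is k minutes later and gets a " + kmin" suffix (k = 0 is the original row).
expanded["Time"] = expanded["Time"] + pd.to_timedelta(step, unit="min")
expanded["Label"] = [
    f"{label} + {k}min" if k else label
    for label, k in zip(expanded["Label"], step)
]
```

Here expanded has a fresh 0..n-1 index and no helper counts column, so nothing needs to be dropped afterwards.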

I assume that you have converted the Time column from the "number unit" format to a string representation of the time. Something like:

               Time        Label
Index                           
0      00:00:00.000    Segment 1
1      00:00:02.749     baseline
2      00:03:27.540   begin test
3      00:07:00.540  end of test
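
How that conversion was done is not shown in the question. As a rough sketch only, one way to turn the "number unit" strings into time values is a small hand-written parser. The helper parse_number_unit and the UNIT_TO_SECONDS table are hypothetical, cover only the units that appear in the sample, and round to whole milliseconds.

```python
import pandas as pd

# Assumed unit table: only the units that appear in the question's sample.
UNIT_TO_SECONDS = {"ns": 1e-9, "sec": 1.0, "min": 60.0}

def parse_number_unit(text):
    """Turn a string such as '3.459 min' into a pd.Timedelta (millisecond precision)."""
    number, unit = text.split()
    millis = round(float(number) * UNIT_TO_SECONDS[unit] * 1000)
    return pd.Timedelta(milliseconds=millis)

raw = pd.Series(["0.000 ns", "2.749 sec", "3.459 min", "7.009 min"])
times = raw.map(parse_number_unit)
```

Rounding to milliseconds avoids floating-point noise in the sample values, at the cost of discarding anything finer than a millisecond.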

Then, to get your result:然后,得到你的结果:

  1. Compute timNxt, the Time column shifted by 1 position and converted to datetime:

     timNxt = pd.to_datetime(df.Time.shift(-1))
  2. Define the following "replication" function:

     def myRepl(row):
         timCurr = pd.to_datetime(row.Time)
         timNext = timNxt[row.name]
         tbl = [[timCurr.strftime('%H:%M:%S.%f'), row.Label]]
         if pd.notna(timNext):
             n = (timNext - timCurr) // np.timedelta64(1, 'm') + 1
             tbl.extend([
                 [(timCurr + np.timedelta64(i, 'm')).strftime('%H:%M:%S.%f'), row.Label + f' + {i}min']
                 for i in range(1, n)])
         return pd.DataFrame(tbl, columns=row.index)
  3. Apply it to each row of your df and concatenate the results:

     result = pd.concat(df.apply(myRepl, axis=1).tolist(), ignore_index=True)

The result is:

              Time              Label
0  00:00:00.000000          Segment 1
1  00:00:02.749000           baseline
2  00:01:02.749000    baseline + 1min
3  00:02:02.749000    baseline + 2min
4  00:03:02.749000    baseline + 3min
5  00:03:27.540000         begin test
6  00:04:27.540000  begin test + 1min
7  00:05:27.540000  begin test + 2min
8  00:06:27.540000  begin test + 3min
9  00:07:00.540000        end of test

The Time column of the resulting DataFrame is also a string, but at least the fractional part of the seconds has 6 digits everywhere.
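
If you would rather keep Time as Timedelta values (which the question says is fine) and only produce this fixed-width text at the end, something like the following may work. It is a sketch under the assumption that every duration is shorter than 24 hours. The variable names are made up for the example.

```python
import pandas as pd

# Sample durations, kept as Timedelta values instead of strings.
times = pd.Series(pd.to_timedelta(["00:00:00", "00:00:02.749", "00:03:27.540"]))

# Timedelta has no strftime, so anchor the durations to an arbitrary date
# (the Unix epoch here) and format only the time-of-day part.
# Assumption: every duration is shorter than 24 hours.
formatted = (times + pd.Timestamp(0)).dt.strftime("%H:%M:%S.%f")
```

This keeps arithmetic such as adding pd.Timedelta(minutes=1) on real durations and leaves string formatting as a final display step.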
