How to run a function on every row of a pandas DataFrame
I have a dataframe_1 as such:
Index Time Label
0 0.000 ns Segment 1
1 2.749 sec baseline
2 3.459 min begin test
3 7.009 min end of test
I would like to add multiple new rows between each pair of dataframe_1's rows, where the Time column of each new row adds one more minute until it reaches the time of dataframe_1's next row (and its corresponding Label). For example, the above table should ultimately look like this:
Index Time Label
0 0.000 ns Segment 1
1 2.749 sec baseline
2 00:01:02.749000 baseline + 1min
3 00:02:02.749000 baseline + 2min
4 00:03:02.749000 baseline + 3min
5 3.459 min begin test
6 00:04:27.540000 begin test + 1min
7 00:05:27.540000 begin test + 2min
8 00:06:27.540000 begin test + 3min
9 7.009 min end of test
Using the Timedelta type via pd.to_timedelta() is perfectly fine.
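For reference, a small sketch of how pd.to_timedelta() can turn the "number unit" strings from the table into Timedelta values. The variable names `raw` and `times` are illustrative only, and the exact nanosecond rounding of fractional minutes may vary between pandas versions.

```python
import pandas as pd

# The Time strings exactly as they appear in the table above.
raw = ['0.000 ns', '2.749 sec', '3.459 min', '7.009 min']

# Each "number unit" string is parsed into a Timedelta.
times = pd.to_timedelta(raw)
print(times)

# Timedelta values support arithmetic, e.g. adding one minute.
print(times[1] + pd.Timedelta(minutes=1))
```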
I thought the best way to do this would be to break up each row of dataframe_1 into its own dataframe, then add a row for each added minute, and then concat the dataframes back together. However, I am unsure of how to accomplish this.
Should I use a nested for-loop to [first] iterate over the rows of dataframe_1 and then [second] iterate over a counter so I can create new rows with added minutes?
I was previously not splitting up the individual rows into new dataframes, and I was doing the second iteration like this:
baseline_row = df_legend[df_legend['Label'] == 'baseline']
[baseline_index] = baseline_row.index
baseline_time = baseline_row['Time']
interval_mins = 1
new_time = baseline_time + pd.Timedelta(minutes=interval_mins)
cutoff_time_np = df_legend.iloc[baseline_row.index + 1]['Time']
cutoff_time = pd.to_timedelta(cutoff_time_np)
while new_time.reset_index(drop=True).get(0) < cutoff_time.reset_index(drop=True).get(0):
    new_row = baseline_row.copy()
    new_row['Label'] = f'minute {interval_mins}'
    new_row['Time'] = baseline_time + pd.Timedelta(minutes=interval_mins)
    new_row.index = [baseline_index + interval_mins - 0.5]
    df_legend = df_legend.append(new_row, ignore_index=False)
    df_legend = df_legend.sort_index().reset_index(drop=True)
    pdb.set_trace()
    interval_mins += 1
    new_time = baseline_time + pd.Timedelta(minutes=interval_mins)
But since I want to do this for each row in the original dataframe_1, I was thinking of splitting it up into separate dataframes and putting them back together. I'm just not sure what the best way to do that is, especially since pandas is apparently very slow when iterating over rows.
I would really appreciate some guidance.
This might be faster than your solution.
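import numpy as np
import pandas as pd

df.Time = pd.to_timedelta(df.Time)
# Minutes elapsed since the previous row.
df['counts'] = df.Time.diff().apply(lambda x: x.total_seconds()) / 60
# Whole minutes until the next row (0 for the last row).
df['counts'] = np.floor(df.counts.shift(-1)).fillna(0).astype(int)
df.drop(columns='Index', inplace=True)
df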
Time Label counts
0 00:00:00 Segment 1 0
1 00:00:02.749000 baseline 3
2 00:03:27.540000 begin test 3
3 00:07:00.540000 end of test 0
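To see where the counts values come from: each row's count is the number of whole minutes that fit before the next row's time, and the last row has no successor, so it gets 0. Below is a minimal sketch of the same computation written as one expression. It assumes Time is already a Timedelta column; the sample frame is rebuilt inline from clock-style strings, and `gaps` is an illustrative name.

```python
import pandas as pd

# Sample data from the question, written as clock strings.
df = pd.DataFrame({
    'Time': pd.to_timedelta(['00:00:00', '00:00:02.749',
                             '00:03:27.540', '00:07:00.540']),
    'Label': ['Segment 1', 'baseline', 'begin test', 'end of test'],
})

# Gap to the next row; the last row has no successor, so its gap is NaT.
gaps = df['Time'].shift(-1) - df['Time']

# Whole minutes in each gap; NaN (from NaT) becomes 0 extra rows.
counts = (gaps.dt.total_seconds() // 60).fillna(0).astype(int)
print(counts.tolist())  # [0, 3, 3, 0]
```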
Then use iterrows to get your desired output.
new_df = []
for _, row in df.iterrows():
    val = row.counts
    if val == 0:
        new_df.append(row)
    else:
        new_df.append(row)
        new_row = row.copy()
        label = row.Label
        for i in range(val):
            new_row = new_row.copy()
            new_row.Time += pd.Timedelta('1 min')
            new_row.Label = f'{label} + {i+1}min'
            new_df.append(new_row)
new_df = pd.DataFrame(new_df)
new_df
Time Label counts
0 00:00:00 Segment 1 0
1 00:00:02.749000 baseline 3
1 00:01:02.749000 baseline + 1min 3
1 00:02:02.749000 baseline + 2min 3
1 00:03:02.749000 baseline + 3min 3
2 00:03:27.540000 begin test 3
2 00:04:27.540000 begin test + 1min 3
2 00:05:27.540000 begin test + 2min 3
2 00:06:27.540000 begin test + 3min 3
3 00:07:00.540000 end of test 0
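Note that the output above keeps the repeated index values and the helper counts column. As a possible alternative to the loop, here is a hedged sketch that builds the same rows without iterrows: each row is repeated counts + 1 times and the copies are offset using groupby().cumcount(). This is a different technique from the loop above, not a restatement of it. The names `rep`, `step`, and `out` are illustrative, and the sample frame (including counts) is rebuilt inline so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Time': pd.to_timedelta(['00:00:00', '00:00:02.749',
                             '00:03:27.540', '00:07:00.540']),
    'Label': ['Segment 1', 'baseline', 'begin test', 'end of test'],
    'counts': [0, 3, 3, 0],
})

# Repeat each row (counts + 1) times: the original plus its extra rows.
rep = df.loc[df.index.repeat(df['counts'] + 1)]

# 0 for the original row, then 1, 2, 3, ... for each copy of it.
step = rep.groupby(level=0).cumcount().reset_index(drop=True)

# Drop the helper column and renumber the rows 0..n-1.
out = rep.drop(columns='counts').reset_index(drop=True)

# Shift every copy by its step in minutes and extend its label.
out['Time'] = out['Time'] + step * pd.Timedelta(minutes=1)
out['Label'] = out['Label'].where(
    step == 0, out['Label'] + ' + ' + step.astype(str) + 'min')
print(out)
```

Whether this is actually faster than the loop depends on the size of the frame; on a table this small the difference is unlikely to matter.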
I assume that you converted the Time column from the "number unit" format to a string representation of the time. Something like:
Time Label
Index
0 00:00:00.000 Segment 1
1 00:00:02.749 baseline
2 00:03:27.540 begin test
3 00:07:00.540 end of test
Then, to get your result:
Compute timNxt, the Time column shifted by 1 position and converted to datetime:
timNxt = pd.to_datetime(df.Time.shift(-1))
Define the following "replication" function:
def myRepl(row):
    timCurr = pd.to_datetime(row.Time)
    timNext = timNxt[row.name]
    tbl = [[timCurr.strftime('%H:%M:%S.%f'), row.Label]]
    if pd.notna(timNext):
        n = (timNext - timCurr) // np.timedelta64(1, 'm') + 1
        tbl.extend([
            [(timCurr + np.timedelta64(i, 'm')).strftime('%H:%M:%S.%f'),
             row.Label + f' + {i}min']
            for i in range(1, n)])
    return pd.DataFrame(tbl, columns=row.index)
Apply it to each row of your df and concatenate the results:
result = pd.concat(df.apply(myRepl, axis=1).tolist(), ignore_index=True)
The result is:
Time Label
0 00:00:00.000000 Segment 1
1 00:00:02.749000 baseline
2 00:01:02.749000 baseline + 1min
3 00:02:02.749000 baseline + 2min
4 00:03:02.749000 baseline + 3min
5 00:03:27.540000 begin test
6 00:04:27.540000 begin test + 1min
7 00:05:27.540000 begin test + 2min
8 00:06:27.540000 begin test + 3min
9 00:07:00.540000 end of test
The resulting DataFrame also has the Time column as strings, but at least the fractional part of the seconds has 6 digits everywhere.
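If Timedelta values are preferred over strings (the question says they are acceptable), the column can be converted back afterwards. A minimal sketch, using a small hand-built frame as a stand-in for a few rows of the result above:

```python
import pandas as pd

# Hand-built stand-in for a few rows of the result above.
result = pd.DataFrame({
    'Time': ['00:00:02.749000', '00:01:02.749000', '00:02:02.749000'],
    'Label': ['baseline', 'baseline + 1min', 'baseline + 2min'],
})

# Parse the 'HH:MM:SS.ffffff' strings into Timedelta values.
result['Time'] = pd.to_timedelta(result['Time'])

# Arithmetic now works directly, e.g. the spacing between rows.
spacing = result['Time'].diff().dropna()
print(spacing)
```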