How to find first occurrence for each id based on datetime column with pandas?
I have seen a lot of similar questions, but didn't quite find an answer to my problem. Let's say I have a df:
sample_id tested_at test_value
1 2020-07-21 5
1 2020-07-22 4
1 2020-07-23 6
2 2020-07-26 6
2 2020-07-28 5
3 2020-07-22 4
3 2020-07-27 4
3 2020-07-30 6
The df is already sorted in ascending order by the tested_at column. I now need to add another column, first_test, which shows on every row the first test value for that row's sample_id, regardless of whether it is the highest or not. The output should be:
sample_id tested_at test_value first_test
1 2020-07-21 5 5
1 2020-07-22 4 5
1 2020-07-23 6 5
2 2020-07-26 6 6
2 2020-07-28 5 6
3 2020-07-22 4 4
3 2020-07-27 4 4
3 2020-07-30 6 4
The df is also quite big, so a faster way would be very much appreciated.
You can use pandas' groupby to group by sample ID, and then use the transform method to get the first value per sample ID. Note that this takes the first value by row number, not the first value by date, so make sure the rows are ordered by date.
import pandas as pd

df = pd.DataFrame(
    [
        [1, "2020-07-21", 5],
        [1, "2020-07-22", 4],
        [1, "2020-07-23", 6],
        [2, "2020-07-26", 6],
        [2, "2020-07-28", 5],
        [3, "2020-07-22", 4],
        [3, "2020-07-27", 4],
        [3, "2020-07-30", 6],
    ],
    columns=["sample_id", "tested_at", "test_value"],
)
# Broadcast the first test_value of each sample_id group to every row of that group.
df["first_test"] = df.groupby("sample_id")["test_value"].transform("first")
Which results in:
sample_id tested_at test_value first_test
0 1 2020-07-21 5 5
1 1 2020-07-22 4 5
2 1 2020-07-23 6 5
3 2 2020-07-26 6 6
4 2 2020-07-28 5 6
5 3 2020-07-22 4 4
6 3 2020-07-27 4 4
7 3 2020-07-30 6 4
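Since transform("first") goes by row position rather than by date, here is a minimal sketch of one way to handle a df that is not already ordered by tested_at. The shuffled row order below is an assumption made purely for illustration; the data is the same as in the question. The idea is to sort a temporary copy by date and let pandas align the result back to the original rows through the index.

```python
import pandas as pd

# Same rows as in the question, but deliberately shuffled (illustrative assumption)
# so that the frame is NOT in date order.
df = pd.DataFrame(
    [
        [3, "2020-07-30", 6],
        [1, "2020-07-22", 4],
        [2, "2020-07-28", 5],
        [1, "2020-07-21", 5],
        [3, "2020-07-22", 4],
        [2, "2020-07-26", 6],
        [1, "2020-07-23", 6],
        [3, "2020-07-27", 4],
    ],
    columns=["sample_id", "tested_at", "test_value"],
)

# Parse the strings as real datetimes so the sort is chronological.
df["tested_at"] = pd.to_datetime(df["tested_at"])

# Sort a temporary copy by date, then take the first value per sample_id.
# transform keeps the original index labels, so the assignment aligns each
# value back to the matching row of the unsorted df.
df["first_test"] = (
    df.sort_values("tested_at", kind="mergesort")  # mergesort is a stable sort
    .groupby("sample_id")["test_value"]
    .transform("first")
)

print(df)
```

The original row order of df is left untouched; only the new column is added. One thing to keep in mind is that "first" picks the first non-null value in each group, so a missing test_value on the earliest date would be skipped.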