How to find the first occurrence for each id based on a datetime column using pandas?

Question

I have seen a lot of similar questions, but didn't quite find an answer to my problem. Let's say I have a df:

    sample_id     tested_at   test_value
            1    2020-07-21            5
            1    2020-07-22            4
            1    2020-07-23            6
            2    2020-07-26            6
            2    2020-07-28            5
            3    2020-07-22            4
            3    2020-07-27            4
            3    2020-07-30            6

The df is already sorted in ascending order by the tested_at column. I now need to add another column, first_test, which holds on every row the first test value for that row's sample_id, regardless of whether it is the highest or not. The output should be:

    sample_id     tested_at   test_value   first_test
            1    2020-07-21            5            5
            1    2020-07-22            4            5
            1    2020-07-23            6            5
            2    2020-07-26            6            6
            2    2020-07-28            5            6
            3    2020-07-22            4            4
            3    2020-07-27            4            4
            3    2020-07-30            6            4

The df is also quite big, so a faster way would be very much appreciated.

Answer 1

You can use pandas' groupby to group by sample ID, and then use the transform method to get the first value per sample ID. Note that this takes the first value by row position, not the first value by date, so make sure the rows are ordered by date.

import pandas as pd

df = pd.DataFrame(
    [
        [1, "2020-07-21", 5],
        [1, "2020-07-22", 4],
        [1, "2020-07-23", 6],
        [2, "2020-07-26", 6],
        [2, "2020-07-28", 5],
        [3, "2020-07-22", 4],
        [3, "2020-07-27", 4],
        [3, "2020-07-30", 6],
    ],
    columns=["sample_id", "tested_at", "test_value"],
)

df["first_test"] = df.groupby("sample_id")["test_value"].transform("first")

Which results in:

   sample_id   tested_at  test_value  first_test
0          1  2020-07-21           5           5
1          1  2020-07-22           4           5
2          1  2020-07-23           6           5
3          2  2020-07-26           6           6
4          2  2020-07-28           5           6
5          3  2020-07-22           4           4
6          3  2020-07-27           4           4
7          3  2020-07-30           6           4
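
If the rows are not already ordered by date, the caveat above matters. The following is a minimal sketch, not part of the original answer, of one way to handle that case: parse tested_at as datetimes, sort, then apply the same transform. The shuffled input is hypothetical and only reuses the values from the question.

```python
import pandas as pd

# Hypothetical input: the question's data with the rows deliberately shuffled.
df = pd.DataFrame(
    {
        "sample_id": [3, 1, 2, 1, 3, 2, 1, 3],
        "tested_at": [
            "2020-07-30", "2020-07-22", "2020-07-28", "2020-07-21",
            "2020-07-22", "2020-07-26", "2020-07-23", "2020-07-27",
        ],
        "test_value": [6, 4, 5, 5, 4, 6, 6, 4],
    }
)

# Parse the strings as datetimes so the sort is chronological.
df["tested_at"] = pd.to_datetime(df["tested_at"])

# Order rows by id, then by date, and rebuild a clean 0..n-1 index.
df = df.sort_values(["sample_id", "tested_at"]).reset_index(drop=True)

# Once sorted, "first by row position" within each group is also "first by date".
df["first_test"] = df.groupby("sample_id")["test_value"].transform("first")

print(df)
```

One thing to be aware of: a groupby "first" returns the first non-null entry in each group, so if the earliest test_value for an id were missing, the next available value would be used instead.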

Answer 1 (score 2, accepted, 2020-08-07 08:45:29)