Finding overlapping time intervals based on a condition in another column (pandas)

Question

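I have cleaned a dataset into the format below. assigned_pat_loc represents a room number, so I am trying to determine when two different patients (patient_id) were in the same room at the same time; that is, when start_time and end_time overlap between rows that have the same assigned_pat_loc but a different patient_id. start_time and end_time give the period a particular patient spent in that room, so if the periods of two patients in the same room overlap, they shared the room. That is what I am ultimately looking for. Here is the base dataset I want to build on: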

      patient_id    assigned_pat_loc    start_time          end_time
0     19035648      SICU^6108           2009-01-10 18:27:48 2009-02-25 15:45:54
1     19039244      85^8520             2009-01-02 06:27:25 2009-01-05 10:38:41
2     19039507      55^5514             2009-01-01 13:25:45 2009-01-01 13:25:45
3     19039555      EIAB^EIAB           2009-01-15 01:56:48 2009-02-23 11:36:34
4     19039559      EIAB^EIAB           2009-01-16 11:24:18 2009-01-19 18:41:33
... ... ... ... ...
140906 46851413     EIAB^EIAB           2011-12-31 22:28:38 2011-12-31 23:15:49
140907 46851422     EIAB^EIAB           2011-12-31 21:52:44 2011-12-31 22:50:08
140908 46851430     4LD^4LDX            2011-12-31 22:41:10 2011-12-31 22:44:48
140909 46851434     EIC^EIC             2011-12-31 23:45:22 2011-12-31 23:45:22
140910 46851437     EIAB^EIAB           2011-12-31 22:54:40 2011-12-31 23:30:10

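I think I should approach this with some kind of groupby, but I'm not sure exactly how to implement it. I would show an attempt, but it took me about 6 hours just to get to this point, so I'd appreciate even just some ideas.

Edit

Sample raw data: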

id  Date    Time        assigned_pat_loc    prior_pat_loc   Activity
1   May/31/11   8:00    EIAB^EIAB^6                         Admission
1   May/31/11   9:00    8w^201              EIAB^EIAB^6     Transfer 
1   Jun/8/11    15:00   8w^201                              Discharge
2   May/31/11   5:00    EIAB^EIAB^4                         Admission 
2   May/31/11   7:00    10E^45              EIAB^EIAB^4     Transfer
2   Jun/1/11    1:00    8w^201              10E^45          Transfer
2   Jun/1/11    8:00    8w^201                              Discharge
3   May/31/11   9:00    EIAB^EIAB^2                         Admission
3   Jun/1/11    9:00    8w^201              EIAB^EIAB^2     Transfer
3   Jun/5/11    9:00    8w^201                              Discharge
4   May/31/11   9:00    EIAB^EIAB^9                         Admission
4   May/31/11   7:00    10E^45              EIAB^EIAB^9     Transfer
4   Jun/1/11    8:00    10E^45                              Death

Example of the desired output:

id  r_id    start_date  start_time  end_date    end_time    length  location 
1   2       Jun/1/11    1:00        Jun/1/11    8:00        7   8w^201
1   3       Jun/1/11    9:00        Jun/5/11    9:00        96  8w^201
2   4       May/31/11   7:00        Jun/1/11    1:00        18  10E^45
2   1       Jun/1/11    1:00        Jun/1/11    8:00        7   8w^201
3   1       Jun/1/11    9:00        Jun/5/11    9:00        96  8w^201

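where r_id is the "other" patient sharing the same room, and length is the time spent sharing the room (in hours).

In this example:

r_id is the name of the variable you would generate for the other patient's ID.
Patient 1 had two room-sharing episodes, both in 8w^201 (room 201 of unit 8w): he shared the room with patient 2 for 7 hours (June 1, 1 a.m. to 8 a.m.) and with patient 3 for 96 hours (June 1, 9 a.m. to June 5, 9 a.m.).
Patient 2 also had two room-sharing episodes. The first was in 10E^45 (room 45 of unit 10E) with patient 4 and lasted 18 hours (May 31, 7 a.m. to June 1, 1 a.m.); the second was the 7-hour episode with patient 1 in 8w^201.
Patient 3 had only one room-sharing episode, with patient 1 in room 8w^201, lasting 96 hours.
Patient 4 also had only one room-sharing episode, with patient 2 in room 10E^45, lasting 18 hours.
Note: each room-sharing episode is listed twice, once for each patient.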

Answer 1

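numpy broadcasting is a good fit for this. It lets you compare every record (room stay) against every other record in the dataframe. The downside is that it is memory intensive, since it needs n^2 * 8 bytes to store the comparison matrix. Going through ~141k rows of data would take 148GB of memory!

We need to chunk the dataframe so that the memory requirement drops to chunk_size * n * 8 bytes.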
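As a rough check of those figures, the arithmetic can be sketched as below. The 8 bytes per cell is the estimate used in this answer, the row count is taken from the index range shown in the question, and the helper name `matrix_bytes` is made up for illustration. The 148 figure comes out when a gigabyte is counted as 2**30 bytes.

```python
def matrix_bytes(rows: int, cols: int, bytes_per_cell: int = 8) -> int:
    """Bytes needed to hold a rows x cols comparison matrix."""
    return rows * cols * bytes_per_cell


n = 140_911         # rows in the question's dataset (index 0..140910)
chunk_size = 1_000  # same chunk size as in the loop of this answer

full = matrix_bytes(n, n)              # every row against every row
chunked = matrix_bytes(chunk_size, n)  # one chunk against every row

print(f"full matrix: {full / 2**30:.1f} GiB")
print(f"one chunk:   {chunked / 2**30:.2f} GiB")
```

Only one chunk-sized matrix is alive at a time, so peak memory is governed by chunk_size rather than by n^2.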

import numpy as np
import pandas as pd

# Don't keep date and time separately, they are hard to
# perform calculations on. Instead, combine them into a
# single column and keep it as pd.Timestamp
df["start_time"] = pd.to_datetime(df["Date"] + " " + df["Time"])

# I don't know how you determine when a patient vacates a
# room. My logic here is
#   - If Activity = Discharge or Death, end_time = start_time
#   - Otherwise, end_time = start_time of the next room
# You can implement your own logic. This part is not
# essential to the problem at hand.
df["end_time"] = np.where(
    df["Activity"].isin(["Discharge", "Death"]),
    df["start_time"],
    df.groupby("id")["start_time"].shift(-1),
)

# ------------------------------------------------------------------------------

# Extract all the columns to numpy arrays
patient_id, assigned_pat_loc, start_time, end_time = (
    df[["id", "assigned_pat_loc", "start_time", "end_time"]].to_numpy().T
)

chunk_size = 1000 # experiment to find a size that suits you
idx_left = []
idx_right = []

for offset in range(0, len(df), chunk_size):
    chunk = slice(offset, offset + chunk_size)

    # Get a chunk of each array. The [:, None] part is to
    # raise the chunk up one dimension to prepare for numpy
    # broadcasting
    patient_id_chunk, assigned_pat_loc_chunk, start_time_chunk, end_time_chunk = [
        arr[chunk][:, None] for arr in (patient_id, assigned_pat_loc, start_time, end_time)
    ]

    # `mask` is a matrix. If mask[i, j] == True, the patient
    # in row i is sharing the room with the patient in row j
    mask = (
        # patient_ids are different
        (patient_id_chunk != patient_id)
        # in the same room
        & (assigned_pat_loc_chunk == assigned_pat_loc)
        # start_time and end_time overlap
        & (start_time_chunk < end_time)
        & (start_time < end_time_chunk)
    )

    idx = mask.nonzero()
    idx_left.extend(idx[0] + offset)
    idx_right.extend(idx[1])

result = pd.concat(
    [
        df[["id", "assigned_pat_loc", "start_time", "end_time"]]
        .iloc[idx]
        .reset_index(drop=True)
        for idx in [idx_left, idx_right]
    ],
    axis=1,
    keys=["patient_1", "patient_2"],
)

Result:

  patient_1                                                          patient_2                                                         
         id assigned_pat_loc          start_time            end_time        id assigned_pat_loc          start_time            end_time
0         1           8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00         2           8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00
1         1           8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00         2           8w^201 2011-06-01 08:00:00 2011-06-01 08:00:00
2         1           8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00         3           8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00
3         1           8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00         3           8w^201 2011-06-05 09:00:00 2011-06-05 09:00:00
4         2           10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00         4           10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00
5         2           8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00         1           8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
6         2           8w^201 2011-06-01 08:00:00 2011-06-01 08:00:00         1           8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
7         3           8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00         1           8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
8         3           8w^201 2011-06-05 09:00:00 2011-06-05 09:00:00         1           8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
9         4           10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00         2           10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00
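
To see what the mask in the loop computes, here is a minimal sketch on toy data. The arrays and their values are made up for illustration (plain integers stand in for timestamps); only the `[:, None]` broadcasting and the four conditions mirror the code above.

```python
import numpy as np

# Toy data: three stays; all values are invented for illustration
patient = np.array([1, 2, 3])
room = np.array(["A", "A", "B"])
start = np.array([0, 5, 0])   # e.g. hours since some reference time
end = np.array([10, 8, 10])

# [:, None] turns shape (3,) into (3, 1); comparing (3, 1) with (3,)
# broadcasts to a (3, 3) matrix that covers every pair of rows
mask = (
    (patient[:, None] != patient)   # different patients
    & (room[:, None] == room)       # same room
    & (start[:, None] < end)        # row i starts before row j ends
    & (start < end[:, None])        # row j starts before row i ends
)

# Row and column positions of the True cells, i.e. the overlapping pairs
left, right = mask.nonzero()
pairs = list(zip(left.tolist(), right.tolist()))
print(pairs)
```

Each overlapping pair shows up twice, once in each direction, which matches the note in the question that every room-sharing episode is listed once per patient.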

Answer 2

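Another option.

I started from the raw data after the Edit, but I changed this row

4   May/31/11   9:00    EIAB^EIAB^9                         Admission

to

4   May/31/11   6:00    EIAB^EIAB^9                         Admission

because I assume the admission time should come before the transfer time?

The first step is essentially to get a dataframe similar to the one you started with: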

df = (
    df.assign(start_time=pd.to_datetime((df["Date"] + " " + df["Time"])))
    .sort_values(["id", "start_time"])
    .assign(duration=lambda df: -df.groupby("id")["start_time"].diff(-1))
    .loc[lambda df: df["duration"].notna()]
    .assign(end_time=lambda df: df["start_time"] + df["duration"])
    .rename(columns={"assigned_pat_loc": "location"})
    [["id", "location", "start_time", "end_time"]]
)

Sample result:

    id     location          start_time            end_time
0    1  EIAB^EIAB^6 2011-05-31 08:00:00 2011-05-31 09:00:00
1    1       8w^201 2011-05-31 09:00:00 2011-06-08 15:00:00
3    2  EIAB^EIAB^4 2011-05-31 05:00:00 2011-05-31 07:00:00
4    2       10E^45 2011-05-31 07:00:00 2011-06-01 01:00:00
5    2       8w^201 2011-06-01 01:00:00 2011-06-01 08:00:00
7    3  EIAB^EIAB^2 2011-05-31 09:00:00 2011-06-01 09:00:00
8    3       8w^201 2011-06-01 09:00:00 2011-06-05 09:00:00
10   4  EIAB^EIAB^9 2011-05-31 06:00:00 2011-05-31 07:00:00
11   4       10E^45 2011-05-31 07:00:00 2011-06-01 08:00:00
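
The `-diff(-1)` step may be the least obvious part of that chain, so here is a small sketch of it in isolation. The `events` frame and its values are made up for illustration.

```python
import pandas as pd

# Made-up events for two patients, already sorted by id and time
events = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "start_time": pd.to_datetime([
        "2011-05-31 08:00", "2011-05-31 09:00", "2011-06-08 15:00",
        "2011-05-31 05:00", "2011-05-31 07:00",
    ]),
})

# diff(-1) computes (this row - next row) within each group, so negating
# it gives the time until the patient's next event. The last event of
# each patient has no successor and gets NaT.
events["duration"] = -events.groupby("id")["start_time"].diff(-1)
events["end_time"] = events["start_time"] + events["duration"]
print(events)
```

Dropping the rows where duration is NaT, as the chain above does with `notna()`, removes each patient's final event (the discharge or death row), which marks the end of a stay rather than the start of a new one.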

The next step is to merge df with itself on the location column and drop the rows where id equals r_id:

df = (
    df.merge(df, on="location")
    .rename(columns={"id_x": "id", "id_y": "r_id"})
    .loc[lambda df: df["id"] != df["r_id"]]
)
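
As a sketch of what the self-merge produces, here is the same pattern on a tiny made-up frame (`stays` and its values are assumptions for illustration). Columns that exist on both sides and are not the merge key get the default `_x` and `_y` suffixes, which is where `id_x` and `id_y` come from.

```python
import pandas as pd

# Made-up stays: patients 1 and 2 in one room, patient 3 in another
stays = pd.DataFrame({
    "id": [1, 2, 3],
    "location": ["8w^201", "8w^201", "10E^45"],
})

pairs = (
    stays.merge(stays, on="location")   # every pair of rows per location
    .rename(columns={"id_x": "id", "id_y": "r_id"})
    .loc[lambda d: d["id"] != d["r_id"]]  # drop a patient paired with itself
    .reset_index(drop=True)
)
print(pairs)
```

Note that the merged frame holds every pair of rows within a location before any time filtering, so a location with many stays produces a quadratic number of intermediate rows.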

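Then finally, use the mask m to keep the rows that actually overlap, compute the duration of the overlap, and bring the dataframe into the form you are looking for: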

m = (
    (df["start_time_x"].le(df["start_time_y"])
     & df["start_time_y"].le(df["end_time_x"]))
    | (df["start_time_y"].le(df["start_time_x"])
       & df["start_time_x"].le(df["end_time_y"]))
)
df = (
    df[m]
    .assign(
        start_time=lambda df: df[["start_time_x", "start_time_y"]].max(axis=1),
        end_time=lambda df: df[["end_time_x", "end_time_y"]].min(axis=1),
        duration=lambda df: df["end_time"] - df["start_time"]
    )
    .assign(
        start_date=lambda df: df["start_time"].dt.date,
        start_time=lambda df: df["start_time"].dt.time,
        end_date=lambda df: df["end_time"].dt.date,
        end_time=lambda df: df["end_time"].dt.time
    )
    [[
        "id", "r_id",
        "start_date", "start_time", "end_date", "end_time",
        "duration", "location"
    ]]
    .sort_values(["id", "r_id"]).reset_index(drop=True)
)

Sample result:

   id  r_id  start_date start_time    end_date  end_time        duration location
0   1     2  2011-06-01   01:00:00  2011-06-01  08:00:00 0 days 07:00:00   8w^201
1   1     3  2011-06-01   09:00:00  2011-06-05  09:00:00 4 days 00:00:00   8w^201
2   2     1  2011-06-01   01:00:00  2011-06-01  08:00:00 0 days 07:00:00   8w^201
3   2     4  2011-05-31   07:00:00  2011-06-01  01:00:00 0 days 18:00:00   10E^45
4   3     1  2011-06-01   09:00:00  2011-06-05  09:00:00 4 days 00:00:00   8w^201
5   4     2  2011-05-31   07:00:00  2011-06-01  01:00:00 0 days 18:00:00   10E^45
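
The question asks for length in hours, while duration here is a Timedelta. One possible conversion is sketched below; the `out` frame is a made-up stand-in for the final result, and the `length` column name is taken from the desired output in the question.

```python
import pandas as pd

# Made-up stand-in for the final dataframe, with the same durations
# as the first rows of the sample result
out = pd.DataFrame({
    "id": [1, 1, 2],
    "r_id": [2, 3, 4],
    "duration": pd.to_timedelta(["7h", "4 days", "18h"]),
})

# Timedelta -> hours as a float; cast to int if whole hours are enough
out["length"] = out["duration"].dt.total_seconds() / 3600
print(out)
```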

Answer 1
2 2023-01-27 13:15:11

Answer 2
2 Accepted 2023-01-27 14:22:27
