Pandas find overlapping time intervals in one column based on same date in another column for different rows
Find overlapping time intervals based on condition in another column pandas
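I have cleaned a dataset to get it into the format below. assigned_pat_loc represents a room number, so I am trying to identify when two different patients (patient_id) were in the same room at the same time; that is, rows with the same assigned_pat_loc but different patient_id whose start_time and end_time overlap. start_time and end_time give the period a particular patient spent in that room. So if those periods overlap for two patients in the same room, they shared the room. That is what I am ultimately looking for. Here is the base dataset I want to build this from: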
        patient_id  assigned_pat_loc  start_time           end_time
0       19035648    SICU^6108         2009-01-10 18:27:48  2009-02-25 15:45:54
1       19039244    85^8520           2009-01-02 06:27:25  2009-01-05 10:38:41
2       19039507    55^5514           2009-01-01 13:25:45  2009-01-01 13:25:45
3       19039555    EIAB^EIAB         2009-01-15 01:56:48  2009-02-23 11:36:34
4       19039559    EIAB^EIAB         2009-01-16 11:24:18  2009-01-19 18:41:33
...     ...         ...               ...                  ...
140906  46851413    EIAB^EIAB         2011-12-31 22:28:38  2011-12-31 23:15:49
140907  46851422    EIAB^EIAB         2011-12-31 21:52:44  2011-12-31 22:50:08
140908  46851430    4LD^4LDX          2011-12-31 22:41:10  2011-12-31 22:44:48
140909  46851434    EIC^EIC           2011-12-31 23:45:22  2011-12-31 23:45:22
140910  46851437    EIAB^EIAB         2011-12-31 22:54:40  2011-12-31 23:30:10
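I am thinking I should approach this with some kind of groupby, but I am not sure how exactly to implement it. I would show an attempt, but it took me about 6 hours just to get to this point, so I would appreciate even just some ideas.
Edit
Example of the original data: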
id Date Time assigned_pat_loc prior_pat_loc Activity
1 May/31/11 8:00 EIAB^EIAB^6 Admission
1 May/31/11 9:00 8w^201 EIAB^EIAB^6 Transfer
1 Jun/8/11 15:00 8w^201 Discharge
2 May/31/11 5:00 EIAB^EIAB^4 Admission
2 May/31/11 7:00 10E^45 EIAB^EIAB^4 Transfer
2 Jun/1/11 1:00 8w^201 10E^45 Transfer
2 Jun/1/11 8:00 8w^201 Discharge
3 May/31/11 9:00 EIAB^EIAB^2 Admission
3 Jun/1/11 9:00 8w^201 EIAB^EIAB^2 Transfer
3 Jun/5/11 9:00 8w^201 Discharge
4 May/31/11 9:00 EIAB^EIAB^9 Admission
4 May/31/11 7:00 10E^45 EIAB^EIAB^9 Transfer
4 Jun/1/11 8:00 10E^45 Death
Example of the desired output:
id  r_id  start_date  start_time  end_date  end_time  length  location
1   2     Jun/1/11    1:00        Jun/1/11  8:00      7       8w^201
1   3     Jun/1/11    9:00        Jun/5/11  9:00      96      8w^201
2   4     May/31/11   7:00        Jun/1/11  1:00      18      10E^45
2   1     Jun/1/11    1:00        Jun/1/11  8:00      7       8w^201
3   1     Jun/1/11    9:00        Jun/5/11  9:00      96      8w^201
where r_id is the "other" patient sharing the same room, and length is the time the room was shared (in hours).
In this example:
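numpy broadcasting is perfect for this. It lets you compare each record (a patient's stay in a room) against every other record in the dataframe. The downside is that it is memory intensive, because it needs n^2 * 8 bytes to store the comparison matrix. With your data of roughly 141k rows, that would require about 148 GB of memory!
We need to process the dataframe in chunks, so that the memory requirement drops to chunk_size * n * 8 bytes.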
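To put numbers on that estimate, here is a rough back-of-the-envelope sketch. The helper name `matrix_bytes` is made up for illustration, the row count 140,911 is inferred from the last index (140910) shown in the question, and 8 bytes per cell is the figure assumed above rather than something measured.

```python
def matrix_bytes(n_rows, chunk_size=None, bytes_per_cell=8):
    """Bytes needed for a (rows_in_chunk x n_rows) comparison matrix.

    bytes_per_cell=8 follows the estimate in the text; it is an assumption,
    not a measured value.
    """
    rows = n_rows if chunk_size is None else min(chunk_size, n_rows)
    return rows * n_rows * bytes_per_cell


n = 140_911  # inferred from the last row index in the question's dataframe
full = matrix_bytes(n)
chunked = matrix_bytes(n, chunk_size=1000)

print(f"full matrix:    {full / 2**30:.1f} GiB")
print(f"1000-row chunk: {chunked / 2**30:.2f} GiB")
```

Under these assumptions the full matrix comes to roughly 148 GiB, while a 1000-row chunk needs only about 1 GiB at a time.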
import numpy as np
import pandas as pd

# Don't keep date and time separately; they are hard to
# perform calculations on. Instead, combine them into a
# single column and keep it as pd.Timestamp
df["start_time"] = pd.to_datetime(df["Date"] + " " + df["Time"])

# I don't know how you determine when a patient vacates a
# room. My logic here is
#   - If Activity = Discharge or Death, end_time = start_time
#   - Otherwise, end_time = start_time of the next room
# You can implement your own logic. This part is not
# essential to the problem at hand.
df["end_time"] = np.where(
    df["Activity"].isin(["Discharge", "Death"]),
    df["start_time"],
    df.groupby("id")["start_time"].shift(-1),
)

# ------------------------------------------------------------------------------
# Extract all the columns to numpy arrays
patient_id, assigned_pat_loc, start_time, end_time = (
    df[["id", "assigned_pat_loc", "start_time", "end_time"]].to_numpy().T
)

chunk_size = 1000  # experiment to find a size that suits you
idx_left = []
idx_right = []

for offset in range(0, len(df), chunk_size):
    chunk = slice(offset, offset + chunk_size)

    # Get a chunk of each array. The [:, None] part is to
    # raise the chunk up one dimension to prepare for numpy
    # broadcasting
    patient_id_chunk, assigned_pat_loc_chunk, start_time_chunk, end_time_chunk = [
        arr[chunk][:, None] for arr in (patient_id, assigned_pat_loc, start_time, end_time)
    ]

    # `mask` is a matrix. If mask[i, j] == True, the patient
    # in row i is sharing the room with the patient in row j
    mask = (
        # patient_ids are different
        (patient_id_chunk != patient_id)
        # in the same room
        & (assigned_pat_loc_chunk == assigned_pat_loc)
        # start_time and end_time overlap
        & (start_time_chunk < end_time)
        & (start_time < end_time_chunk)
    )

    # Row indices are relative to the chunk, so shift them by the offset
    idx = mask.nonzero()
    idx_left.extend(idx[0] + offset)
    idx_right.extend(idx[1])

# Put the two matching rows of each pair side by side
result = pd.concat(
    [
        df[["id", "assigned_pat_loc", "start_time", "end_time"]]
        .iloc[idx]
        .reset_index(drop=True)
        for idx in [idx_left, idx_right]
    ],
    axis=1,
    keys=["patient_1", "patient_2"],
)
Result:
   patient_1                                                       patient_2
   id  assigned_pat_loc  start_time           end_time             id  assigned_pat_loc  start_time           end_time
0  1   8w^201            2011-05-31 09:00:00  2011-06-08 15:00:00  2   8w^201            2011-06-01 01:00:00  2011-06-01 08:00:00
1  1   8w^201            2011-05-31 09:00:00  2011-06-08 15:00:00  2   8w^201            2011-06-01 08:00:00  2011-06-01 08:00:00
2  1   8w^201            2011-05-31 09:00:00  2011-06-08 15:00:00  3   8w^201            2011-06-01 09:00:00  2011-06-05 09:00:00
3  1   8w^201            2011-05-31 09:00:00  2011-06-08 15:00:00  3   8w^201            2011-06-05 09:00:00  2011-06-05 09:00:00
4  2   10E^45            2011-05-31 07:00:00  2011-06-01 01:00:00  4   10E^45            2011-05-31 07:00:00  2011-06-01 08:00:00
5  2   8w^201            2011-06-01 01:00:00  2011-06-01 08:00:00  1   8w^201            2011-05-31 09:00:00  2011-06-08 15:00:00
6  2   8w^201            2011-06-01 08:00:00  2011-06-01 08:00:00  1   8w^201            2011-05-31 09:00:00  2011-06-08 15:00:00
7  3   8w^201            2011-06-01 09:00:00  2011-06-05 09:00:00  1   8w^201            2011-05-31 09:00:00  2011-06-08 15:00:00
8  3   8w^201            2011-06-05 09:00:00  2011-06-05 09:00:00  1   8w^201            2011-05-31 09:00:00  2011-06-08 15:00:00
9  4   10E^45            2011-05-31 07:00:00  2011-06-01 08:00:00  2   10E^45            2011-05-31 07:00:00  2011-06-01 01:00:00
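The question also asks for the length of the shared stay, which `result` does not contain yet. Below is a minimal sketch of one way to derive it, assuming the two-level column layout shown above. The two sample pairs are copied from the result; the `summary` frame and its column names are only illustrative.

```python
import pandas as pd

# Two pairs copied from the result above: patient 1 with patients 2 and 3.
p1 = pd.DataFrame({
    "id": [1, 1],
    "assigned_pat_loc": ["8w^201", "8w^201"],
    "start_time": pd.to_datetime(["2011-05-31 09:00", "2011-05-31 09:00"]),
    "end_time": pd.to_datetime(["2011-06-08 15:00", "2011-06-08 15:00"]),
})
p2 = pd.DataFrame({
    "id": [2, 3],
    "assigned_pat_loc": ["8w^201", "8w^201"],
    "start_time": pd.to_datetime(["2011-06-01 01:00", "2011-06-01 09:00"]),
    "end_time": pd.to_datetime(["2011-06-01 08:00", "2011-06-05 09:00"]),
})
result = pd.concat([p1, p2], axis=1, keys=["patient_1", "patient_2"])

# The shared stay runs from the later of the two starts
# to the earlier of the two ends.
overlap_start = result.xs("start_time", axis=1, level=1).max(axis=1)
overlap_end = result.xs("end_time", axis=1, level=1).min(axis=1)

summary = pd.DataFrame({
    "id": result[("patient_1", "id")],
    "r_id": result[("patient_2", "id")],
    "location": result[("patient_1", "assigned_pat_loc")],
    "start": overlap_start,
    "end": overlap_end,
    # Timedelta -> hours as a float
    "length": (overlap_end - overlap_start).dt.total_seconds() / 3600,
})
print(summary)
```

Rows that come from a Discharge or Death record have start_time equal to end_time, so they would get a length of 0 here and can be filtered out if they are not wanted.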
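Another option.
I started from the raw data shown after the EDIT, but I have changed this line
4 May/31/11 9:00 EIAB^EIAB^9 Admission
to
4 May/31/11 6:00 EIAB^EIAB^9 Admission
because I think the admission time should come before the transfer time?
The first step is essentially to get a dataframe similar to the one you started with: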
import pandas as pd

df = (
    # Combine Date and Time into a single timestamp column
    df.assign(start_time=pd.to_datetime((df["Date"] + " " + df["Time"])))
    .sort_values(["id", "start_time"])
    # Time until the same patient's next record
    .assign(duration=lambda df: -df.groupby("id")["start_time"].diff(-1))
    # The last record per patient has no successor, so drop it
    .loc[lambda df: df["duration"].notna()]
    .assign(end_time=lambda df: df["start_time"] + df["duration"])
    .rename(columns={"assigned_pat_loc": "location"})
    [["id", "location", "start_time", "end_time"]]
)
Result for the sample:
    id  location     start_time           end_time
0   1   EIAB^EIAB^6  2011-05-31 08:00:00  2011-05-31 09:00:00
1   1   8w^201       2011-05-31 09:00:00  2011-06-08 15:00:00
3   2   EIAB^EIAB^4  2011-05-31 05:00:00  2011-05-31 07:00:00
4   2   10E^45       2011-05-31 07:00:00  2011-06-01 01:00:00
5   2   8w^201       2011-06-01 01:00:00  2011-06-01 08:00:00
7   3   EIAB^EIAB^2  2011-05-31 09:00:00  2011-06-01 09:00:00
8   3   8w^201       2011-06-01 09:00:00  2011-06-05 09:00:00
10  4   EIAB^EIAB^9  2011-05-31 06:00:00  2011-05-31 07:00:00
11  4   10E^45       2011-05-31 07:00:00  2011-06-01 08:00:00
The next step is to merge df with itself on the location column and drop the rows where id is the same as r_id:
df = (
    df.merge(df, on="location")
    .rename(columns={"id_x": "id", "id_y": "r_id"})
    .loc[lambda df: df["id"] != df["r_id"]]
)
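In case the `_x` / `_y` column names used from here on look unfamiliar: they are the default suffixes pandas adds when both sides of a merge carry columns with the same name. A small self-contained sketch on made-up stays (the frame `stays` and its three rows are for illustration only):

```python
import pandas as pd

# Three illustrative stays: two patients in 8w^201, one in 10E^45.
stays = pd.DataFrame({
    "id": [1, 2, 3],
    "location": ["8w^201", "8w^201", "10E^45"],
    "start_time": pd.to_datetime(
        ["2011-05-31 09:00", "2011-06-01 01:00", "2011-05-31 07:00"]
    ),
    "end_time": pd.to_datetime(
        ["2011-06-08 15:00", "2011-06-01 08:00", "2011-06-01 01:00"]
    ),
})

# Self-merge: every stay is paired with every stay in the same location,
# including itself. Non-key columns get the "_x" (left) / "_y" (right) suffix.
pairs = stays.merge(stays, on="location")
print(pairs.columns.tolist())

# Keep only pairs of two different patients.
pairs = pairs.rename(columns={"id_x": "id", "id_y": "r_id"})
pairs = pairs.loc[pairs["id"] != pairs["r_id"]]
print(pairs[["id", "r_id", "location"]])
```

Note that each pair shows up once per direction (1 with 2, and 2 with 1), which matches the desired output in the question. The self-merge also grows quadratically with the number of stays per location, so a very busy location can still be expensive.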
Then, finally, use the mask m to keep only the rows with an actual overlap, compute the duration of the overlap, and bring the dataframe into the form you are looking for:
# True where the two stays overlap (touching endpoints count as overlap)
m = (
    (df["start_time_x"].le(df["start_time_y"])
     & df["start_time_y"].le(df["end_time_x"]))
    | (df["start_time_y"].le(df["start_time_x"])
       & df["start_time_x"].le(df["end_time_y"]))
)
df = (
    df[m]
    # The overlap runs from the later start to the earlier end
    .assign(
        start_time=lambda df: df[["start_time_x", "start_time_y"]].max(axis=1),
        end_time=lambda df: df[["end_time_x", "end_time_y"]].min(axis=1),
        duration=lambda df: df["end_time"] - df["start_time"]
    )
    # Split the timestamps into separate date and time columns
    .assign(
        start_date=lambda df: df["start_time"].dt.date,
        start_time=lambda df: df["start_time"].dt.time,
        end_date=lambda df: df["end_time"].dt.date,
        end_time=lambda df: df["end_time"].dt.time
    )
    [[
        "id", "r_id",
        "start_date", "start_time", "end_date", "end_time",
        "duration", "location"
    ]]
    .sort_values(["id", "r_id"]).reset_index(drop=True)
)
Result for the sample:
   id  r_id  start_date  start_time  end_date    end_time  duration         location
0  1   2     2011-06-01  01:00:00    2011-06-01  08:00:00  0 days 07:00:00  8w^201
1  1   3     2011-06-01  09:00:00    2011-06-05  09:00:00  4 days 00:00:00  8w^201
2  2   1     2011-06-01  01:00:00    2011-06-01  08:00:00  0 days 07:00:00  8w^201
3  2   4     2011-05-31  07:00:00    2011-06-01  01:00:00  0 days 18:00:00  10E^45
4  3   1     2011-06-01  09:00:00    2011-06-05  09:00:00  4 days 00:00:00  8w^201
5  4   2     2011-05-31  07:00:00    2011-06-01  01:00:00  0 days 18:00:00  10E^45
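One detail worth checking against your own definition of "sharing a room": the mask m uses `le` (less than or equal), so two stays that merely touch, with one patient leaving at the exact moment another arrives, count as an overlap of zero duration. The sketch below illustrates the difference with a hypothetical helper `overlaps`; it assumes every interval has start <= end, and the two timestamps pairs are made up for the example.

```python
import pandas as pd


def overlaps(start_a, end_a, start_b, end_b, closed=True):
    """Return True if the intervals [start_a, end_a] and [start_b, end_b] overlap.

    Assumes start <= end for both intervals.
    closed=True  counts intervals that merely touch (like the `le`-based mask).
    closed=False requires a shared stretch of positive length.
    """
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    if closed:
        return latest_start <= earliest_end
    return latest_start < earliest_end


t = pd.Timestamp
# One stay ends at 01:00; a hypothetical second stay starts at exactly 01:00.
a = (t("2011-05-31 07:00"), t("2011-06-01 01:00"))
b = (t("2011-06-01 01:00"), t("2011-06-01 08:00"))

print(overlaps(*a, *b, closed=True))   # touching endpoints count
print(overlaps(*a, *b, closed=False))  # touching endpoints do not count
```

If zero-length overlaps are not wanted, replacing `le` with `lt` in m, or filtering on a positive duration afterwards, would be one way to drop them.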