Efficient way of calculating the number of concurrent calls by one user to a distinct phone number using Python pandas?
I have a large DataFrame of calls made by users to different phone numbers:
import pandas as pd

calls = {
    'user': ['a', 'b', 'b', 'b', 'c', 'c'],
    'number': ['+1 11', '+2 22', '+2 22', '+1 11', '+4 44', '+1 11'],
    'start_time': ['00:00:00', '00:02:00', '00:03:00', '00:00:00', '00:00:00', '00:00:00'],
    'end_time': ['00:05:00', '00:03:01', '00:05:00', '00:05:00', '00:02:00', '00:02:00']
}
df = pd.DataFrame(calls)
| | user | number | start_time | end_time |
|---|---|---|---|---|
| 0 | a | +1 11 | 00:00:00 | 00:05:00 |
| 1 | b | +2 22 | 00:02:00 | 00:03:01 |
| 2 | b | +2 22 | 00:03:00 | 00:05:00 |
| 3 | b | +1 11 | 00:00:00 | 00:05:00 |
| 4 | c | +4 44 | 00:00:00 | 00:02:00 |
| 5 | c | +1 11 | 00:00:00 | 00:02:00 |
I am trying to calculate the maximum number of concurrent (parallel) calls from one user to each distinct number:
from collections import defaultdict

res = pd.DataFrame([])
grouped_by_user = df.groupby('user')
user_dict = defaultdict(lambda: {'number_dict': None})
for user in grouped_by_user.groups:
    user_group = grouped_by_user.get_group(user)
    grouped_by_number = user_group.groupby('number')
    number_dict = defaultdict(lambda: {'max_calls': None})
    for number in grouped_by_number.groups:
        number_group = grouped_by_number.get_group(number)
        calls = []
        for i in number_group.index:
            # Count the calls in this group that are active when call i starts.
            calls.append(len(number_group[(number_group["start_time"] <= number_group.loc[i, "start_time"]) & (number_group["end_time"] > number_group.loc[i, "start_time"])]))
        number_dict[number]['max_calls'] = max(calls)
    user_dict[user]['number_dict'] = number_dict
    tmp_list = []
    for num, calls in number_dict.items():
        tmp_list.append([user, num, calls['max_calls']])
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
    res = pd.concat([res, pd.DataFrame(tmp_list)], ignore_index=True)
The resulting DataFrame looks like this:
| | user | number | max |
|---|---|---|---|
| 0 | a | +1 11 | 1 |
| 1 | b | +1 11 | 1 |
| 2 | b | +2 22 | 2 |
| 3 | c | +1 11 | 1 |
| 4 | c | +4 44 | 1 |
However, this code is very slow on large DataFrames. Is there a better way to do this, or how can I make this code more time-efficient?
Try:
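df["start_time"] = pd.to_datetime(df["start_time"])
df["end_time"] = pd.to_datetime(df["end_time"])


def fn(x):
    # Expand every call into one timestamp per second during which it is active.
    x["tmp1"] = x.apply(
        lambda y: pd.date_range(y["start_time"], y["end_time"], freq="1s"),
        axis=1,
    )
    x = x.explode("tmp1")
    # Keep only the seconds shared by at least two calls and take the highest count.
    # If no calls overlap, this is NaN, which is replaced by 1 below.
    return (
        x.loc[x.duplicated(subset=["tmp1"], keep=False), "tmp1"]
        .value_counts()
        .max()
    )


print(
    df.groupby(["user", "number"])
    .apply(fn)
    .to_frame(name="max")
    .reset_index()
    .fillna(1)
)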
Prints:
  user number  max
0    a  +1 11  1.0
1    b  +1 11  1.0
2    b  +2 22  2.0
3    c  +1 11  1.0
4    c  +4 44  1.0
And for this input:
calls = {
    "user": ["a", "b", "b", "b", "c"],
    "number": ["+1 11", "+1 11", "+1 11", "+1 11", "+1 11"],
    "start_time": ["00:00:00", "00:04:00", "00:00:00", "00:03:00", "00:00:00"],
    "end_time": ["00:05:00", "00:08:00", "00:05:00", "00:05:30", "00:02:00"],
}
Prints:
  user number  max
0    a  +1 11  1.0
1    b  +1 11  3.0
2    c  +1 11  1.0
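Expanding every call into one row per second can get expensive when calls are long. As a possible alternative (not part of the answer above), here is a minimal sketch of an event-based "sweep" approach: each call contributes a +1 event at its start and a -1 event at its end, and a running sum per (user, number) group gives the number of active calls. The function name `max_concurrent_calls` and the helper columns `time`, `delta` and `active` are made up for this illustration. The sketch assumes half-open intervals `[start_time, end_time)`, as in the question's loop.

```python
import pandas as pd


def max_concurrent_calls(df: pd.DataFrame) -> pd.DataFrame:
    """Maximum number of simultaneously active calls per (user, number)."""
    # One +1 event per call start and one -1 event per call end.
    starts = (
        df[["user", "number", "start_time"]]
        .rename(columns={"start_time": "time"})
        .assign(delta=1)
    )
    ends = (
        df[["user", "number", "end_time"]]
        .rename(columns={"end_time": "time"})
        .assign(delta=-1)
    )
    events = pd.concat([starts, ends], ignore_index=True)

    # Sorting by delta last puts ends (-1) before starts (+1) at the same
    # timestamp, so a call ending exactly when another starts does not overlap.
    events = events.sort_values(["user", "number", "time", "delta"])

    # Running sum of deltas = number of calls active after each event.
    events["active"] = events.groupby(["user", "number"])["delta"].cumsum()

    return (
        events.groupby(["user", "number"])["active"]
        .max()
        .rename("max")
        .reset_index()
    )


calls = {
    "user": ["a", "b", "b", "b", "c", "c"],
    "number": ["+1 11", "+2 22", "+2 22", "+1 11", "+4 44", "+1 11"],
    "start_time": ["00:00:00", "00:02:00", "00:03:00", "00:00:00", "00:00:00", "00:00:00"],
    "end_time": ["00:05:00", "00:03:01", "00:05:00", "00:05:00", "00:02:00", "00:02:00"],
}
df = pd.DataFrame(calls)
# Timedeltas sort chronologically; any sortable time type would work here.
df["start_time"] = pd.to_timedelta(df["start_time"])
df["end_time"] = pd.to_timedelta(df["end_time"])

result = max_concurrent_calls(df)
print(result)
```

On the two sample inputs this gives the same maxima as the answer above (as integers rather than floats). One difference to keep in mind: `pd.date_range` includes both endpoints, so the per-second approach treats a call that ends at the exact second another begins as overlapping, while this sketch does not. Reversing the sort order of `delta` would switch to that behaviour.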