Python pandas: Accessing data from multiple data frames based on a condition
I have to calculate a metric that requires me to find the attributes of the same 'user' from multiple columns. For example, I have the two data frames shown below:
calls_per_month.head(10)
user_id month call_date
0 1000 12 16
1 1001 8 27
2 1001 9 49
3 1001 10 65
4 1001 11 64
5 1001 12 56
6 1002 10 11
7 1002 11 55
8 1002 12 47
9 1003 12 149
internet_per_month.head(10)
user_id session_date mb_used
0 1000 12 2000.0
1 1001 8 7000.0
2 1001 9 14000.0
3 1001 10 23000.0
4 1001 11 19000.0
5 1001 12 20000.0
6 1002 10 7000.0
7 1002 11 20000.0
8 1002 12 15000.0
9 1003 12 28000.0
I want to calculate a metric for each user_id, for every month in which they used the internet or made a call: `usage = mb_used + call_date`. The result would be a column that looks like this (calculated by hand):
user_id month usage
0 1000 12 2016
1 1001 8 7027
2 1001 9 14049
3 1001 10 23065
4 1001 11 19064
5 1001 12 20056
6 1002 10 7011
7 1002 11 20055
8 1002 12 15047
9 1003 12 28149
The head shown above does not include such a case, but some users did not make a call in a particular month and still used data. I have to account for that: those users should not be dropped, and the missing value should simply count as 0.
Should I first do an outer join of the tables? Or is creating a new table not the correct way to do it? Any guidance is appreciated.
Thank you
You should merge or join these first, then do the operation. Here I'm doing a left join on internet_per_month (and a call to fillna); if it's possible that someone made calls but did not use the internet, an outer join would be preferable.
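import pandas as pd

# Left join: keep every row of internet_per_month and attach matching calls
df = pd.merge(
    left=internet_per_month,
    right=calls_per_month,
    how="left",
    left_on=["user_id", "session_date"],
    right_on=["user_id", "month"],
)
# fillna returns a new DataFrame, so the result has to be assigned back
df = df.fillna(0)
df["usage"] = df["mb_used"] + df["call_date"]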
Output:
user_id month call_date session_date mb_used usage
0 1000 12 16 12 2000.0 2016.0
1 1001 8 27 8 7000.0 7027.0
2 1001 9 49 9 14000.0 14049.0
3 1001 10 65 10 23000.0 23065.0
4 1001 11 64 11 19000.0 19064.0
5 1001 12 56 12 20000.0 20056.0
6 1002 10 11 10 7000.0 7011.0
7 1002 11 55 11 20000.0 20055.0
8 1002 12 47 12 15000.0 15047.0
9 1003 12 149 12 28000.0 28149.0
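For the case mentioned above, where someone made calls but did not use the internet in a given month (or the other way round), a minimal sketch of the outer-join variant might look like the following. The sample rows for users 1004 and 1005 are made up for illustration and are not from the question; renaming session_date to month is just one way to get a shared key.

```python
import pandas as pd

# Hypothetical sample data: user 1004 only used the internet,
# user 1005 only made calls.
calls_per_month = pd.DataFrame({
    "user_id": [1000, 1001, 1005],
    "month": [12, 8, 12],
    "call_date": [16, 27, 30],
})
internet_per_month = pd.DataFrame({
    "user_id": [1000, 1001, 1004],
    "session_date": [12, 8, 11],
    "mb_used": [2000.0, 7000.0, 5000.0],
})

# Rename so both frames share the same key columns.
internet = internet_per_month.rename(columns={"session_date": "month"})

# Outer join keeps every (user_id, month) pair present in either frame.
merged = pd.merge(calls_per_month, internet, how="outer", on=["user_id", "month"])

# Fill only the value columns; the key columns are never missing here.
value_cols = ["call_date", "mb_used"]
merged[value_cols] = merged[value_cols].fillna(0)
merged["usage"] = merged["mb_used"] + merged["call_date"]

usage = (
    merged[["user_id", "month", "usage"]]
    .sort_values(["user_id", "month"])
    .reset_index(drop=True)
)
print(usage)
```

Filling only call_date and mb_used, rather than the whole frame, avoids turning a missing key into 0 by accident, which can happen with the left join when the keys have different names.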