Python pandas: Accessing data from multiple data frames based on condition

I have to calculate a metric that requires me to look up the attributes of the same user in multiple data frames. For example, I have the two data frames shown below:

calls_per_month.head(10)
    user_id month   call_date
0   1000    12  16
1   1001    8   27
2   1001    9   49
3   1001    10  65
4   1001    11  64
5   1001    12  56
6   1002    10  11
7   1002    11  55
8   1002    12  47
9   1003    12  149

internet_per_month.head(10)

    user_id session_date    mb_used
0   1000    12  2000.0
1   1001    8   7000.0
2   1001    9   14000.0
3   1001    10  23000.0
4   1001    11  19000.0
5   1001    12  20000.0
6   1002    10  7000.0
7   1002    11  20000.0
8   1002    12  15000.0
9   1003    12  28000.0

I want to calculate a metric for each user_id, for every month in which they used the internet or made a call, that looks like `usage = mb_used + call_date`. The result would be a column like this (calculated by hand):

    user_id month   usage
0   1000    12  2016
1   1001    8   7027
2   1001    9   14049
3   1001    10  23065
4   1001    11  19064
5   1001    12  20056
6   1002    10  7011
7   1002    11  20055
8   1002    12  15047
9   1003    12  28149

The head shown above does not include such a case, but some users did not make a call in a particular month and still used data. I have to account for that: those users should not be dropped, and the value that is not available should be treated as 0.

Should I first do an outer join of the tables? Or is creating a new table not the correct way to do it? Any guidance is appreciated.

Thank you

You should merge or join these first, then do the operation. Here I'm doing a left join on internet_per_month, followed by a call to fillna so that months without calls count as 0. If it's possible that someone made calls but did not use the internet, an outer join would be preferable.

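import pandas as pd

# Left join: keep every (user_id, month) row from internet_per_month
df = pd.merge(
    left=internet_per_month,
    right=calls_per_month,
    how="left",
    left_on=["user_id", "session_date"],
    right_on=["user_id", "month"],
)

# fillna returns a new data frame, so assign the result back
df = df.fillna(0)
df["usage"] = df["mb_used"] + df["call_date"]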

Output:

   user_id  month  call_date  session_date  mb_used    usage
0     1000     12         16            12   2000.0   2016.0
1     1001      8         27             8   7000.0   7027.0
2     1001      9         49             9  14000.0  14049.0
3     1001     10         65            10  23000.0  23065.0
4     1001     11         64            11  19000.0  19064.0
5     1001     12         56            12  20000.0  20056.0
6     1002     10         11            10   7000.0   7011.0
7     1002     11         55            11  20000.0  20055.0
8     1002     12         47            12  15000.0  15047.0
9     1003     12        149            12  28000.0  28149.0
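
For the case mentioned above, where a user might have calls in a month but no internet usage (or the reverse), an outer join could look roughly like the sketch below. The sample rows are made up for illustration and are not the asker's data, and renaming session_date to month before merging is just one assumed way to end up with a single month column.

```python
import pandas as pd

# Made-up sample data (hypothetical): user 1004 only made calls,
# and user 1001 used the internet in month 9 without making calls.
calls_per_month = pd.DataFrame({
    "user_id": [1000, 1001, 1004],
    "month": [12, 8, 11],
    "call_date": [16, 27, 30],
})
internet_per_month = pd.DataFrame({
    "user_id": [1000, 1001, 1001],
    "session_date": [12, 8, 9],
    "mb_used": [2000.0, 7000.0, 14000.0],
})

# Use the same key name in both frames so the merge yields one month column.
internet = internet_per_month.rename(columns={"session_date": "month"})

# Outer join: keep (user_id, month) pairs that appear in either frame.
usage = pd.merge(internet, calls_per_month, how="outer", on=["user_id", "month"])

# A pair missing from one side is NaN after the merge; treat it as 0.
usage[["mb_used", "call_date"]] = usage[["mb_used", "call_date"]].fillna(0)
usage["usage"] = usage["mb_used"] + usage["call_date"]

usage = usage.sort_values(["user_id", "month"]).reset_index(drop=True)
print(usage[["user_id", "month", "usage"]])
```

Filling only the two value columns, rather than the whole frame, avoids writing 0 into a key column by accident.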
