Compare data in two dataframes
I have two dataframes (members and expeditions). In expeditions, there is a column (also called members) that gives the number of members of each expedition. In members, each expedition member is linked to an expedition through expedition_id, which is the key between the two dataframes. I have calculated the total number of members for each expedition_id, and I would like to check whether the number of members given in expeditions matches the one I calculated. Can you help me?
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")
Number of members by expedition:
nbre_membres_expedition = members[["expedition_id", "member_id"]].groupby("expedition_id", as_index = False).count()
nbre_membres_expedition
To check the difference, use merge and keep only the rows where the values in the 2 columns differ:
nbre_memb_exp = members.value_counts('expedition_id').rename('nbre_memb_exp')
nbre_exp_memb = expeditions.set_index('expedition_id')['members'].rename('nbre_exp_memb')
diff_df = pd.merge(nbre_memb_exp, nbre_exp_memb,
                   left_index=True, right_index=True, how='outer') \
            .query('nbre_memb_exp != nbre_exp_memb')
Output:
>>> diff_df
nbre_memb_exp nbre_exp_memb
expedition_id
ACHN15302 11.0 9 # + hired_staff=2
ACHN18301 9.0 8 # + hired_staff=1
AMAD00106 3.0 1 # + hired_staff=2
AMAD00110 10.0 8 # + hired_staff=3 ???
AMAD00112 5.0 3 # + hired_staff=2
... ... ...
YALU88301 10.0 8
YALU89301 10.0 8
YALU89401 7.0 5
YAUP13301 4.0 2
YAUP17101 9.0 6
[5431 rows x 2 columns]
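To see what the merge and filter do on a small scale, here is a minimal sketch on made-up toy data (the ids A1 to D4 and their counts are invented for illustration, they are not from the real files). It uses the same column names as above. Because the merge is an outer join, an expedition_id that is present on only one side gets NaN on the other, which is also why nbre_memb_exp is shown as a float in the output above.

```python
import pandas as pd

# Made-up toy data (NOT the real TidyTuesday files), same column names as above.
members = pd.DataFrame({
    "expedition_id": ["A1", "A1", "A1", "B2", "B2", "C3"],
    "member_id": ["A1-01", "A1-02", "A1-03", "B2-01", "B2-02", "C3-01"],
})
expeditions = pd.DataFrame({
    "expedition_id": ["A1", "B2", "D4"],
    "members": [3, 1, 2],
})

# Number of rows per expedition in members (the computed count).
nbre_memb_exp = members.value_counts("expedition_id").rename("nbre_memb_exp")
# Number of members declared in expeditions.
nbre_exp_memb = expeditions.set_index("expedition_id")["members"].rename("nbre_exp_memb")

# Outer join on the index, then keep only the rows where the two counts differ.
# NaN != anything is True, so ids missing on one side are kept as well.
diff_df = pd.merge(nbre_memb_exp, nbre_exp_memb,
                   left_index=True, right_index=True, how="outer") \
            .query("nbre_memb_exp != nbre_exp_memb")
print(diff_df)
```

Here A1 matches (3 and 3) and is dropped, B2 differs (2 and 1), and C3 and D4 are kept because each exists in only one of the two frames.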
I think you have to add the hired_staff column to members. Change the previous line nbre_exp_memb =... to:
nbre_exp_memb = expeditions.set_index('expedition_id')[['members', 'hired_staff']] \
.sum(axis=1).rename('nbre_exp_memb')
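As a quick sanity check of that idea, here is a small sketch on made-up toy data (the ids A1 and B2 and the counts are invented, not taken from the real files). It assumes, as guessed above, that the rows in members include hired staff. It only shows that the comparison behaves as expected once hired_staff is added.

```python
import pandas as pd

# Made-up toy data (NOT the real files): A1 has 3 rows in members, B2 has 2.
members = pd.DataFrame({"expedition_id": ["A1", "A1", "A1", "B2", "B2"]})
expeditions = pd.DataFrame({
    "expedition_id": ["A1", "B2"],
    "members": [2, 2],
    "hired_staff": [1, 1],
})

# Computed count of rows per expedition.
nbre_memb_exp = members.value_counts("expedition_id").rename("nbre_memb_exp")
# Declared count: members + hired_staff, summed row by row.
nbre_exp_memb = expeditions.set_index("expedition_id")[["members", "hired_staff"]] \
                           .sum(axis=1).rename("nbre_exp_memb")

# Same comparison as before: keep only the expeditions where the counts differ.
diff_df = pd.merge(nbre_memb_exp, nbre_exp_memb,
                   left_index=True, right_index=True, how="outer") \
            .query("nbre_memb_exp != nbre_exp_memb")
print(diff_df)
```

With these toy values, A1 now matches (3 rows against 2 + 1) and only B2 remains in diff_df (2 rows against 2 + 1).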