Compare data in two dataframes

I have two dataframes, members and expeditions. In expeditions, a column (also called members) gives the number of members of each expedition. In members, each person is linked to an expedition through expedition_id, which is the key between the two dataframes. I have computed the total number of members per expedition_id, and I would like to check whether the number given in expeditions matches the one I computed. Can you help me?

import pandas as pd

members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")

Number of members per expedition:

nbre_membres_expedition = members[["expedition_id", "member_id"]].groupby("expedition_id", as_index = False).count()

nbre_membres_expedition

To check the difference, merge the two counts and keep only the rows where the two columns differ:

nbre_memb_exp = members.value_counts('expedition_id').rename('nbre_memb_exp')
nbre_exp_memb = expeditions.set_index('expedition_id')['members'].rename('nbre_exp_memb')

diff_df = pd.merge(nbre_memb_exp, nbre_exp_memb, 
                   left_index=True, right_index=True, how='outer') \
            .query('nbre_memb_exp != nbre_exp_memb')

Output:

>>> diff_df
               nbre_memb_exp  nbre_exp_memb
expedition_id                              
ACHN15302               11.0              9  # + hired_staff=2
ACHN18301                9.0              8  # + hired_staff=1
AMAD00106                3.0              1  # + hired_staff=2
AMAD00110               10.0              8  # + hired_staff=3 ???
AMAD00112                5.0              3  # + hired_staff=2
...                      ...            ...
YALU88301               10.0              8
YALU89301               10.0              8
YALU89401                7.0              5
YAUP13301                4.0              2
YAUP17101                9.0              6

[5431 rows x 2 columns]
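
One detail of the filter above is worth illustrating: with how='outer', an expedition that appears in only one table gets NaN on the other side, and NaN != x is always true, so those rows land in diff_df alongside genuine mismatches. The float values in nbre_memb_exp suggest this may be happening in the real data. The sketch below separates the two cases with merge's indicator option. The toy rows and the names merged, mismatch and one_sided are made up for illustration and are not taken from the CSV files.

```python
import pandas as pd

# Toy data (hypothetical, not from the real CSV files).
members = pd.DataFrame({
    "expedition_id": ["A1", "A1", "A1", "B2", "B2", "C3"],
    "member_id": ["A1-01", "A1-02", "A1-03", "B2-01", "B2-02", "C3-01"],
})
expeditions = pd.DataFrame({
    "expedition_id": ["A1", "B2", "D4"],
    "members": [3, 1, 2],
})

# Number of rows per expedition in members, indexed by expedition_id.
nbre_memb_exp = members.value_counts("expedition_id").rename("nbre_memb_exp")
# Declared member count from expeditions, on the same index.
nbre_exp_memb = expeditions.set_index("expedition_id")["members"].rename("nbre_exp_memb")

# indicator=True adds a _merge column saying which side each row came from.
merged = pd.merge(nbre_memb_exp, nbre_exp_memb,
                  left_index=True, right_index=True,
                  how="outer", indicator=True)

both = merged["_merge"] == "both"
differs = merged["nbre_memb_exp"] != merged["nbre_exp_memb"]

# Expeditions present in both tables whose counts disagree.
mismatch = merged[both & differs]
# Expeditions present in only one table (NaN on the other side).
one_sided = merged[~both]

print(mismatch)
print(one_sided)
```

Here only B2 is a true mismatch (2 counted, 1 declared), while C3 and D4 are each missing from one of the two tables.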

I think you have to add the hired_staff column to members before comparing. Replace the earlier line nbre_exp_memb = ... with:

nbre_exp_memb = expeditions.set_index('expedition_id')[['members', 'hired_staff']] \
                           .sum(axis=1).rename('nbre_exp_memb')
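
As a minimal sketch of re-running the comparison with this change, the example below uses made-up rows (not the real CSV data) in which one expedition reconciles once hired_staff is added and one does not. Some real rows may also stay in the result: AMAD00110 in the output above shows 10 counted against 8 members with hired_staff=3.

```python
import pandas as pd

# Toy data (hypothetical values, not from the real CSV files).
members = pd.DataFrame({
    "expedition_id": ["A1", "A1", "A1", "B2", "B2"],
    "member_id": ["A1-01", "A1-02", "A1-03", "B2-01", "B2-02"],
})
expeditions = pd.DataFrame({
    "expedition_id": ["A1", "B2"],
    "members": [2, 1],
    "hired_staff": [1, 0],
})

# Rows per expedition in members.
nbre_memb_exp = members.value_counts("expedition_id").rename("nbre_memb_exp")
# Declared headcount: members plus hired_staff, summed row by row.
nbre_exp_memb = (expeditions.set_index("expedition_id")[["members", "hired_staff"]]
                 .sum(axis=1).rename("nbre_exp_memb"))

# Keep only the expeditions whose two counts still disagree.
diff_df = (pd.merge(nbre_memb_exp, nbre_exp_memb,
                    left_index=True, right_index=True, how="outer")
           .query("nbre_memb_exp != nbre_exp_memb"))

print(diff_df)
```

A1 drops out of diff_df (3 counted, 2 + 1 declared), while B2 remains (2 counted, 1 + 0 declared).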
