Compare data in two dataframes

Question

I have two dataframes, members and expeditions. expeditions has a column (also called members) that gives the number of members of each expedition, and in members each person is linked to an expedition_id, which connects the two dataframes. For each expedition_id I have counted the total number of members, and I would like to check whether the number given in expeditions matches the count I calculated. Can you help me?

import pandas as pd

members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")

Number of members by expedition

nbre_membres_expedition = members[["expedition_id", "member_id"]].groupby("expedition_id", as_index = False).count()

nbre_membres_expedition

Answer 1

To check the difference, merge the two counts and keep the rows where the two columns differ:

nbre_memb_exp = members.value_counts('expedition_id').rename('nbre_memb_exp')
nbre_exp_memb = expeditions.set_index('expedition_id')['members'].rename('nbre_exp_memb')

diff_df = pd.merge(nbre_memb_exp, nbre_exp_memb, 
                   left_index=True, right_index=True, how='outer') \
            .query('nbre_memb_exp != nbre_exp_memb')

Output:

>>> diff_df
               nbre_memb_exp  nbre_exp_memb
expedition_id                              
ACHN15302               11.0              9  # + hired_staff=2
ACHN18301                9.0              8  # + hired_staff=1
AMAD00106                3.0              1  # + hired_staff=2
AMAD00110               10.0              8  # + hired_staff=3 ???
AMAD00112                5.0              3  # + hired_staff=2
...                      ...            ...
YALU88301               10.0              8
YALU89301               10.0              8
YALU89401                7.0              5
YAUP13301                4.0              2
YAUP17101                9.0              6

[5431 rows x 2 columns]
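
The float values in nbre_memb_exp suggest that the outer merge introduced NaN for expeditions that have no rows in members, and since NaN != x is always true, those rows are reported as differences too. A minimal sketch with made-up toy data (the IDs and numbers below are illustrative, not from the real CSVs) that uses the indicator option of pd.merge to separate the two cases:

```python
import pandas as pd

# Toy data for illustration only; not taken from the real CSV files.
counted = pd.Series({"A1": 3, "B2": 2}, name="nbre_memb_exp")
reported = pd.Series({"A1": 3, "B2": 1, "C3": 4}, name="nbre_exp_memb")

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
merged = pd.merge(counted, reported,
                  left_index=True, right_index=True,
                  how="outer", indicator=True)

# Expeditions present on one side only (their missing count is NaN).
missing = merged[merged["_merge"] != "both"]

# Expeditions present on both sides whose counts really disagree.
in_both = merged["_merge"] == "both"
mismatch = merged[in_both & (merged["nbre_memb_exp"] != merged["nbre_exp_memb"])]

print(missing)
print(mismatch)
```

Here C3 lands in missing because it has no counted members, while B2 lands in mismatch.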

I think you have to add the hired_staff column. Replace the earlier line nbre_exp_memb = ... with:

nbre_exp_memb = expeditions.set_index('expedition_id')[['members', 'hired_staff']] \
                           .sum(axis=1).rename('nbre_exp_memb')
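
To see the whole comparison end to end without downloading the files, here is a small sketch on made-up toy frames (all values are hypothetical). It assumes, as suggested above, that the rows in members include hired staff. It counts with groupby(...).size() rather than value_counts, which should give the same per-expedition counts.

```python
import pandas as pd

# Toy stand-ins for the real CSVs; all values are made up for illustration.
members = pd.DataFrame({
    "expedition_id": ["A1", "A1", "A1", "B2", "B2", "C3"],
    "member_id": ["A1-01", "A1-02", "A1-03", "B2-01", "B2-02", "C3-01"],
})
expeditions = pd.DataFrame({
    "expedition_id": ["A1", "B2", "C3"],
    "members": [2, 2, 3],
    "hired_staff": [1, 0, 0],
})

# Number of rows per expedition in members.
nbre_memb_exp = members.groupby("expedition_id").size().rename("nbre_memb_exp")

# Reported head count: members plus hired staff.
nbre_exp_memb = (
    expeditions.set_index("expedition_id")[["members", "hired_staff"]]
    .sum(axis=1)
    .rename("nbre_exp_memb")
)

# Keep only expeditions where the two counts disagree.
diff_df = pd.merge(
    nbre_memb_exp, nbre_exp_memb,
    left_index=True, right_index=True, how="outer",
).query("nbre_memb_exp != nbre_exp_memb")

print(diff_df)
```

With these toy values, A1 matches only once hired_staff is added (2 + 1 = 3), B2 matches outright, and C3 is the one remaining discrepancy.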
