Group and count total of blanks and total of rows in pandas dataframe

I have the following dataframe:

  Country   Name  Code Signed  Index
0      CZ  Paulo     3      x      1
1      AE  Paulo   Yes   None      1
2      AE  Paulo   Yes   None      2
3      AE  Paulo     1    Yes      5
4      CZ  Paulo  None   None      6
5      DK  Paulo   Yes   None      9
6      DK  Paulo  None   None     20
7      PT  Paulo     2    Yes     20
8      PT  Paulo     1    Yes     22

After grouping by Country, I need the following new columns:

  1. the count of missing values in the Code column and in the Signed column
  2. the total of rows that have both values filled
  3. the total of rows that have the same Country value
  4. the rows, per Country, where either of those values is blank (list or non-list format), using the column "Index" as reference

If any Country has all of its Code and Signed rows filled, remove it from the dataframe.

In this case, it would return this dataframe:

  Country  Total_Blanks_on_Code  Total_Blanks_on_Signed Total_of_rows_with_both_values_filled  Total_of_rows_of_the_Country  Rows with any blank
0      CZ                     1                       1                                  None                             2                    6
1      AE                     2                       0                                     1                             3                   [1,2]
2      DK                     2                       1                                  None                             2                   [9,20]

Thank you for your help!

Here's a way to do what your question asks:

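# Assumes pandas is imported and df is the input frame shown below.
# True where both Code and Signed are present, None otherwise,
# so that count() later counts only the fully filled rows.
df['both_filled'] = (df.Code.notna() & df.Signed.notna()).map({True:True, False:None})
# Index value for rows with at least one blank, NaN elsewhere.
df['Rows_with_any_blank'] = df.Index[df['both_filled'].isna()]
gb = df.groupby('Country', sort=False)
# Non-null counts per column, with Rows_with_any_blank replaced by a list.
df2 = ( gb.count().assign(
    Rows_with_any_blank=gb['Rows_with_any_blank']
    .agg(lambda x: list(x.dropna().astype(int)))) )
# Name has no nulls, so its count is the number of rows per Country.
df2 = ( df2.assign(
    Total_Blanks_on_Code=df2.Name - df2.Code,
    Total_Blanks_on_Signed=df2.Name - df2.Signed)
    [df2.both_filled < df2.Name]
    [['Total_Blanks_on_Code','Total_Blanks_on_Signed',
        'both_filled','Name','Rows_with_any_blank']]
    .reset_index()
    .rename(columns={
        'Name':'Total_of_rows_of_the_Country',
        'both_filled':'Total_of_rows_with_both_values_filled'
    }) )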

Input:

  Country   Name  Code Signed  Index
0      CZ  Paulo     3      x      1
1      AE  Paulo   Yes   None      1
2      AE  Paulo   Yes   None      2
3      AE  Paulo     1    Yes      5
4      CZ  Paulo  None   None      6
5      DK  Paulo   Yes   None      9
6      DK  Paulo  None   None     20
7      PT  Paulo     2    Yes     20
8      PT  Paulo     1    Yes     22
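
To make the snippet above runnable, the input frame can be rebuilt by hand. This is a minimal sketch that assumes blanks are stored as None and that Code holds a mix of integers and strings, as the printed table suggests; the original dtypes are not stated in the question.

```python
import pandas as pd

# Rebuild the example input; None marks a blank cell.
# Code and Signed hold mixed values, so both columns get object dtype.
df = pd.DataFrame({
    "Country": ["CZ", "AE", "AE", "AE", "CZ", "DK", "DK", "PT", "PT"],
    "Name": ["Paulo"] * 9,
    "Code": [3, "Yes", "Yes", 1, None, "Yes", None, 2, 1],
    "Signed": ["x", None, None, "Yes", None, None, None, "Yes", "Yes"],
    "Index": [1, 1, 2, 5, 6, 9, 20, 20, 22],
})

# Quick sanity check: number of blanks per column.
print(df.isna().sum())
```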

Output:

  Country  Total_Blanks_on_Code  Total_Blanks_on_Signed  Total_of_rows_with_both_values_filled  Total_of_rows_of_the_Country Rows_with_any_blank
0      CZ                     1                       1                                      1                             2                 [6]
1      AE                     0                       2                                      1                             3              [1, 2]
2      DK                     1                       2                                      0                             2             [9, 20]

Explanation:

  • Create a both_filled column that is True if both Code and Signed are non-null and None otherwise (this lets us later use count() to count the rows having both columns non-null)
  • Create a Rows_with_any_blank column that holds the value in Index for rows where at least one of Code and Signed is null, and NaN elsewhere
  • Create a groupby object gb by Country
  • Use count() to get the number of non-null entries per group in each column of gb
  • Use assign() to overwrite the Rows_with_any_blank column with a list of the non-null Index values for each group
  • Use assign() to create and populate the columns Total_Blanks_on_Code and Total_Blanks_on_Signed
  • Keep only rows where the count in both_filled < the count in Name (which is the total number of rows for that Country in the original df); this removes any Country for which all Code and Signed rows are filled
  • Select the 5 desired columns in the specified order using [[]]
  • Use reset_index() to switch Country from the index to a column
  • Use rename() to change Name and both_filled to the specified labels Total_of_rows_of_the_Country and Total_of_rows_with_both_values_filled
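
The first and fourth steps rely on count() skipping nulls while size() counts every row. A small sketch of that difference on toy data (the toy frame and the flag column are made up for illustration):

```python
import pandas as pd

# Toy data (hypothetical): flag is True where a row is complete, None otherwise.
toy = pd.DataFrame({
    "Country": ["CZ", "CZ", "AE"],
    "flag": [True, None, None],
})
gb = toy.groupby("Country", sort=False)

filled = gb["flag"].count()  # non-null entries per group
total = gb["flag"].size()    # all rows per group, nulls included

print(filled.to_dict())
print(total.to_dict())
```

Here count() gives 1 for CZ and 0 for AE, while size() gives 2 and 1, which is why mapping False to None turns count() into a count of fully filled rows.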

Based on the definitions you gave, country AE should have 0 total blanks on Code, not 2.

Anyway, you can use the code below to get the output format you're looking for:

# The column is named 'Signed' in the frame shown in the question.
out = (df.assign(Total1 = df['Code'].isna(),
                 Total2 = df['Signed'].isna(),
                 Total3 = df['Code'].notna() & df['Signed'].notna())
        .groupby('Country', as_index=False)
        .agg(NumberOfCountries = ('Country','size'),
             Total1 = ('Total1','sum'),
             Total2 = ('Total2','sum'),
             Total3 = ('Total3','sum'))
        ).rename(columns={'Total1': 'Total Blanks on Code', 'Total2': 'Total Blanks on Signed',
                    'Total3': 'Total of rows with both values filled', 'NumberOfCountries': 'Total of rows of the Country'})

>>> print(out)

[Image: output of print(out)]
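
The named-aggregation approach covers the counts, but the question also asks to drop fully filled countries and to list the Index values of rows with blanks. A minimal sketch of one way to add both, assuming the column is named Signed and blanks are None; the helper column names (blanks_code, blanks_signed, both_filled, blank_index) are my own and not from the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["CZ", "AE", "AE", "AE", "CZ", "DK", "DK", "PT", "PT"],
    "Name": ["Paulo"] * 9,
    "Code": [3, "Yes", "Yes", 1, None, "Yes", None, 2, 1],
    "Signed": ["x", None, None, "Yes", None, None, None, "Yes", "Yes"],
    "Index": [1, 1, 2, 5, 6, 9, 20, 20, 22],
})

# True for rows where Code or Signed (or both) is blank.
any_blank = df["Code"].isna() | df["Signed"].isna()

grouped = (
    df.assign(
        blanks_code=df["Code"].isna(),
        blanks_signed=df["Signed"].isna(),
        both_filled=~any_blank,
        # Keep Index only where the row has a blank; NaN elsewhere.
        blank_index=df["Index"].where(any_blank),
    )
    .groupby("Country", sort=False)
    .agg(
        Total_Blanks_on_Code=("blanks_code", "sum"),
        Total_Blanks_on_Signed=("blanks_signed", "sum"),
        Total_of_rows_with_both_values_filled=("both_filled", "sum"),
        Total_of_rows_of_the_Country=("Index", "size"),
        Rows_with_any_blank=(
            "blank_index",
            lambda s: s.dropna().astype(int).tolist(),
        ),
    )
)

# Drop countries where every row has both values filled (PT here).
has_blank = (
    grouped["Total_of_rows_with_both_values_filled"]
    < grouped["Total_of_rows_of_the_Country"]
)
out = grouped[has_blank].reset_index()
print(out)
```

On the example data this keeps CZ, AE and DK, and the counts agree with the output of the first answer.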
