Group and count total of blanks and total of rows in pandas dataframe
I have the following dataframe:
   Country   Name  Code  Signed  Index
0       CZ  Paulo     3       x      1
1       AE  Paulo   Yes    None      1
2       AE  Paulo   Yes    None      2
3       AE  Paulo     1     Yes      5
4       CZ  Paulo  None    None      6
5       DK  Paulo   Yes    None      9
6       DK  Paulo  None    None     20
7       PT  Paulo     2     Yes     20
8       PT  Paulo     1     Yes     22
I need three new columns after grouping by Country. They should count the missing values in the Code and Signed columns. If any Country has all of its Code and Signed rows filled, remove it from the dataframe.
In this case, it would return this dataframe:
   Country  Total_Blanks_on_Code  Total_Blanks_on_Signed  Total_of_rows_with_both_values_filled  Total_of_rows_of_the_Country  Rows with any blank
0       CZ                     1                       1                                   None                             2                    6
1       AE                     2                       0                                      1                             3                [1,2]
2       DK                     2                       1                                   None                             2               [9,20]
Thank you for your help!
Here's a way to do what your question asks:
# Flag rows where both columns are filled: True if so, None otherwise
df['both_filled'] = (df.Code.notna() & df.Signed.notna()).map({True: True, False: None})
# Keep the Index value only for rows with at least one blank
df['Rows_with_any_blank'] = df.Index[df['both_filled'].isna()]
gb = df.groupby('Country', sort=False)
# Non-null counts per Country; Rows_with_any_blank becomes a list of Index values
df2 = (
    gb.count().assign(
        Rows_with_any_blank=gb['Rows_with_any_blank']
            .agg(lambda x: list(x.dropna().astype(int))))
)
df2 = (
    df2.assign(
        Total_Blanks_on_Code=df2.Name - df2.Code,
        Total_Blanks_on_Signed=df2.Name - df2.Signed)
    [df2.both_filled < df2.Name]  # drop countries with every row filled
    [['Total_Blanks_on_Code', 'Total_Blanks_on_Signed',
      'both_filled', 'Name', 'Rows_with_any_blank']]
    .reset_index()
    .rename(columns={
        'Name': 'Total_of_rows_of_the_Country',
        'both_filled': 'Total_of_rows_with_both_values_filled'
    })
)
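To see why False is mapped to None in the first line rather than left as False, here is a minimal sketch on a made-up boolean Series (the names both and flag are only for illustration). count() skips nulls, so it counts only the True entries; summing the raw booleans would give the same number.

```python
import pandas as pd

# Hypothetical flags: True where both columns are filled
both = pd.Series([True, False, True, False])

# Same mapping as in the answer: False becomes a null, True stays True
flag = both.map({True: True, False: None})

n_via_count = flag.count()  # count() ignores nulls
n_via_sum = both.sum()      # summing booleans counts the True values

print(n_via_count, n_via_sum)
```

Either approach should work; the count() version simply lets a single gb.count() call produce every total at once.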
Input:
   Country   Name  Code  Signed  Index
0       CZ  Paulo     3       x      1
1       AE  Paulo   Yes    None      1
2       AE  Paulo   Yes    None      2
3       AE  Paulo     1     Yes      5
4       CZ  Paulo  None    None      6
5       DK  Paulo   Yes    None      9
6       DK  Paulo  None    None     20
7       PT  Paulo     2     Yes     20
8       PT  Paulo     1     Yes     22
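If you want to run the snippet yourself, the input above can be rebuilt roughly like this. This is a sketch that assumes the blanks are real missing values (None) rather than the string "None", and that Code and Signed are stored as strings; adjust it if your data differs.

```python
import pandas as pd

# Assumption: None marks a genuinely missing value, not the text "None"
df = pd.DataFrame({
    "Country": ["CZ", "AE", "AE", "AE", "CZ", "DK", "DK", "PT", "PT"],
    "Name": ["Paulo"] * 9,
    "Code": ["3", "Yes", "Yes", "1", None, "Yes", None, "2", "1"],
    "Signed": ["x", None, None, "Yes", None, None, None, "Yes", "Yes"],
    "Index": [1, 1, 2, 5, 6, 9, 20, 20, 22],
})

print(df)
```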
Output:
   Country  Total_Blanks_on_Code  Total_Blanks_on_Signed  Total_of_rows_with_both_values_filled  Total_of_rows_of_the_Country  Rows_with_any_blank
0       CZ                     1                       1                                      1                             2                  [6]
1       AE                     0                       2                                      1                             3               [1, 2]
2       DK                     1                       2                                      0                             2              [9, 20]
Explanation:
Create a both_filled column which is True if both Code and Signed are non-null and None otherwise (this lets us use count() later to count the rows that have both columns non-null).
Create a Rows_with_any_blank column which contains the value in Index for rows where at least one of Code and Signed is null.
Create a groupby object gb by Country.
Use count() to get the number of non-null entries per group in each column of gb.
Use assign() to overwrite the Rows_with_any_blank column with a list of the non-null Index values for each group.
Use assign() to create and populate the columns Total_Blanks_on_Code and Total_Blanks_on_Signed.
Keep only the rows where the count in both_filled is < the count in Name (which is the total number of rows for that Country, since Name has no nulls); this removes any Country for which all Code and Signed rows are filled.
Use [[]] to select the 5 desired columns in the specified order.
Use reset_index() to switch Country from the index to a column.
Use rename() to change Name and both_filled to the specified labels Total_of_rows_of_the_Country and Total_of_rows_with_both_values_filled.

Based on the definitions/conditions you gave, the country AE should have a total of blanks on Code equal to 0, not 2.
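A quick way to check this, as a sketch using only the three AE rows from the question (the frame name ae is made up here, and None is assumed to be a real missing value):

```python
import pandas as pd

# The three AE rows from the question's dataframe
ae = pd.DataFrame({
    "Code": ["Yes", "Yes", "1"],
    "Signed": [None, None, "Yes"],
})

code_blanks = ae["Code"].isna().sum()
signed_blanks = ae["Signed"].isna().sum()
print(code_blanks, signed_blanks)  # prints: 0 2
```

So the 2 belongs to Signed, which suggests the two columns were swapped in the expected output.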
Anyway, you can use the code below to get the format of output you're looking for:
out = (
    df.assign(Total1=df['Code'].isna(),
              Total2=df['Signed'].isna(),
              Total3=df['Code'].notna() & df['Signed'].notna())
    .groupby('Country', as_index=False)
    .agg(NumberOfCountries=('Country', 'size'),
         Total1=('Total1', 'sum'),
         Total2=('Total2', 'sum'),
         Total3=('Total3', 'sum'))
    .rename(columns={'Total1': 'Total Blanks on Code',
                     'Total2': 'Total Blanks on Signed',
                     'Total3': 'Total of rows with both values filled',
                     'NumberOfCountries': 'Total of rows of the Country'})
)
print(out)
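The snippet above gives the per-country totals but keeps fully filled countries (PT here) and has no list of blank rows. One possible way to add both is sketched below. This is not the answerer's method: it assumes the column is named Signed and that blanks are real missing values, and the names flags, blank_rows, incomplete and summary are made up for this example.

```python
import pandas as pd

# Input rebuilt from the question (assumption: None is a real missing value)
df = pd.DataFrame({
    "Country": ["CZ", "AE", "AE", "AE", "CZ", "DK", "DK", "PT", "PT"],
    "Name": ["Paulo"] * 9,
    "Code": ["3", "Yes", "Yes", "1", None, "Yes", None, "2", "1"],
    "Signed": ["x", None, None, "Yes", None, None, None, "Yes", "Yes"],
    "Index": [1, 1, 2, 5, 6, 9, 20, 20, 22],
})

both_filled = df["Code"].notna() & df["Signed"].notna()

# Boolean helper columns, so that "sum" counts the True rows per country
flags = df.assign(
    code_blank=df["Code"].isna(),
    signed_blank=df["Signed"].isna(),
    both_filled=both_filled,
)

summary = (
    flags.groupby("Country", sort=False)
    .agg(
        Total_Blanks_on_Code=("code_blank", "sum"),
        Total_Blanks_on_Signed=("signed_blank", "sum"),
        Total_of_rows_with_both_values_filled=("both_filled", "sum"),
        Total_of_rows_of_the_Country=("Index", "size"),
    )
    .reset_index()
)

# Index values of rows with at least one blank, collected per country
blank_rows = df.loc[~both_filled].groupby("Country")["Index"].agg(list)

# Drop countries where every row has both values filled, then attach the lists
incomplete = (
    summary["Total_of_rows_with_both_values_filled"]
    < summary["Total_of_rows_of_the_Country"]
)
summary = (
    summary[incomplete]
    .assign(Rows_with_any_blank=lambda d: d["Country"].map(blank_rows))
    .reset_index(drop=True)
)
print(summary)
```

With the sample data this should keep CZ, AE and DK and drop PT, matching the totals in the first answer's output.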