
Dataframe in pandas with certain conditions

I am trying to combine features in a dataframe to derive new columns in that dataframe.

I have this dataframe:

Id   Author   News_post  Label
1    Jessica  xxxxxxxxx  1
2    Adams    xxxxxxxxx  1
3    Adams    xxxxxxxxx  1
4    Mike     xxxxxxxxx  0
5    James    xxxxxxxxx  1
6    Mike     xxxxxxxxx  1
7    Mike     xxxxxxxxx  0
8    Paul     xxxxxxxxx  0
9    Jessica  xxxxxxxxx  0
10   Adams    xxxxxxxxx  0

NB: in the Label column, 1 = TRUE and 0 = FALSE.

I want to derive a dataframe like this:

Id   Author   Num_Post  Num_True_Label  Num_False_Label   Mean
1    Adams    3         2               1                 x
2    James    1         1               0                 x
3    Jessica  2         1               1                 x
4    Mike     3         1               2                 x
5    Paul     1         0               1                 x

This may give you several of the things you are trying to get:

import pandas as pd

df = pd.read_clipboard()  # reads the dataframe copied from the question
df = df.groupby('Author').describe()  # summary statistics per author

Output:

           Id                                               Label                                               
        count      mean       std  min  25%  50%  75%   max count      mean       std  min   25%  50%   75%  max
Author                                                                                                          
Adams     3.0  5.000000  4.358899  2.0  2.5  3.0  6.5  10.0   3.0  0.666667  0.577350  0.0  0.50  1.0  1.00  1.0
James     1.0  5.000000       NaN  5.0  5.0  5.0  5.0   5.0   1.0  1.000000       NaN  1.0  1.00  1.0  1.00  1.0
Jessica   2.0  5.000000  5.656854  1.0  3.0  5.0  7.0   9.0   2.0  0.500000  0.707107  0.0  0.25  0.5  0.75  1.0
Mike      3.0  5.666667  1.527525  4.0  5.0  6.0  6.5   7.0   3.0  0.333333  0.577350  0.0  0.00  0.0  0.50  1.0
Paul      1.0  8.000000       NaN  8.0  8.0  8.0  8.0   8.0   1.0  0.000000       NaN  0.0  0.00  0.0  0.00  0.0
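
If only the Label statistics matter, describe can be restricted to that column. The following is a minimal sketch, assuming the question's data is rebuilt by hand instead of read from the clipboard; the name summary is my own.

```python
import pandas as pd

# The question's dataframe, rebuilt by hand
df = pd.DataFrame({
    'Id': range(1, 11),
    'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James',
               'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
    'News_post': ['xxxxxxxxx'] * 10,
    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# Describe only the Label column, then keep the two statistics of interest:
# 'count' is the number of posts, 'mean' is the share of true labels
summary = df.groupby('Author')['Label'].describe()[['count', 'mean']]
print(summary)
```

Here count corresponds to the number of posts per author and mean to the fraction of posts labelled 1.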

The following will get you what you need:

In [1]: import pandas as pd                                                                                                                                                                                                                  

In [2]: df = pd.DataFrame({'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James', 'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
   ...:                    'News_post': ['xxxxxxxxx'] * 10,
   ...:                    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]})

In [3]: # Select 'Label' before summing so that the text column is not summed too
   ...: num_true_label_df = df.groupby(by=['Author'])[['Label']].sum().rename(columns={'Label': 'Num_True_Label'}).reset_index()

In [4]: num_post_df = df.groupby(by=['Author']).count().rename(columns={'News_post': 'Num_Post'})[['Num_Post']].reset_index()                                                                                                                

In [5]: df = pd.merge(num_post_df, num_true_label_df, how='left', on='Author').reset_index().rename(columns={'index': 'Id'})

In [6]: df['Id'] = df['Id'] + 1

In [7]: df['Num_False_Label'] = df['Num_Post'] - df['Num_True_Label']

In [8]: df
Out[8]: 
   Id   Author  Num_Post  Num_True_Label  Num_False_Label
0   1    Adams         3               2                1
1   2    James         1               1                0
2   3  Jessica         2               1                1
3   4     Mike         3               1                2
4   5     Paul         1               0                1


Please specify what your Mean column should represent.

Some resources which might be helpful:
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

Using pandas 0.25 or later with named aggregation (aggregation relabeling):

df.groupby('Author')['Label'].agg(Num_Post = 'size',
                                  Num_True = 'sum',
                                  Num_False = lambda x: x.eq(0).sum(),
                                  Mean = 'mean')

Output:

         Num_Post  Num_True  Num_False      Mean
Author                                          
Adams           3         2          1  0.666667
James           1         1          0  1.000000
Jessica         2         1          1  0.500000
Mike            3         1          2  0.333333
Paul            1         0          1  0.000000
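
To get closer to the layout asked for in the question, the aggregated result can be given an Id column. This is only a sketch: it rebuilds the question's data by hand, uses the question's column names, and assumes (my choice, not stated in the question) that Id simply numbers the authors in alphabetical order.

```python
import pandas as pd

df = pd.DataFrame({
    'Author': ['Jessica', 'Adams', 'Adams', 'Mike', 'James',
               'Mike', 'Mike', 'Paul', 'Jessica', 'Adams'],
    'News_post': ['xxxxxxxxx'] * 10,
    'Label': [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

result = (
    df.groupby('Author')['Label']
      .agg(Num_Post='size',
           Num_True_Label='sum',
           Num_False_Label=lambda x: x.eq(0).sum(),
           Mean='mean')
      .reset_index()  # turn the Author index back into a column
)

# Assumption: Id numbers the authors 1..n in the (sorted) group order
result.insert(0, 'Id', range(1, len(result) + 1))
print(result)
```

groupby sorts the group keys by default, which is why the authors come out alphabetically.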

Use transform and then remove the duplicates:

df['Num_Post']= df.groupby(['Author'])['Label'].transform('count')
df['Num_True_Label']= df.groupby(['Author'])['Label'].transform('sum')
df['Num_False_Label']= df['Num_Post']-df['Num_True_Label']
df['Mean']= df['Num_Post']/df['Num_True_Label']

Finally, drop the News_post column and the duplicate rows:

df.drop(columns=['News_post'], inplace=True)
df.drop_duplicates(subset='Author', keep='first').sort_values(by=['Author'])

Result:

   Id   Author  Label  Num_Post  Num_True_Label  Num_False_Label      Mean
1   2    Adams      1         3               2                1  1.500000
4   5    James      1         1               1                0  1.000000
0   1  Jessica      1         2               1                1  2.000000
3   4     Mike      0         3               1                2  3.000000
7   8     Paul      0         1               0                1       inf

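Note: Mean is computed here as Num_Post / Num_True_Label, which gives inf for an author with no true labels (Paul). Change the formula to match your own definition of Mean.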
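
If the Num_Post / Num_True_Label ratio is kept, one possible way to avoid inf is to mask zero denominators before dividing, so that the result is NaN instead. A small sketch, using the per-author counts from the result above; the names stats and denominator are hypothetical.

```python
import pandas as pd

# Per-author counts, copied from the result table above
stats = pd.DataFrame({
    'Author': ['Adams', 'James', 'Jessica', 'Mike', 'Paul'],
    'Num_Post': [3, 1, 2, 3, 1],
    'Num_True_Label': [2, 1, 1, 1, 0],
})

# where() keeps values that satisfy the condition and sets the rest to NaN,
# so authors with zero true labels get a missing denominator
denominator = stats['Num_True_Label'].where(stats['Num_True_Label'] != 0)

# Dividing by NaN gives NaN rather than inf
stats['Mean'] = stats['Num_Post'] / denominator
print(stats)
```

Whether NaN, 0 or something else is the right value for such authors depends on what Mean is meant to represent.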

You could try:

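# Named aggregation (pandas 0.25+); passing a renaming dict to agg on a
# grouped Series raises SpecificationError since pandas 1.0
agg_df = df.groupby('Author')['Label'].agg(
    Num_Post='count',
    Num_True_Label=lambda x: x.eq(1).sum(),
    Num_False_Label=lambda x: x.eq(0).sum(),
    Mean='mean',
).reset_index()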
