
Create new columns from aggregated categories 2

+-------+------------+---------------+-----------------+
| INDEX | SK_ID_CURR | CREDIT_ACTIVE | CREDIT_TYPE     |
+-------+------------+---------------+-----------------+
|     0 |     215354 | Closed        | Consumer credit |
+-------+------------+---------------+-----------------+
|     1 |     215354 | Active        | Credit card     |
+-------+------------+---------------+-----------------+
|     2 |     215354 | Active        | Consumer credit |
+-------+------------+---------------+-----------------+
|     3 |     215354 | Active        | Credit card     |
+-------+------------+---------------+-----------------+
|     4 |     215354 | Active        | Consumer credit |
+-------+------------+---------------+-----------------+
|     5 |     215354 | Active        | Credit card     |
+-------+------------+---------------+-----------------+
|     6 |     215354 | Active        | Consumer credit |
+-------+------------+---------------+-----------------+
|     7 |     162297 | Closed        | Consumer credit |
+-------+------------+---------------+-----------------+
|     8 |     162297 | Closed        | Consumer credit |
+-------+------------+---------------+-----------------+
|     9 |     162297 | Active        | Credit card     |
+-------+------------+---------------+-----------------+
|    10 |     162297 | Active        | Credit card     |
+-------+------------+---------------+-----------------+
|    11 |     162297 | Closed        | Consumer credit |
+-------+------------+---------------+-----------------+
|    12 |     162297 | Active        | Mortgage        |
+-------+------------+---------------+-----------------+
|    13 |     402440 | Active        | Consumer credit |
+-------+------------+---------------+-----------------+
|    14 |     238881 | Closed        | Credit card     |
+-------+------------+---------------+-----------------+

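I have the table above. I want to aggregate each column per ID. For example, for each SK_ID_CURR I need to count the active and the closed credits, and then create the columns active_credits and closed_credits holding those counts. The same goes for CREDIT_TYPE.

Like this: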

SK_ID_CURR  CREDIT_ACTIVE  CREDIT_CLOSED  CONSUMER_CREDIT  CREDIT_CARD
    215354              6              1                4            3

For this DataFrame:

import pandas as pd

d = {'SK_ID_CURR': [215354, 215354, 215354, 215354, 215354, 215354, 215354, 162297, 162297, 162297, 162297, 162297, 162297, 402440, 238881],
     'CREDIT_ACTIVE': ['Closed', 'Active', 'Active', 'Active', 'Active', 'Active', 'Active', 'Closed', 'Closed', 'Active', 'Active', 'Closed', 'Active', 'Active', 'Closed'],
     'CREDIT_TYPE': ['Consumer credit', 'Credit card', 'Consumer credit', 'Credit card', 'Consumer credit', 'Credit card', 'Consumer credit', 'Consumer credit', 'Consumer credit', 'Credit card', 'Credit card', 'Consumer credit', 'Mortgage', 'Consumer credit', 'Credit card']}
df = pd.DataFrame(d)

print(df)

Output:

    SK_ID_CURR CREDIT_ACTIVE      CREDIT_TYPE
0       215354        Closed  Consumer credit
1       215354        Active      Credit card
2       215354        Active  Consumer credit
3       215354        Active      Credit card
4       215354        Active  Consumer credit
5       215354        Active      Credit card
6       215354        Active  Consumer credit
7       162297        Closed  Consumer credit
8       162297        Closed  Consumer credit
9       162297        Active      Credit card
10      162297        Active      Credit card
11      162297        Closed  Consumer credit
12      162297        Active         Mortgage
13      402440        Active  Consumer credit
14      238881        Closed      Credit card

You can try something like this:

# Named aggregation: each keyword names an output column as (source column, function).
# Nested dicts for renaming are rejected by pandas 1.0 and later.
temp = df.groupby('SK_ID_CURR').agg(
    CREDIT_ACTIVE=('CREDIT_ACTIVE', lambda x: (x == 'Active').sum()),
    CREDIT_CLOSED=('CREDIT_ACTIVE', lambda x: (x == 'Closed').sum()),
    CONSUMER_CREDIT=('CREDIT_TYPE', lambda x: (x == 'Consumer credit').sum()),
    CREDIT_CARD=('CREDIT_TYPE', lambda x: (x == 'Credit card').sum()),
).reset_index()

print(temp)

Output:

   SK_ID_CURR  CREDIT_ACTIVE  CREDIT_CLOSED  CONSUMER_CREDIT  CREDIT_CARD
0      162297              3              3                3            2
1      215354              6              1                4            3
2      238881              0              1                0            1
3      402440              1              0                1            0
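
As a hand check of the per-ID counts, here is a minimal sketch that tallies the labels for a single ID. The frame name `one_id` is only illustrative; its seven rows are copied from the table in the question for SK_ID_CURR 215354.

```python
import pandas as pd

# The seven rows for SK_ID_CURR 215354, copied from the question's table.
# `one_id` is an illustrative name, not part of the original post.
one_id = pd.DataFrame({
    'CREDIT_ACTIVE': ['Closed', 'Active', 'Active', 'Active',
                      'Active', 'Active', 'Active'],
    'CREDIT_TYPE': ['Consumer credit', 'Credit card', 'Consumer credit',
                    'Credit card', 'Consumer credit', 'Credit card',
                    'Consumer credit'],
})

# value_counts tallies each label once; the grouped aggregation
# performs the same tally separately for every ID.
status_counts = one_id['CREDIT_ACTIVE'].value_counts()
type_counts = one_id['CREDIT_TYPE'].value_counts()

print(status_counts.to_dict())
print(type_counts.to_dict())
```

The tallies should agree with the row the question asks for (6 active, 1 closed, 4 consumer credits, 3 credit cards).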

Another way, perhaps a bit tedious, but it may bring in something different.

# Split the rows by status and by credit type.
creditClosed = df[df.CREDIT_ACTIVE == 'Closed']
creditOpened = df[df.CREDIT_ACTIVE == 'Active']
creditTypeCard = df[df.CREDIT_TYPE == 'Credit card']
creditTypeConsumer = df[df.CREDIT_TYPE == 'Consumer credit']

# Count the rows per ID in each subset.
a = creditClosed.groupby(['SK_ID_CURR']).agg({'CREDIT_ACTIVE':'count'}).reset_index()
b = creditOpened.groupby(['SK_ID_CURR']).agg({'CREDIT_ACTIVE':'count'}).reset_index()
c = creditTypeCard.groupby(['SK_ID_CURR']).agg({'CREDIT_TYPE':'count'}).reset_index()
d = creditTypeConsumer.groupby(['SK_ID_CURR']).agg({'CREDIT_TYPE':'count'}).reset_index()

# Outer merges keep IDs that are missing from a subset; their counts become NaN.
ab = pd.merge(a, b, how = 'outer', on = 'SK_ID_CURR')
abc = pd.merge(ab, c, how = 'outer', on = 'SK_ID_CURR')
final = pd.merge(abc, d, how = 'outer', on = 'SK_ID_CURR')

final.rename(columns = {'CREDIT_ACTIVE_x': 'CREDIT_CLOSED', 'CREDIT_ACTIVE_y': 'CREDIT_ACTIVE', 'CREDIT_TYPE_x': 'CREDIT_CARD', 'CREDIT_TYPE_y': 'CONSUMER_CREDIT'}, inplace = True)
# fillna returns a new DataFrame, so assign it; NaN made the counts float, so cast back to int.
final = final.fillna(0).astype(int)
final = final[['SK_ID_CURR', 'CREDIT_ACTIVE', 'CREDIT_CLOSED', 'CONSUMER_CREDIT', 'CREDIT_CARD']]
print(final)

Output:

   SK_ID_CURR  CREDIT_ACTIVE  CREDIT_CLOSED  CONSUMER_CREDIT  CREDIT_CARD
0      162297              3              3                3            2
1      215354              6              1                4            3
2      238881              0              1                0            1
3      402440              1              0                1            0
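
The merge-based approach depends on how an outer merge treats IDs that appear on only one side. A minimal sketch of that behaviour, using two small hand-built count tables (the names `closed`, `active`, `merged` and `filled` are illustrative; the counts are the ones for these IDs in the question's data):

```python
import pandas as pd

# Hand-built per-ID counts: 238881 has only closed credits,
# 402440 has only active ones. Names are illustrative.
closed = pd.DataFrame({'SK_ID_CURR': [215354, 238881], 'CREDIT_CLOSED': [1, 1]})
active = pd.DataFrame({'SK_ID_CURR': [215354, 402440], 'CREDIT_ACTIVE': [6, 1]})

# how='outer' keeps every ID from both sides; a count that is
# missing on one side shows up as NaN.
merged = pd.merge(closed, active, how='outer', on='SK_ID_CURR')
print(merged)

# fillna returns a new DataFrame rather than changing `merged`,
# so the result has to be assigned to be kept.
filled = merged.fillna(0).astype(int)
print(filled)
```

An inner merge would silently drop 238881 and 402440 here, which is presumably why the answer uses how='outer'.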

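You can use pd.get_dummies(df.drop(columns=['SK_ID_CURR'])) to generate dummy (indicator) columns, one for each label in CREDIT_ACTIVE and CREDIT_TYPE.

Concatenate the result with the SK_ID_CURR column, then group by SK_ID_CURR. After that, aggregate the data by sum using agg([sum]). Finally, it is a matter of renaming the columns meaningfully.

Sample code in Python using pandas: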

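# One indicator column per label: CREDIT_ACTIVE_Active, CREDIT_ACTIVE_Closed, CREDIT_TYPE_Consumer credit, ...
a = pd.get_dummies(df.drop(columns=['SK_ID_CURR']))
a = pd.concat([df.SK_ID_CURR, a], axis=1)
# Summing the indicators per ID gives the counts; recent pandas versions warn about
# the builtin sum here, so the string 'sum' is passed instead.
b = a.groupby(a.SK_ID_CURR).agg(['sum'])
# The new names follow the order of the dummy columns, which is alphabetical within each source column.
b.columns = ['CREDIT_Active','CREDIT_Closed', 'Consumer_Credit', 'Credit_Card','Credit_Mortgage']
b.reset_index(inplace=True)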
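
To make the intermediate step visible, here is a minimal sketch on three rows taken from the question's data (the names `small`, `dummies` and `counts` are illustrative). It shows the column names get_dummies produces and how a grouped sum turns the indicators into counts.

```python
import pandas as pd

# Three rows taken from the question's data; `small` is an illustrative name.
small = pd.DataFrame({
    'SK_ID_CURR': [215354, 215354, 162297],
    'CREDIT_ACTIVE': ['Closed', 'Active', 'Active'],
    'CREDIT_TYPE': ['Consumer credit', 'Credit card', 'Mortgage'],
})

# One indicator column per label, named <column>_<label>.
dummies = pd.get_dummies(small.drop(columns=['SK_ID_CURR']))
print(list(dummies.columns))

# Grouping by the ID column (aligned on the index) and summing
# the indicators yields one count per ID and label.
counts = dummies.groupby(small['SK_ID_CURR']).sum()
print(counts)
```

Because the dummy columns come out in a fixed order, the hand-written list of new column names in the answer has to follow that same order.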

After constructing a helper column, you can join a couple of pd.crosstab results.

Data from @AllaTarighati.

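import numpy as np

# Helper column: any type containing "credit" becomes 'Credit', everything else is labelled 'Mortgage'.
df['TYPE'] = np.where(df['CREDIT_TYPE'].str.contains('credit', case=False, na=False),
                      'Credit', 'Mortgage')

# Counts per ID of each TYPE/status pair, e.g. Credit_Active.
cross1 = pd.crosstab(df['SK_ID_CURR'], df['TYPE'] + '_' + df['CREDIT_ACTIVE'])
# Counts per ID of each raw credit type.
cross2 = pd.crosstab(df['SK_ID_CURR'], df['CREDIT_TYPE'])
res = cross1.join(cross2)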

Result:

print(res)

            Credit_Active  Credit_Closed  Mortgage_Active  Consumer credit  \
SK_ID_CURR                                                                   
162297                  2              3                1                3   
215354                  6              1                0                4   
238881                  0              1                0                0   
402440                  1              0                0                1   

            Credit card  Mortgage  
SK_ID_CURR                         
162297                2         1  
215354                3         0  
238881                1         0  
402440                0         0  
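
For readers new to pd.crosstab, a minimal sketch of a single cross-tabulation on four rows (the names `small` and `table` are illustrative; the values are a subset of the question's data):

```python
import pandas as pd

# Four rows, a subset of the question's data; names are illustrative.
small = pd.DataFrame({
    'SK_ID_CURR': [215354, 215354, 162297, 162297],
    'CREDIT_ACTIVE': ['Closed', 'Active', 'Active', 'Active'],
})

# Rows are the unique IDs, columns the unique labels, and each
# cell counts how often that ID/label pair occurs (0 if never).
table = pd.crosstab(small['SK_ID_CURR'], small['CREDIT_ACTIVE'])
print(table)
```

Since both crosstab results are indexed by SK_ID_CURR, join can line them up by ID without an explicit key.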
