
How to select and combine different columns based on a specific condition in pandas (Python)?

df = pd.DataFrame(data={
    "id": ['a', 'a', 'b', 'b', 'a', 'c', 'c', 'b'],
    "transaction_amount": [110, 0, 10, 30, 40.4, 62.2, 20, 20],
    "principal_amount":   [100, 0, 0,  0,  40,   60,   0,  0],
    "interest_amount":    [10,  0, 10, 0,  0.4,  0.6,  10, 0],
    "overpayment_amount": [0,   0, 0,  0,  0,    1.6,  10, 20],
})

I have the dataframe above. I want a single amount column, populated as follows:

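  • For each of principal_amount, interest_amount, and overpayment_amount whose value is not 0, create a row and assign principal, interest, or overpayment respectively to a new column transaction_type.
  • If all three of those columns are 0 for a row, take the value from transaction_amount instead.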
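
To pin the rule down, here is a minimal plain-Python sketch of it for a single record. The helper name split_row and the dict-based row are hypothetical and only for illustration. Skipping rows where transaction_amount is also 0 is an assumption taken from the condition used in the solution further down.

```python
def split_row(row):
    """Return (amount, transaction_type) pairs for one record given as a dict."""
    parts = ["principal", "interest", "overpayment"]
    # One entry per component column whose value is not 0
    pairs = [(row[f"{p}_amount"], p) for p in parts if row[f"{p}_amount"] != 0]
    # Fall back to transaction_amount only when every component is 0
    # (assumption: rows that are 0 everywhere produce no output at all)
    if not pairs and row["transaction_amount"] != 0:
        pairs = [(row["transaction_amount"], None)]
    return pairs


# Row 0 and row 3 of the dataframe above, written out as dicts
row0 = {"transaction_amount": 110, "principal_amount": 100,
        "interest_amount": 10, "overpayment_amount": 0}
row3 = {"transaction_amount": 30, "principal_amount": 0,
        "interest_amount": 0, "overpayment_amount": 0}

print(split_row(row0))
print(split_row(row3))
```

Row 0 yields a principal entry and an interest entry, while row 3 yields a single entry with no transaction type.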

The output should look like this:

   amount transaction_type id
3    30.0              NaN  b
0   100.0        principal  a
4    40.0        principal  a
5    60.0        principal  c
0    10.0         interest  a
2    10.0         interest  b
4     0.4         interest  a
5     0.6         interest  c
6    10.0         interest  c
5     1.6      overpayment  c
6    10.0      overpayment  c
7    20.0      overpayment  b

My current solution:

import pandas as pd

df = pd.DataFrame(data={
    "id": ['a', 'a', 'b', 'b', 'a', 'c', 'c', 'b'],
    "transaction_amount": [110, 0, 10, 30, 40.4, 62.2, 20, 20],
    "principal_amount":   [100, 0, 0,  0,  40,   60,   0,  0],
    "interest_amount":    [10,  0, 10, 0,  0.4,  0.6,  10, 0],
    "overpayment_amount": [0,   0, 0,  0,  0,    1.6,  10, 20],
})

# Collect the partial frames and concatenate once at the end
# (DataFrame.append was removed in pandas 2.0)
frames = []

# Add transaction amount for rows where all three components are 0
condition = (df["principal_amount"] == 0) & (df["interest_amount"] == 0) & (df["overpayment_amount"] == 0) & (df["transaction_amount"] != 0)
subdf = df.loc[condition, ['id', 'transaction_amount']]
subdf = subdf.rename(columns={'transaction_amount': "amount"})
frames.append(subdf)

# Add principal, interest and overpayment
for field in ["principal_amount", "interest_amount", "overpayment_amount"]:
    # .copy() avoids SettingWithCopyWarning when adding a column to a slice
    subdf = df.loc[df[field] != 0, ['id', field]].copy()
    subdf["transaction_type"] = field.split("_")[0]
    subdf = subdf.rename(columns={field: "amount"})
    frames.append(subdf)

output_df = pd.concat(frames)[["amount", "transaction_type", "id"]]

Is there any pandas functionality that would let me do this more concisely and efficiently?

One approach could be as follows.

import pandas as pd
import numpy as np

out = df.reset_index(drop=False).melt(
    id_vars=['index'], 
    value_vars=list(df.columns)[1:], 
    var_name='transaction_type', 
    value_name='amount'
    ).set_index('index')

out = out[out['amount'].gt(0)]
out['v'] = out.index.value_counts()

out = out[out.v.eq(1) | 
          out.transaction_type.ne('transaction_amount')].drop('v', axis=1)

out['transaction_type'] = out['transaction_type']\
    .str.replace('_amount','').replace({'transaction':np.nan})

out = out.iloc[:,::-1]
out.index.name=None
out['id'] = df['id']

print(out)

   amount transaction_type id
3    30.0              NaN  b
0   100.0        principal  a
4    40.0        principal  a
5    60.0        principal  c
0    10.0         interest  a
2    10.0         interest  b
4     0.4         interest  a
5     0.6         interest  c
6    10.0         interest  c
5     1.6      overpayment  c
6    10.0      overpayment  c
7    20.0      overpayment  b

Explanation:

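  • We use df.melt to move all column names (from the second column onward) and their amounts into two separate columns, making sure the original index values are kept as well (reset the index first, then set it again from the "index" column).
  • We keep only the rows where amount > 0 by applying Series.gt to amount.
  • We create a temporary column that stores the result of Index.value_counts applied to the index. Every index value with a count of 1 only has a value associated with transaction_amount (this assumes transaction_amount is non-zero whenever any of the three components is).
  • We use this information for another filter: keep only the rows where out['v'].eq(1) or where transaction_type is not 'transaction_amount'. After that, the temporary column can be dropped again.
  • Finally, we strip "_amount" from the transaction_type column and replace "transaction" with NaN. The last cosmetic steps are putting the columns in the requested order, removing the index name, and adding id as an extra column.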
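
The first four steps can be traced on a much smaller frame. The sketch below uses a hypothetical two-row frame named small (not from the question), where row 0 has a principal/interest breakdown and row 1 has only a total. It also makes the index alignment explicit with Index.map, where the code above relies on pandas aligning the assigned Series automatically.

```python
import pandas as pd

# Hypothetical two-row frame: row 0 has a breakdown, row 1 has only a total
small = pd.DataFrame({
    "id": ["a", "b"],
    "transaction_amount": [110, 30],
    "principal_amount": [100, 0],
    "interest_amount": [10, 0],
})

# Step 1: melt to long format while carrying the original index along
long = small.reset_index().melt(
    id_vars=["index"],
    value_vars=list(small.columns)[1:],
    var_name="transaction_type",
    value_name="amount",
).set_index("index")

# Step 2: keep only positive amounts
long = long[long["amount"].gt(0)].copy()

# Step 3: count how many entries survive per original row
counts = long.index.value_counts()
long["v"] = long.index.map(counts)

# Step 4: keep the total only where it is the sole surviving entry
kept = long[long["v"].eq(1) | long["transaction_type"].ne("transaction_amount")]
print(kept)
```

Original row 0 has three surviving entries, so its transaction_amount entry is dropped and only principal and interest remain. Original row 1 has a single surviving entry, so its transaction_amount is kept.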
