Pandas / PySpark: subtracting columns in a for loop

Question

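I am looking for a way to subtract voucher values from the credit balance in a DataFrame.

There is a "credit" column that should be matched against the vouchers used: "v1", "v2", etc.

Current data:

Desired result:

So the vouchers should be covered from the most recent to the oldest, that is, from voucher 3 to voucher 1.

The credit column should try to cover the vouchers (from 3 to 1). If the credit exceeds the vouchers, the remaining credit should be stored in the credit column.

I am using a Python notebook with the pandas and PySpark libraries.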

Answer 1

One of the many ways to achieve this is to use pandas apply.
See if this helps:

import pandas as pd

data = {
    "name": ["tom", "jim"],
    "cummulative": [17, 15],
    "voucher1": [10, 0],
    "voucher2": [5, 5],
    "voucher3": [2, 10],
    "credit": [20, 10],
}
df = pd.DataFrame(data)


def change_order(row):
    new_dict = row.to_dict()
    credit = row.credit
    cummulative = row.cummulative
    # Walk from the most recent voucher (3) back to the oldest (1).
    for i in range(3, 0, -1):
        current = row[f"voucher{i}"]
        # Clear a voucher only when the remaining credit covers it in full.
        if credit >= current:
            credit -= current
            cummulative -= current
            new_dict[f"voucher{i}"] = 0
    new_dict["credit"] = credit
    new_dict["cummulative"] = cummulative
    return pd.Series(new_dict)


df = df.apply(change_order, axis=1)
print(df)

Output:

  name  cummulative  voucher1  voucher2  voucher3  credit
0  tom            0         0         0         0       3
1  jim            5         0         5         0       0
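
Row-wise apply can get slow on large frames. As a rough sketch only, the same loop can run over the voucher columns instead of over the rows. It assumes the same all-or-nothing rule as change_order (a voucher is cleared only when the remaining credit covers it in full); the helper name cover_vouchers and its n_vouchers parameter are made up for illustration.

```python
import pandas as pd


def cover_vouchers(df, n_vouchers=3):
    """Hypothetical helper: apply credit to voucher{n}..voucher1, column by column."""
    out = df.copy()
    for i in range(n_vouchers, 0, -1):
        col = f"voucher{i}"
        # Rows where the remaining credit covers this voucher in full.
        covered = out["credit"] >= out[col]
        # Amount paid off per row: the voucher value if covered, else 0.
        amount = out[col].where(covered, 0)
        out["credit"] -= amount
        out["cummulative"] -= amount
        out[col] -= amount
    return out


df = pd.DataFrame({
    "name": ["tom", "jim"],
    "cummulative": [17, 15],
    "voucher1": [10, 0],
    "voucher2": [5, 5],
    "voucher3": [2, 10],
    "credit": [20, 10],
})
result = cover_vouchers(df)
print(result)
```

On the sample data this should give the same frame as the apply version, and the numeric columns keep their integer dtype because no row is ever converted to a generic Series.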

Answer 2

This was actually a very nice exercise.

Here is something that works in PySpark. It feels like there should be a better way, though...

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('tom', 17, 10, 5, 2, 20),
     ('jim', 15, 0, 5, 10, 10)],
    ['name', 'cumulative_vouchers_used', 'voucher1', 'voucher2', 'voucher3', 'credit'])

Script:

c, v1, v2, v3 = F.col('credit'), F.col('voucher1'), F.col('voucher2'), F.col('voucher3')

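# Apply the credit to voucher 3 first, then 2, then 1. A voucher is cleared
# only if the credit left at that point covers it in full. The same condition
# drives both the credit and the voucher column, so the two stay consistent
# even when an earlier voucher is 0.
subt_v3 = F.when(c >= v3, c - v3).otherwise(c)
new_v3 = F.when(c >= v3, 0).otherwise(v3)
subt_v2 = F.when(subt_v3 >= v2, subt_v3 - v2).otherwise(subt_v3)
new_v2 = F.when(subt_v3 >= v2, 0).otherwise(v2)
subt_v1 = F.when(subt_v2 >= v1, subt_v2 - v1).otherwise(subt_v2)
new_v1 = F.when(subt_v2 >= v1, 0).otherwise(v1)
# Whatever is still outstanding across the three vouchers.
new_cum = new_v1 + new_v2 + new_v3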

df = df.select(
    'name',
    new_cum.alias('cumulative_vouchers_used'),
    new_v1.alias('voucher1'),
    new_v2.alias('voucher2'),
    new_v3.alias('voucher3'),
    subt_v1.alias('credit')
)

df.show()
# +----+------------------------+--------+--------+--------+------+
# |name|cumulative_vouchers_used|voucher1|voucher2|voucher3|credit|
# +----+------------------------+--------+--------+--------+------+
# | tom|                       0|       0|       0|       0|     3|
# | jim|                       5|       0|       5|       0|     0|
# +----+------------------------+--------+--------+--------+------+
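
Both answers hard-code the rule for three vouchers, so a small plain-Python reference can be handy for checking edge cases (a zero-valued voucher, credit that covers an older voucher but not a newer one) before running anything on a cluster. This is only a sketch: the function name settle is made up, and it assumes the all-or-nothing reading of the question, where a voucher that cannot be covered in full is left untouched.

```python
def settle(credit, vouchers):
    """Apply credit to vouchers from the last (most recent) to the first.

    A voucher is cleared only if the remaining credit covers it in full;
    otherwise it is left as-is and the next older voucher is tried.
    Returns (remaining_credit, new_vouchers).
    """
    remaining = credit
    result = list(vouchers)
    for i in range(len(result) - 1, -1, -1):
        if remaining >= result[i]:
            remaining -= result[i]
            result[i] = 0
    return remaining, result


print(settle(20, [10, 5, 2]))   # tom: vouchers listed as voucher1..voucher3
print(settle(10, [0, 5, 10]))   # jim
```

For the two sample rows this returns (3, [0, 0, 0]) and (0, [0, 5, 0]), matching the credit and voucher columns in the tables above.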
