How to find the top column values of each row in a pandas dataframe
How to determine top 3 column values for each row
I have a dataframe in the following format:
| ID | Payer | Payee | Mode1 | Probability1 | Mode2 | Probability2 | Mode3 | Probability3 | Mode4 | Probability4 | Month |
|----|-------|-------|-------|--------------|-------|--------------|--------|--------------|--------|--------------|--------|
| 1 | xyz | wqu | cash | 0.16 | wire | 0.89 | upi | 0.81 | cheque | 0.69 | 201801 |
| 2 | wqu | xyz | wire | 0.28 | cash | 0.19 | upi | 0.77 | cheque | 0.58 | 201801 |
| 3 | pqr | xyz | upi | 0.35 | cash | 0.11 | cheque | 0.48 | wire | 0.66 | 201803 |
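Each Probability column holds the value for the Mode column with the same number (Probability1 belongs to Mode1, and so on).
Now I want to get the top 3 probability values across these columns for each row,
something like this: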
| ID | Payer | Payee | Mode1 | Probability1 | Mode2 | Probability2 | Mode3 | Probability3 | Mode4 | Probability4 | Month | Top1Mode | Top1Value | Top2Mode | Top2Value | Top3Mode | Top3Value |
|----|-------|-------|-------|--------------|-------|--------------|--------|--------------|--------|--------------|--------|----------|-----------|----------|-----------|----------|-----------|
| 1 | xyz | wqu | cash | 0.16 | wire | 0.89 | upi | 0.81 | cheque | 0.69 | 201801 | wire | 0.89 | upi | 0.81 | cheque | 0.69 |
| 2 | wqu | xyz | wire | 0.28 | cash | 0.19 | upi | 0.77 | cheque | 0.58 | 201801 | upi | 0.77 | cheque | 0.58 | wire | 0.28 |
| 3 | pqr | xyz | upi | 0.35 | cash | 0.11 | cheque | 0.48 | wire | 0.66 | 201803 | wire | 0.66 | cheque | 0.48 | upi | 0.35 |
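To explain further: for row 1 (ID 1), wire has the highest probability (0.89), so it goes in the Top1Mode column and its value goes in the next column (Top1Value). Similarly, upi has the second-highest probability (0.81), so it goes in the Top2Mode column and its value goes in the next column (Top2Value).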
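For a single row, the ranking rule above can be sketched in plain Python. This is only an illustration of the intended logic, not the final solution; the helper name `top_modes` and the `slots` parameter are made up for this example, and the row is hard-coded from ID 1 in the table.

```python
import heapq

# Row for ID 1, copied from the sample table above.
row = {
    "Mode1": "cash", "Probability1": 0.16,
    "Mode2": "wire", "Probability2": 0.89,
    "Mode3": "upi", "Probability3": 0.81,
    "Mode4": "cheque", "Probability4": 0.69,
}

def top_modes(row, n=3, slots=4):
    """Return the n (mode, probability) pairs with the highest probability."""
    # Pair each ModeN with its ProbabilityN.
    pairs = [(row[f"Mode{i}"], row[f"Probability{i}"]) for i in range(1, slots + 1)]
    # nlargest returns the pairs ordered from highest to lowest probability.
    return heapq.nlargest(n, pairs, key=lambda pair: pair[1])

print(top_modes(row))
# [('wire', 0.89), ('upi', 0.81), ('cheque', 0.69)]
```

The first pair fills Top1Mode/Top1Value, the second Top2Mode/Top2Value, and the third Top3Mode/Top3Value.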
A solution in either Pandas or PySpark works for me.
One thing I can think of is to use a UDF (but I'd like to see whether anyone has a better solution):
from pyspark.sql.functions import lit, udf

# udf defaults to returnType=StringType(), so cast the Top*Value columns if numbers are needed.
@udf
def getProbability(Mode1, Probability1, Mode2, Probability2, Mode3, Probability3, Mode4, Probability4, num, mode):
    # Pair each mode with its probability.
    prob_list = []
    prob_list.append((Mode1, Probability1))
    prob_list.append((Mode2, Probability2))
    prob_list.append((Mode3, Probability3))
    prob_list.append((Mode4, Probability4))
    # Sort by probability, highest first.
    prob_list = sorted(prob_list, key=lambda x: x[1], reverse=True)
    # num is the zero-based rank; mode selects the name or the value.
    if mode == "Mode":
        return prob_list[num][0]
    else:
        return prob_list[num][1]

df = df.withColumn("Top1Mode", getProbability("Mode1", "Probability1", "Mode2", "Probability2", "Mode3", "Probability3", "Mode4", "Probability4", lit(0), lit("Mode"))) \
    .withColumn("Top1Value", getProbability("Mode1", "Probability1", "Mode2", "Probability2", "Mode3", "Probability3", "Mode4", "Probability4", lit(0), lit("Prob"))) \
    .withColumn("Top2Mode", getProbability("Mode1", "Probability1", "Mode2", "Probability2", "Mode3", "Probability3", "Mode4", "Probability4", lit(1), lit("Mode"))) \
    .withColumn("Top2Value", getProbability("Mode1", "Probability1", "Mode2", "Probability2", "Mode3", "Probability3", "Mode4", "Probability4", lit(1), lit("Prob"))) \
    .withColumn("Top3Mode", getProbability("Mode1", "Probability1", "Mode2", "Probability2", "Mode3", "Probability3", "Mode4", "Probability4", lit(2), lit("Mode"))) \
    .withColumn("Top3Value", getProbability("Mode1", "Probability1", "Mode2", "Probability2", "Mode3", "Probability3", "Mode4", "Probability4", lit(2), lit("Prob")))
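On the pandas side, one possible direction is to sort the probability columns once per row with NumPy and reuse the resulting order for both the mode and the value columns. The sketch below is not a tested production solution. It assumes the columns are named `Mode1..Mode4` and `Probability1..Probability4` as in the sample table, and the helper name `add_top_n` and its `n` and `slots` parameters are invented for this example.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Payer": ["xyz", "wqu", "pqr"],
    "Payee": ["wqu", "xyz", "xyz"],
    "Mode1": ["cash", "wire", "upi"],
    "Probability1": [0.16, 0.28, 0.35],
    "Mode2": ["wire", "cash", "cash"],
    "Probability2": [0.89, 0.19, 0.11],
    "Mode3": ["upi", "upi", "cheque"],
    "Probability3": [0.81, 0.77, 0.48],
    "Mode4": ["cheque", "cheque", "wire"],
    "Probability4": [0.69, 0.58, 0.66],
    "Month": [201801, 201801, 201803],
})

def add_top_n(frame, n=3, slots=4):
    """Return a copy of frame with TopKMode / TopKValue columns for K = 1..n."""
    mode_cols = [f"Mode{i}" for i in range(1, slots + 1)]
    prob_cols = [f"Probability{i}" for i in range(1, slots + 1)]
    modes = frame[mode_cols].to_numpy()
    probs = frame[prob_cols].to_numpy(dtype=float)

    # Sorting the negated probabilities ascending gives descending order;
    # the stable sort keeps the original slot order when two values tie.
    order = np.argsort(-probs, axis=1, kind="stable")[:, :n]
    rows = np.arange(len(frame))

    out = frame.copy()
    for k in range(n):
        # Pick, for every row, the slot that ranks k-th by probability.
        out[f"Top{k + 1}Mode"] = modes[rows, order[:, k]]
        out[f"Top{k + 1}Value"] = probs[rows, order[:, k]]
    return out

result = add_top_n(df)
print(result[["ID", "Top1Mode", "Top1Value", "Top2Mode", "Top2Value", "Top3Mode", "Top3Value"]])
```

Compared with the UDF above, each row is sorted only once rather than once per output column, and the value columns stay numeric.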