How to determine top 3 column values for each row

I have a dataframe in the following format:

| ID | Payer | Payee | Mode1 | Probability1 | Mode2 | Probability2 | Mode3  | Probability3 | Mode4  | Probability4 | Month  |
|----|-------|-------|-------|--------------|-------|--------------|--------|--------------|--------|--------------|--------|
| 1  | xyz   | wqu   | cash  | 0.16         | wire  | 0.89         | upi    | 0.81         | cheque | 0.69         | 201801 |
| 2  | wqu   | xyz   | wire  | 0.28         | cash  | 0.19         | upi    | 0.77         | cheque | 0.58         | 201801 |
| 3  | pqr   | xyz   | upi   | 0.35         | cash  | 0.11         | cheque | 0.48         | wire   | 0.66         | 201803 |

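Each Probability column holds the value for the corresponding Mode column (Probability1 belongs to Mode1, and so on).

Now I want to get the top 3 probability values for each row, looking across the columns.

Something like this: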

| ID | Payer | Payee | Mode1 | Probability1 | Mode2 | Probability2 | Mode3  | Probability3 | Mode4  | Probability4 | Month  | Top1Mode | Top1Value | Top2Mode | Top2Value | Top3Mode | Top3Value |
|----|-------|-------|-------|--------------|-------|--------------|--------|--------------|--------|--------------|--------|----------|-----------|----------|-----------|----------|-----------|
| 1  | xyz   | wqu   | cash  | 0.16         | wire  | 0.89         | upi    | 0.81         | cheque | 0.69         | 201801 | wire     | 0.89      | upi      | 0.81      | cheque   | 0.69      |
| 2  | wqu   | xyz   | wire  | 0.28         | cash  | 0.19         | upi    | 0.77         | cheque | 0.58         | 201801 | upi      | 0.77      | cheque   | 0.58      | wire     | 0.28      |
| 3  | pqr   | xyz   | upi   | 0.35         | cash  | 0.11         | cheque | 0.48         | wire   | 0.66         | 201803 | wire     | 0.66      | cheque   | 0.48      | upi      | 0.35      |

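To explain further: for row 1 (ID 1), wire has the highest probability (0.89), so it goes in the Top1Mode column and its value goes in the next column (Top1Value). Similarly, upi has the second-highest probability, so it goes in the Top2Mode column with its value in Top2Value.

A solution in either Pandas or PySpark works for me.

One thing I can think of is using a UDF (but I would like to see whether anyone has a better solution):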
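As a minimal sketch of that ranking for a single row, in plain Python: the row is written as (mode, probability) pairs taken from ID 1 in the table, and `top_n` is a hypothetical helper name used only for illustration.

```python
# One row from the example table (ID 1), as (mode, probability) pairs.
row = [("cash", 0.16), ("wire", 0.89), ("upi", 0.81), ("cheque", 0.69)]


def top_n(pairs, n=3):
    """Return the n pairs with the highest probability, highest first."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


top3 = top_n(row)
print(top3)  # [('wire', 0.89), ('upi', 0.81), ('cheque', 0.69)]
```

The first pair corresponds to Top1Mode/Top1Value, the second to Top2Mode/Top2Value, and the third to Top3Mode/Top3Value.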

from pyspark.sql.functions import lit, udf

# Without an explicit returnType, udf defaults to StringType.
@udf
def getProbability(Mode1, Probability1, Mode2, Probability2, Mode3, Probability3, Mode4, Probability4, num, mode):
    # Pair each mode with its probability.
    prob_list = [
        (Mode1, Probability1),
        (Mode2, Probability2),
        (Mode3, Probability3),
        (Mode4, Probability4),
    ]
    # Sort by probability, highest first.
    prob_list = sorted(prob_list, key=lambda x: x[1], reverse=True)

    # num is the 0-based rank; mode selects the mode name or the probability.
    if mode == "Mode":
        return prob_list[num][0]
    else:
        return prob_list[num][1]

df = df.withColumn("Top1Mode", getProbability("Mode1", "Probability1", "Mode2", "Probability2", "Mode3", "Probability3", "Mode4", "Probability4", lit(0), lit("Mode"))) \
       .withColumn("Top1Value", getProbability("Mode1", "Probability1", "Mode2", "Probability2", "Mode3", "Probability3", "Mode4", "Probability4", lit(0), lit("Prob"))) \
       .withColumn("Top2Mode", getProbability("Mode1", "Probability1", "Mode2", "Probability2", "Mode3", "Probability3", "Mode4", "Probability4", lit(1), lit("Mode"))) \
       .withColumn("Top2Value", getProbability("Mode1", "Probability1", "Mode2", "Probability2", "Mode3", "Probability3", "Mode4", "Probability4", lit(1), lit("Prob"))) \
       .withColumn("Top3Mode", getProbability("Mode1", "Probability1", "Mode2", "Probability2", "Mode3", "Probability3", "Mode4", "Probability4", lit(2), lit("Mode"))) \
       .withColumn("Top3Value", getProbability("Mode1", "Probability1", "Mode2", "Probability2", "Mode3", "Probability3", "Mode4", "Probability4", lit(2), lit("Prob")))
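For the Pandas side, one possible sketch sorts each row's probabilities with NumPy and avoids a row-wise function. It assumes the columns are named Mode1 to Mode4 and Probability1 to Probability4 as in the table above; `mode_cols`, `prob_cols`, and `order` are names chosen here for illustration, and the sample frame is rebuilt from the three example rows.

```python
import numpy as np
import pandas as pd

# Sample data copied from the example table.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Payer": ["xyz", "wqu", "pqr"],
    "Payee": ["wqu", "xyz", "xyz"],
    "Mode1": ["cash", "wire", "upi"],
    "Probability1": [0.16, 0.28, 0.35],
    "Mode2": ["wire", "cash", "cash"],
    "Probability2": [0.89, 0.19, 0.11],
    "Mode3": ["upi", "upi", "cheque"],
    "Probability3": [0.81, 0.77, 0.48],
    "Mode4": ["cheque", "cheque", "wire"],
    "Probability4": [0.69, 0.58, 0.66],
    "Month": [201801, 201801, 201803],
})

mode_cols = [f"Mode{i}" for i in range(1, 5)]
prob_cols = [f"Probability{i}" for i in range(1, 5)]

modes = df[mode_cols].to_numpy()
probs = df[prob_cols].to_numpy(dtype=float)

# For each row, the column positions of the three largest probabilities,
# largest first. Negating the values turns an ascending sort into a
# descending one; a stable sort keeps the original order for ties.
order = np.argsort(-probs, axis=1, kind="stable")[:, :3]

# Pick the matching mode names and probabilities row by row.
top_modes = np.take_along_axis(modes, order, axis=1)
top_probs = np.take_along_axis(probs, order, axis=1)

for rank in range(3):
    df[f"Top{rank + 1}Mode"] = top_modes[:, rank]
    df[f"Top{rank + 1}Value"] = top_probs[:, rank]

print(df[["ID", "Top1Mode", "Top1Value", "Top2Mode", "Top2Value", "Top3Mode", "Top3Value"]])
```

Because the sort runs on the whole array at once, the same code should work unchanged if more Mode/Probability pairs are added, as long as `mode_cols` and `prob_cols` list them in matching order.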
