
Python DataFrames - Help needed with creating a new column based on several conditionals

I have a challenges DataFrame from the Great British Baking Show. Feel free to download the dataset:

import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-10-25/challenges.csv")

I've cleaned up the table and now have the columns series (1 through 10), episode (6 through 10), baker (the name of each baker), and result (what happened to the baker each week: eliminated or still on the show). I am looking for a solution that lets me add a new column called final_score that lists the final placement of each baker in each series.

In English, what I am trying to do is:

  1. Count the number of unique bakers per series.
  2. For each series and each episode, if result == 'OUT', record the baker's final score in a new column. The first score in each season equals the baker count from step 1. I then reduce the baker count by 1 for each baker eliminated, so the next episode's score starts from the number of bakers still remaining.

As an example, the number of bakers in season 1 is 10. In episode 1, both Lea and Mark were eliminated, so I want final_score to read 10 for both of them. In episode 2, both Annetha and Louise were eliminated, so I want their score to read 8.

I've spent most of the day on this problem and I'm fairly stuck. I've tried window functions, apply functions, and list comprehensions, but the closest I've gotten is pasted below. With attempt 1, I know the problem is at: if df.result == 'OUT': . I understand that this is a Series, but I've tried .result.items() , result.all() , result.any() , and if df.loc[df.result] == 'OUT': , and nothing seems to work.

Attempt 1

def final_score(df):
    # count the number of bakers per season
    baker_count = df.groupby('series')['baker'].nunique()
    # for each season
    for s in df.series:
        # create a counter for the number of bakers that have been eliminated. Start at 0
        bakers_out = 0
        bakers_remaining = baker_count[int(s)]
        # for each episode
        for e in df.episode:
            # does result say OUT for each contestant?
            if df.result == 'OUT':
                df['final_score'] = bakers_remaining
                # if so, then we'll add +1 to our bakers_out counter
                bakers_out += 1

                # set the final score category to our bakers_remaining counter
                df['final_score'] = bakers_remaining

                # subtract the number of bakers left by the amount we just lost
                bakers_remaining -= bakers_out
            else:
                next
    return df
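
For context on the error in Attempt 1: comparing a Series with a scalar yields one boolean per row, and using that result directly in an if statement raises a ValueError because its truth value is ambiguous. The following is a minimal sketch on a hypothetical toy frame (the rows and the flag column are made up for illustration and are not part of the dataset) showing the error and a boolean-mask alternative:

```python
import pandas as pd

# Hypothetical toy data for illustration only (not rows from the real dataset).
toy = pd.DataFrame({
    "baker": ["A", "B", "C"],
    "result": ["IN", "OUT", "IN"],
})

# Comparing a Series to a scalar gives one boolean per row, not a single True/False.
mask = toy["result"] == "OUT"

# Using that Series directly in an `if` raises ValueError (ambiguous truth value).
try:
    if mask:
        pass
    raised = False
except ValueError:
    raised = True

# A boolean mask is intended for row selection or assignment through .loc instead.
# "flag" is a made-up column name used only in this sketch.
toy.loc[mask, "flag"] = "eliminated"

print(raised)
print(toy)
```

Rows where the mask is False are left as missing values in the new column, which is the same behaviour that leaves non-eliminated bakers without a score in a mask-based solution.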

Attempt 2 wasn't about creating a new DataFrame but rather about troubleshooting the problem by printing my desired output to the console. This is pretty close, but I want bakers eliminated in the same episode to share a place: the two bakers that got out in series 1, episode 1 should both end up in 10th place, and the two bakers that got out the following week should both show 8th place.

baker_count = df.groupby('series')['baker'].nunique()

#for each series
for s in df.series.unique():  
    bakers_out = 0
    bakers_remaining = baker_count[int(s)]
    #for each episode
    for e in df.episode.unique():
        #create a list of results
        data_results = list(df[(df.series==s) & (df.episode==e)].result)
        for dr in data_results:
            if dr =='OUT':
                bakers_out += 1
                print (s,e,dr,';final place:',bakers_remaining,';bakers out:',bakers_out)  
            else:
                print (s,e,dr,'--')
        bakers_remaining -= 1



Snippet of the result

1.0 1.0 IN --
1.0 1.0 IN --
1.0 1.0 IN --
1.0 1.0 IN --
1.0 1.0 IN --
1.0 1.0 OUT ;final place: 10 ;bakers out: 1
1.0 1.0 OUT ;final place: 10 ;bakers out: 2
1.0 2.0 IN --
1.0 2.0 IN --
1.0 2.0 IN --
1.0 2.0 IN --
1.0 2.0 IN --
1.0 2.0 IN --
1.0 2.0 OUT ;final place: 9 ;bakers out: 3
1.0 2.0 OUT ;final place: 9 ;bakers out: 4
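
One possible adjustment to Attempt 2, offered as a sketch rather than a definitive fix: the loop lowers bakers_remaining by 1 after every episode, whereas the desired output needs it lowered by the number of bakers eliminated in that episode. The toy frame below is hypothetical and only mirrors the series 1 example (two bakers out in episode 1, two in episode 2); the bakers P1 to P6 and the placements dictionary are placeholders introduced for this illustration.

```python
import pandas as pd

# Hypothetical toy data mirroring the example: 10 bakers, two eliminated per episode.
bakers = ["Lea", "Mark", "Annetha", "Louise", "P1", "P2", "P3", "P4", "P5", "P6"]
out_by_episode = {1: {"Lea", "Mark"}, 2: {"Annetha", "Louise"}}
rows, remaining = [], list(bakers)
for ep, out in out_by_episode.items():
    for b in remaining:
        rows.append({"series": 1, "episode": ep, "baker": b,
                     "result": "OUT" if b in out else "IN"})
    remaining = [b for b in remaining if b not in out]
df = pd.DataFrame(rows)

baker_count = df.groupby("series")["baker"].nunique()

placements = {}  # (series, baker) -> final place
for s in df["series"].unique():
    bakers_remaining = int(baker_count.loc[s])
    # walk through this series' episodes in order
    for e in sorted(df.loc[df["series"] == s, "episode"].unique()):
        episode_rows = df[(df["series"] == s) & (df["episode"] == e)]
        out_rows = episode_rows[episode_rows["result"] == "OUT"]
        # everyone eliminated in this episode shares the same place
        for baker in out_rows["baker"]:
            placements[(int(s), baker)] = bakers_remaining
            print(s, e, baker, "final place:", bakers_remaining)
        # lower the count by the number eliminated, not by a fixed 1
        bakers_remaining -= len(out_rows)
```

With this change the two bakers eliminated in episode 1 both get 10 and the two eliminated in episode 2 both get 8 on the toy data.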

Thanks everyone and please let me know what other information I should provide.

You could try the following (df is your DataFrame):

m = df["result"].eq("OUT")
df["final_score"] = (
    df.groupby("series")["baker"].transform("nunique")
    - df[m].groupby("series")["baker"].cumcount()
)
df["final_score"] = df[m].groupby(["series", "episode"])["final_score"].transform("max")

Result for the first 2 seasons (not all columns):

print(df[m & df["series"].isin([1, 2])])
     series  episode      baker result  final_score
8         1        1        Lea    OUT         10.0
9         1        1       Mark    OUT         10.0
16        1        2    Annetha    OUT          8.0
17        1        2     Louise    OUT          8.0
25        1        3   Jonathan    OUT          6.0
34        1        4      David    OUT          5.0
43        1        5  Jasminder    OUT          4.0
70        2        1      Keith    OUT         12.0
81        2        2      Simon    OUT         11.0
91        2        3        Ian    OUT         10.0
92        2        3    Urvashi    OUT         10.0
101       2        4        Ben    OUT          8.0
112       2        5      Jason    OUT          7.0
113       2        5     Robert    OUT          7.0
123       2        6     Yasmin    OUT          5.0
135       2        7      Janet    OUT          4.0
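
To see what each step of the snippet above contributes, here is a minimal sketch on a hypothetical toy frame (placeholder rows that mirror only the first two episodes of series 1; the bakers P1 to P6 and the intermediate variable names are invented for illustration). It assumes, as the snippet does, that rows are ordered by episode within each series.

```python
import pandas as pd

# Hypothetical toy data: 10 bakers in series 1, two eliminated in each of episodes 1 and 2.
bakers = ["Lea", "Mark", "Annetha", "Louise", "P1", "P2", "P3", "P4", "P5", "P6"]
out_by_episode = {1: {"Lea", "Mark"}, 2: {"Annetha", "Louise"}}
rows, remaining = [], list(bakers)
for ep, out in out_by_episode.items():
    rows += [
        {"series": 1, "episode": ep, "baker": b, "result": "OUT" if b in out else "IN"}
        for b in remaining
    ]
    remaining = [b for b in remaining if b not in out]
df = pd.DataFrame(rows)

m = df["result"].eq("OUT")

# Step 1: number of distinct bakers in the series, repeated on every row.
total = df.groupby("series")["baker"].transform("nunique")

# Step 2: running count (0, 1, 2, ...) of eliminated rows within each series.
# Rows that are not eliminated are absent here and become NaN after alignment.
eliminated_before = df[m].groupby("series")["baker"].cumcount()

# Step 3: provisional placement, still distinct for bakers eliminated together.
provisional = total - eliminated_before
df["final_score"] = provisional

# Step 4: bakers eliminated in the same episode share the largest provisional value.
df["final_score"] = df[m].groupby(["series", "episode"])["final_score"].transform("max")

print(df[m])
```

Rows whose result is not OUT end up with NaN in final_score, which is why the scores in the output above are displayed as floats.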
