
How to create a new column based on one of three other columns?

I have a DataFrame with a movie name column and 3 other columns (let's call them A, B, and C) that hold ratings from 3 different sources. Many movies have only one rating, some have ratings from a combination of the 3 sources, and some have no ratings at all. I want to create a new column that will:

  1. If column A has a rating, use A.
  2. If column A is empty, use the rating from B.
  3. If column B is also empty, use the rating from C.
  4. If column C is also empty, return "Unrated".

This is what I have in my code so far:

def check_rating(rating):
    if newyear['Yahoo Rating'] != "\\N":
        return rating
    else:
        if newyear['Movie Mom Rating'] != "\\N":
            return rating
        else:
            if newyear['Critc Rating'] != "\\N":
                return rating
            else:
                return "Unrated"

df['Rating'] = df.apply(check_rating, axis=1)

The error I get is:

ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', 'occurred at index 0')

For visual of my dataframe, here is newyear.head() :

[image: the newyear DataFrame]

I am not sure what this ValueError means or how to fix it, and I also don't know whether this is the right way to approach the problem.

I would do something like this:

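import numpy as np

df = df.replace('\\N', np.nan)  # treat the "\N" placeholder as missing
df['Rating'] = (df['Yahoo Rating']
                .fillna(df['Movie Mom Rating'])  # fall back to the second source
                .fillna(df['Critic Rating'])     # then to the third
                .fillna("Unrated"))              # no source has a rating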

Your code doesn't work because newyear['Yahoo Rating'] != "\\N" compares the whole column at once, so it is a Boolean Series rather than a single value. What you are saying is something like if [True, False, True, False]: and that is the source of the ambiguity. How should such a condition be evaluated? Should the branch run only if all of the values are True, or would just one of them be enough?
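The ambiguity can be reproduced in a few lines. This is a minimal sketch that uses a made-up three-element Series as a stand-in for newyear['Yahoo Rating']; the names ratings and mask are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for newyear['Yahoo Rating']; the values are made up.
ratings = pd.Series(["7", r"\N", "5"])

mask = ratings != r"\N"   # element-wise comparison: one Boolean per row
print(mask.tolist())

try:
    if mask:              # pandas refuses to reduce the Series to one truth value
        pass
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

print(mask.any())         # True if at least one row has a rating
print(mask.all())         # True only if every row has a rating
```

Neither .any() nor .all() answers the original question, which needs a decision per row rather than one decision for the whole column.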

As M. Klugerford explained, you can change it so that it is evaluated row by row (and therefore returns a single value). However, row-by-row apply operations are generally slow, and pandas has great tools for handling missing data. That's why I am suggesting the fillna approach above.
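To check that the two approaches agree, here is a small sketch on an illustrative frame. The column names A, B, C, the sample values, and the helper name first_rating are assumptions for the example, not the asker's real data.

```python
import numpy as np
import pandas as pd

# Illustrative sample; r"\N" marks a missing rating.
df = pd.DataFrame({
    "A": ["7", r"\N", r"\N", r"\N", r"\N"],
    "B": ["6", "5", r"\N", "4", r"\N"],
    "C": [r"\N", "7", r"\N", "1", "3"],
})

def first_rating(row):
    # Row-by-row version: return the first column that holds a rating.
    for col in ("A", "B", "C"):
        if row[col] != r"\N":
            return row[col]
    return "Unrated"

by_row = df.apply(first_rating, axis=1)

# Vectorized version: mark r"\N" as missing, then fill A from B, then from C.
clean = df.replace(r"\N", np.nan)
vectorized = clean["A"].fillna(clean["B"]).fillna(clean["C"]).fillna("Unrated")

print(vectorized.tolist())
print(by_row.tolist() == vectorized.tolist())
```

Both versions produce the same column here; the vectorized one avoids calling a Python function once per row, which is where the speed difference on large frames would be expected to come from.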

In your original function you return rating, but rating is the row, not the value of any column. Index into the row instead:

>>> df
    A   B   C Genre Title Year
0   7   6  \N    g1    m1   y1
1  \N   5   7    g2    m2   y2
2  \N  \N  \N    g3    m3   y3
3  \N   4   1    g4    m4   y4
4  \N  \N   3    g5    m5   y5

>>> def rating(row):
...     if row['A'] != r'\N':
...         return row['A']
...     if row['B'] != r'\N':
...         return row['B']
...     if row['C'] != r'\N':
...         return row['C']
...     return 'Unrated'
...

>>> df['Rating'] = df.apply(rating, axis = 1)
>>> df
    A   B   C Genre Title Year   Rating
0   7   6  \N    g1    m1   y1        7
1  \N   5   7    g2    m2   y2        5
2  \N  \N  \N    g3    m3   y3  Unrated
3  \N   4   1    g4    m4   y4        4
4  \N  \N   3    g5    m5   y5        3

