Replace column value based on value in other column, for all rows in a pandas dataframe

Question

I am having trouble thinking pythonically about something, and would love some guidance.

I have a dataframe that contains columns with dates of events at which certain files should be uploaded, and a column with the names of those events. So events can be X, Y, Z, and files can be 1, 2, 3.

Not all files need to be uploaded at all events, ie if it's Event X, then files 1, 2, and 3 need to be uploaded, but if it's Event Y, then only file 3 needs to be uploaded. The date columns either have a date in them, or are blank.

What I want to do is, for all the files for events that are not needed, replace blank with "Not Needed".

Initial:

    File1   File2  File3
X   Aug 1          Sept 1
X   Aug 3   Aug 4  Sept 9
Y                  Sept 10
Z   Aug 12
X   Aug 13  Aug 15
Z   Aug 1

Goal

     File1   File2  File3
X   Aug 1          Sept 1
X   Aug 3   Aug 4  Sept 9
Y   NN      NN     Sept 10
Z   Aug 12  NN     NN
X   Aug 13  Aug 15
Z   Aug 1   NN     NN

So in other words, for the blanks that SHOULD be blank because a file is not expected, replace that value with "Not Needed", while leaving the other blanks alone.

I have tried doing this with .replace(), .apply() with functions, and I am not having any success.

The code below sort of works, but it works not only when there is a match, but even when there is not a match.

Fill in descriptive text for scales not collected at certain visits (where upload dates would be blank)
df_combined['FAQ-Audio-upDate'] = np.where(df_combined['VisitName'] == "Screening", "FAQ Not Expected", "")
df_combined['FAQ-Form-upDate'] = np.where(df_combined['VisitName'] == "Screening", "FAQ Not Expected", "")

How can I change the value in one column based on the value in another column, across the entire dataframe? What I want is basically this:

For every row in the dataframe If the value in the VisitName column == X Change the value in ColumnA to "Not Expected"
Thank you!!

Answer 1

So, I can answer your base question pretty easily, but there's a couple of style things I think you might want to change that I'd like to get into. I'm fairly certain this question has been addressed in other threads, but you've got a couple of problems wrapped up in 1 so I'm just going to address them here

For every row in the dataframe If the value in the VisitName column == X Change the value in ColumnA to "Not Expected"

You want to be using index slices to set values. Get a boolean mask of the dataframe based on the logic you want, use that to create a new dataframe containing only the rows that you want to update, get the index of this new dataframe, and use this index with the original dataframe to change the values over.

    import pandas as pd
    df = pd.DataFrame(data=None, index=["X", "Y", "Z"], columns=["VisitName",
    "ColumnA", "ColumnB"])

    not_expected_index = df[df.loc[:, "VisitName"] == "X"].index
    df.loc[not_expected_index, "ColumnA"] = "Not Expected"

This is the preferred way in pandas to change values in a DataFrame based on other values in another column.

Now, there's a couple of things about the original DataFrame you posted that I'd like to mention. First, if you already have Null values in the dataframe cells, then you can use the pandas dataframe fillna method to populate these values.

    df.fillna("Not Expected")

Second, why do you want to use the string "NN" or "Not Needed" over the default Null value? For any operations within pandas, I prefer to stick with the actual null values, so that you can use aggregation functions like sum or count freely on dataframes with null values.

Second, the index contains repeated values:

    df.index = ["X", "X", "Y", "Z", "X", "Z"]

Dataframes will allow duplicate index values, but they can behave in funny ways that you need to be aware of.

For example:

    print(df)

returns

        VisitName ColumnA ColumnB
    X       NaN     NaN     NaN
    X       NaN     NaN     NaN
    Y       NaN     NaN     NaN
    Z       NaN     NaN     NaN
    X       NaN     NaN     NaN
    Z       NaN     NaN     NaN

setting a value in VisitName for X

    df.loc["X", "VisitName"] = "test"

returns

      VisitName ColumnA ColumnB
    X      "test"   NaN     NaN
    X      "test"   NaN     NaN
    Y       NaN     NaN     NaN
    Z       NaN     NaN     NaN
    X      "test"   NaN     NaN
    Z       NaN     NaN     NaN

If I were tackling this problem, I'd either use the date as the index, with a True or False value in a file's column depending on whether it needs to be sent or not on that date,

     index       File1 File2 File3 
    8/01/17      True  False True
    8/08/17      False True  True
    8/15/17      True  True  False 
    8/24/17      False True  False 
    9/01/17      False False False 
    9/12/17      True  False True

or I'd just use an integer index, with a column for the date and a column for what file is being sent.

  index    date     file
    0      8/01/17   1
    1      8/01/17   2
    2      8/08/17   2
    3      8/15/17   1
    4      8/15/17   2
    5      8/15/17   3

I mean, if you're locked into using the other setup, that's fine, but I think these would be easier dataframe setups to work with, since they'd support groupby methods more easily.

Also, keep in mind that if you're using a for loop, then you might as well not be using pandas. The whole point of pandas is that it uses C to speed up index operations. Never use

    for row in df.index:
        df.loc[row, 'columna'] += 2.

Always use

    df.loc[:, 'columna'] += 2.

Replace column value based on value in other column, for all rows in a pandas dataframe

Question

1 answers

solution1
6 2017-08-28 19:15:14

Replace column value based on value in other column, for all rows in a pandas dataframe

Question

1 answers

solution1 6 2017-08-28 19:15:14

solution1
6 2017-08-28 19:15:14