
Pandas - Dataframe - Conditional add

I want to add a new column to my DataFrame. I have a list of event columns, and if any of them is different from 0 in a given row, the value of the new column for that row should be 1.

I think it should be very simple, but I am fairly new to Python.

The dataframe looks like this:

import pandas as pd
df = pd.DataFrame({"ID":[1,1,2,3],"Date":["01/01/2019","01/01/2019","02/01/2019","02/01/2019"],"Event_1":[1,0,0,0],"Event_2":[1,0,0,1],"Event_3":[0,1,0,1],"Other":[0,0,0,1]})

print(df)
ID    Date         Event_1 Event_2 Event_3 Other
1     01/01/2019   1       1       0       0
1     01/01/2019   0       0       1       0
2     02/01/2019   0       0       0       0
3     02/01/2019   0       1       1       1

And it should look like this:

ID    Date         Event_1 Event_2 Event_3 Other Conditional_row
1     01/01/2019   1       1       0       0     1
1     01/01/2019   0       0       1       0     1
2     02/01/2019   0       0       0       0     0
3     02/01/2019   0       1       1       1     1

What is the easiest way of doing it? What is the best?

Use filter + any

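Since all non-zero integers are truthy in Python, calling any row-wise on the filtered DataFrame produces the correct boolean mask. Since you want an integer output, cast the mask to a one-byte integer type (int8), which keeps the result compact. On older pandas versions, .view('i1') reinterprets the boolean mask as integers without copying, but Series.view is deprecated in recent releases, so astype is the safer choice. Note that like="Event" does not match the Other column; for the sample data this makes no difference.

df.filter(like="Event").any(axis=1).astype('i1')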

0    1
1    1
2    0
3    1
dtype: int8
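
The same idea can be sketched step by step on the question's sample frame. This is a minimal illustration, assuming a pandas version that accepts axis as a keyword; the names events, mask and flag are only illustrative and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "Date": ["01/01/2019", "01/01/2019", "02/01/2019", "02/01/2019"],
    "Event_1": [1, 0, 0, 0],
    "Event_2": [1, 0, 0, 1],
    "Event_3": [0, 1, 0, 1],
    "Other": [0, 0, 0, 1],
})

# Keep only the columns whose name contains "Event" ("Other" is not matched).
events = df.filter(like="Event")

# any(axis=1) treats every non-zero value as True and returns one value per row.
mask = events.any(axis=1)

# Cast the boolean mask to a one-byte integer type.
flag = mask.astype("int8")

print(list(events.columns))  # ['Event_1', 'Event_2', 'Event_3']
print(flag.tolist())         # [1, 1, 0, 1]
```

If the Other column should also count as an event, it has to be selected explicitly, for example with a regex or an explicit column list.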

Using DataFrame.filter, eq and any

First we filter the columns whose names start with Event or Other. Then we check, row by row, whether any value is eq (equal) to 1:

df['Conditional_row'] = df.filter(regex="^Event|^Other").eq(1).any(axis=1).astype(int)
   ID        Date  Event_1  Event_2  Event_3  Other  Conditional_row
0   1  01/01/2019        1        1        0      0                1
1   1  01/01/2019        0        0        1      0                1
2   2  02/01/2019        0        0        0      0                0
3   3  02/01/2019        0        1        1      1                1
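
The question asks for values "different from 0", while eq(1) matches only values exactly equal to 1. On the 0/1 sample data both tests agree. The sketch below uses a small made-up frame (the counts are hypothetical, not taken from the question) to show where they would differ.

```python
import pandas as pd

# Hypothetical data: row 0 holds a count of 2, row 1 is all zeros.
counts = pd.DataFrame({"Event_1": [2, 0, 1], "Event_2": [0, 0, 0]})

# eq(1) flags only rows that contain a value exactly equal to 1.
by_eq = counts.eq(1).any(axis=1).astype(int)

# ne(0) flags rows that contain any non-zero value.
by_ne = counts.ne(0).any(axis=1).astype(int)

print(by_eq.tolist())  # [0, 0, 1]
print(by_ne.tolist())  # [1, 0, 1]
```

If the event columns can hold values other than 0 and 1, ne(0) is probably the closer match to the stated requirement.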

Or use:

df['Conditional_row'] = df[['Event_1', 'Event_2', 'Event_3', 'Other']].ne(0).any(axis=1).astype(int)

And now:

print(df)

Output:

   ID        Date  Event_1  Event_2  Event_3  Other  Conditional_row
0   1  01/01/2019        1        1        0      0                1
1   1  01/01/2019        0        0        1      0                1
2   2  02/01/2019        0        0        0      0                0
3   3  02/01/2019        0        1        1      1                1

Suppose your data frame is stored in an object called df. I believe this is the most efficient way to do this:

df["Conditional_row"] = 0
df.loc[df[["Event_1","Event_2","Event_3","Other"]].sum(axis=1) > 0, "Conditional_row"] = 1

The output looks like this:

print(df)
   ID        Date  Event_1  Event_2  Event_3  Other  Conditional_row
0   1  01/01/2019        1        1        0      0                1
1   1  01/01/2019        0        0        1      0                1
2   2  02/01/2019        0        0        0      0                0
3   3  02/01/2019        0        1        1      1                1

What I did here was:

  1. I created a new column filled with zeroes.
  2. I selected all the rows where the row-wise sum of the columns in the list ["Event_1","Event_2","Event_3","Other"] is greater than 0.
  3. The column "Conditional_row" of the rows that meet that condition is updated with the value 1.

The expression df[["Event_1","Event_2","Event_3","Other"]].sum(axis=1) > 0 is called a mask: it evaluates to a boolean Series (a vector of True and False values), and df.loc uses it to select the rows where the value is True. Selecting with a boolean mask is vectorized, and it is typically an efficient way to manipulate data frames. Note that the sum test assumes the event values are never negative.
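
The three steps listed above can be sketched on the sample columns as follows. The second part uses made-up values (not from the question) to illustrate one assumption behind the sum test: it relies on the event columns being non-negative, because values of opposite sign can cancel out. The names cols, mask and signed are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "Event_1": [1, 0, 0, 0],
    "Event_2": [1, 0, 0, 1],
    "Event_3": [0, 1, 0, 1],
    "Other": [0, 0, 0, 1],
})
cols = ["Event_1", "Event_2", "Event_3", "Other"]

# Step 1: start with a column of zeros.
df["Conditional_row"] = 0

# Step 2: build the boolean mask from the row-wise sum.
mask = df[cols].sum(axis=1) > 0

# Step 3: set the new column to 1 only where the mask is True.
df.loc[mask, "Conditional_row"] = 1
print(df["Conditional_row"].tolist())  # [1, 1, 0, 1]

# Hypothetical signed data: 1 and -1 cancel, so the sum test misses row 0.
signed = pd.DataFrame({"Event_1": [1, 0], "Event_2": [-1, 0]})
print((signed.sum(axis=1) > 0).tolist())  # [False, False]
print(signed.ne(0).any(axis=1).tolist())  # [True, False]
```

For plain 0/1 indicator columns like those in the question, the sum test and the ne(0).any test give the same result.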
