
How to match rows based on certain columns in pandas?

I have a dataframe like this:

id     date          event    name     time
1      2016-10-01    A        leader   12:45
2      2016-10-01    A        AA       12:87
3      2016-10-01    A        BB       12:45

There are rows for each member in the event, but one row has the leader data as well. I want to exclude the rows with the data about the leader and add a column is_leader to indicate whether a member is the leader or not. Something like this:

id     date          event    name     time    is_leader
2      2016-10-01    A        AA       12:87   0
3      2016-10-01    A        BB       12:45   1

So I know that id=3 is the leader based on the time, which is 12:45 for both rows here. We can assume that no other member shares this time.

What is an efficient way to accomplish this in pandas? Here I have just one event as an example, but I'll have several of these and I need to do this for each event.

You can use groupby with a custom function f that returns a new column is_leader, which is True for every row whose time equals the time of the row with the text leader in column name within the same event:
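
To try this on the data printed next, the sample frame can be rebuilt along these lines. This is only a sketch: the column values are copied from the printout, and the times are assumed to be plain strings, since a value such as 12:87 would not parse as a clock time.

```python
import pandas as pd

# Sample data mirroring the printed frame; 'date' and 'time' are kept as
# strings (an assumption, the original dtypes are not shown).
df = pd.DataFrame({
    'id': [1, 2, 3, 1, 2, 3],
    'date': ['2016-10-01'] * 6,
    'event': ['A', 'A', 'A', 'B', 'B', 'B'],
    'name': ['leader', 'AA', 'BB', 'leader', 'AA', 'BB'],
    'time': ['12:45', '12:87', '12:45', '12:15', '12:15', '12:45'],
})
print(df)
```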

print (df)
   id       date event    name   time
0   1 2016-10-01     A  leader  12:45
1   2 2016-10-01     A      AA  12:87
2   3 2016-10-01     A      BB  12:45
3   1 2016-10-01     B  leader  12:15
4   2 2016-10-01     B      AA  12:15
5   3 2016-10-01     B      BB  12:45

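def f(x):
    # time of the row whose name is 'leader' within this event group
    leader_time = x.loc[x['name'] == 'leader', 'time'].iloc[0]
    # flag every row of the group that shares the leader's time
    return x.assign(is_leader=x['time'] == leader_time)

# group_keys=False keeps the original row index in the result
df = df.groupby('event', group_keys=False).apply(f)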
print (df)
   id       date event    name   time is_leader
0   1 2016-10-01     A  leader  12:45      True
1   2 2016-10-01     A      AA  12:87     False
2   3 2016-10-01     A      BB  12:45      True
3   1 2016-10-01     B  leader  12:15      True
4   2 2016-10-01     B      AA  12:15      True
5   3 2016-10-01     B      BB  12:45     False

One-line solution with a lambda function:

# the parentheses let the method chain span several lines
df['is_leader'] = (df.groupby('event')
                     .apply(lambda x: x['time'] == x.loc[x['name'] == 'leader', 'time'].iloc[0])
                     .reset_index(level=0, drop=True))
print (df)
   id       date event    name   time is_leader
0   1 2016-10-01     A  leader  12:45      True
1   2 2016-10-01     A      AA  12:87     False
2   3 2016-10-01     A      BB  12:45      True
3   1 2016-10-01     B  leader  12:15      True
4   2 2016-10-01     B      AA  12:15      True
5   3 2016-10-01     B      BB  12:45     False
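
If groupby.apply turns out to be slow on many events, the same flag can probably be computed without it by looking up each event's leader time with map. This is a different technique from the one above, shown here as a sketch; it assumes exactly one leader row per event, and the name leader_time is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 1, 2, 3],
    'date': ['2016-10-01'] * 6,
    'event': ['A', 'A', 'A', 'B', 'B', 'B'],
    'name': ['leader', 'AA', 'BB', 'leader', 'AA', 'BB'],
    'time': ['12:45', '12:87', '12:45', '12:15', '12:15', '12:45'],
})

# One leader time per event, indexed by event
# (assumption: each event has exactly one 'leader' row).
leader_time = df.loc[df['name'] == 'leader'].set_index('event')['time']

# Map each row's event to its leader time and compare with the row's own time.
df['is_leader'] = df['time'] == df['event'].map(leader_time)
print(df)
```

The leader rows themselves are still flagged True at this point, so they need to be dropped afterwards in the same way as with the groupby version.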

Then remove the leader rows by boolean indexing and cast the boolean column to int:

# .copy() avoids a SettingWithCopyWarning when assigning to the filtered frame
df = df[df['name'] != 'leader'].copy()
df['is_leader'] = df['is_leader'].astype(int)
print (df)
   id       date event name   time  is_leader
1   2 2016-10-01     A   AA  12:87          0
2   3 2016-10-01     A   BB  12:45          1
4   2 2016-10-01     B   AA  12:15          1
5   3 2016-10-01     B   BB  12:45          0
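
Since the whole approach rests on the assumption that no other member shares the leader's time, it may be worth checking that each event ends up with exactly one flagged member. A minimal sketch, using a frame copied from the output above (the names result, leaders_per_event and bad_events are illustrative):

```python
import pandas as pd

# Final frame as printed above, with the 'date' column left out for brevity.
result = pd.DataFrame({
    'id': [2, 3, 2, 3],
    'event': ['A', 'A', 'B', 'B'],
    'name': ['AA', 'BB', 'AA', 'BB'],
    'time': ['12:87', '12:45', '12:15', '12:45'],
    'is_leader': [0, 1, 1, 0],
})

# Count flagged members per event; anything other than 1 means the
# leader's time matched no member or more than one member.
leaders_per_event = result.groupby('event')['is_leader'].sum()
bad_events = leaders_per_event[leaders_per_event != 1]
print(bad_events)
```

An empty bad_events means the uniqueness assumption held for every event in the data.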
