
Retain only duplicated rows in a pandas dataframe

I have a dataframe with two columns: "Agent" and "Client". Each row corresponds to an interaction between an agent and a client.

I want to keep only the rows for clients that had interactions with at least 2 agents.

How can I do that?

It is worth adding that you can now use df.duplicated(). With keep=False, every row whose Client value occurs more than once is kept:

df = df.loc[df.duplicated(subset='Client', keep=False)]
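To illustrate what keep=False does, here is a minimal sketch on made-up data (the agent and client labels are hypothetical, not from the question). It filters on the Client column, since the question asks about clients seen by several agents. Note that client c3 is kept even though only one agent served it, because duplicated() counts rows, not distinct agents.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration.
df = pd.DataFrame({
    "Agent":  ["a1", "a2", "a1", "a3", "a3"],
    "Client": ["c1", "c1", "c2", "c3", "c3"],
})

# keep=False marks every member of a duplicate group,
# not just the second and later occurrences.
mask = df.duplicated(subset="Client", keep=False)
repeated = df.loc[mask]
print(repeated)
```

Rows for c1 and c3 survive; the single c2 row is dropped.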

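Use groupby and transform to count the rows per client. transform('size') broadcasts each group's row count back onto its rows; older code used transform('value_counts') for this, which recent pandas versions may reject as an invalid transform name.

df[df.Client.groupby(df.Client).transform('size') > 1]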

Note that, as mentioned here, one agent might interact with the same client multiple times. Such rows would be retained as false positives. If you do not want this, add a drop_duplicates call before filtering, so that each (Agent, Client) pair is counted once:

df = df.drop_duplicates()
df = df[df.Client.groupby(df.Client).transform('size') > 1]
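If the repeated interactions should stay in the result while still requiring at least two different agents, one option is to count distinct agents per client with transform('nunique') instead of dropping duplicates first. This is a sketch under that assumption, again on hypothetical data; it is not part of the original answers.

```python
import pandas as pd

# Hypothetical sample data: c3 is served twice, but by the same agent.
df = pd.DataFrame({
    "Agent":  ["a1", "a2", "a1", "a3", "a3"],
    "Client": ["c1", "c1", "c2", "c3", "c3"],
})

# Number of distinct agents per client, broadcast back to every row.
agents_per_client = df.groupby("Client")["Agent"].transform("nunique")

# Keep only rows whose client was handled by two or more different agents.
multi_agent = df[agents_per_client > 1]
print(multi_agent)
```

Here only the two c1 rows remain, because c3 was served twice by the same agent.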

print(df)
   A  B
0  1  2
1  2  5
2  3  1
3  4  1
4  5  5
5  6  1

mask = df.B.groupby(df.B).transform('size') > 1
print(mask)
0    False
1     True
2     True
3     True
4     True
5     True
Name: B, dtype: bool

df = df[mask]
print(df)
   A  B
1  2  5
2  3  1
3  4  1
4  5  5
5  6  1
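The printed example can be reproduced end to end with the sketch below. The frame is rebuilt from the values shown above; the use of transform('size') and the groupby().filter() alternative are my own choices for illustration, not something the original answer prescribes.

```python
import pandas as pd

# Same values as the printed example.
df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6], "B": [2, 5, 1, 1, 5, 1]})

# Group size broadcast to each row; True where the B value occurs more than once.
mask = df.groupby("B")["B"].transform("size") > 1
filtered = df[mask]

# Alternative: drop every group that consists of a single row.
filtered_alt = df.groupby("B").filter(lambda g: len(g) > 1)

print(filtered)
```

Both approaches keep the rows with index 1 to 5 and drop row 0, whose B value appears only once. The mask form is usually the faster of the two on large frames, since filter calls a Python function per group.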

