Pandas drop duplicates on one column and keep only rows with the most frequent value in another column

Question

I have a dataframe that looks like the following:

ip_address    malware_type
ip_1          malware_1
ip_2          malware_2
ip_1          malware_1
ip_1          malware_1
ip_1          malware_2
ip_2          malware_2
ip_2          malware_3
.
.
.

I want to drop duplicate rows based on the 'ip_address' column, but when the dropping occurs I want to keep only the 'malware_type' value that is the most frequent for each IP. The resulting dataframe should look like:

ip_address    malware_type
ip_1          malware_1
ip_2          malware_2
.
.
.

I would really appreciate any help to achieve the above. Thanks.

Answer 1

Let us try mode:

s = df.groupby('ip_address').malware_type.agg(lambda x: x.mode()[0])  # append .reset_index() to get a DataFrame
s
ip_address
ip_1    malware_1
ip_2    malware_2
Name: malware_type, dtype: object
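
For reference, here is a minimal self-contained sketch of the same idea that rebuilds the sample rows from the question and returns a two-column DataFrame instead of a Series. The variable name result is only illustrative, and the data is just the seven rows shown in the question.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "ip_address": ["ip_1", "ip_2", "ip_1", "ip_1", "ip_1", "ip_2", "ip_2"],
    "malware_type": ["malware_1", "malware_2", "malware_1", "malware_1",
                     "malware_2", "malware_2", "malware_3"],
})

# For each IP, take the first entry of the group's mode(s),
# then turn the grouped index back into a regular column
result = (
    df.groupby("ip_address")["malware_type"]
      .agg(lambda x: x.mode()[0])
      .reset_index()
)
print(result)
```

With reset_index() the output has ip_address and malware_type as ordinary columns, which matches the shape asked for in the question.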

Answer 3

You can use GroupBy.agg with pd.Series.mode:

df.groupby('ip_address').malware_type.agg(pd.Series.mode)

ip_address
ip_1    malware_1
ip_2    malware_2
Name: malware_type, dtype: object
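
One caveat: Series.mode returns every value that ties for the highest count, so a group with a tie may not reduce to a single value with pd.Series.mode alone. A possible workaround, sketched below, is to take the first of the modes explicitly. The helper name first_mode and the tied sample data are made up for illustration; since mode() returns its values sorted, a tie is broken by sort order here.

```python
import pandas as pd

def first_mode(values: pd.Series):
    # mode() returns all tied values in sorted order; keep only the first
    return values.mode().iat[0]

# Hypothetical data: ip_1 has a tie between malware_1 and malware_2
tied = pd.DataFrame({
    "ip_address": ["ip_1", "ip_1", "ip_2", "ip_2", "ip_2"],
    "malware_type": ["malware_2", "malware_1", "malware_3", "malware_3", "malware_1"],
})

result = tied.groupby("ip_address")["malware_type"].agg(first_mode)
print(result)
```

If a different tie-breaking rule is needed (for example, first seen), the helper is the place to change it.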

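You can also use scipy.stats.mode here. Note that newer SciPy releases (1.11 and later, as far as I know) no longer accept non-numeric input, so this variant may only work with older SciPy versions.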

from scipy.stats import mode
df.groupby('ip_address').malware_type.agg(lambda x: mode(x).mode)

ip_address
ip_1    malware_1
ip_2    malware_2
Name: malware_type, dtype: object

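Another option is to use collections.Counter's most_common method.

from collections import Counter

def md(s):
    # Count each value in the group and return the most common one
    c = Counter(s)
    return c.most_common(1)[0][0]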

df.groupby('ip_address').malware_type.agg(md)

ip_address
ip_1    malware_1
ip_2    malware_2
Name: malware_type, dtype: object
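
If you would rather stay close to the "drop duplicates" wording of the question, here is a sketch of a different technique from the ones above: count each (ip_address, malware_type) pair, sort so the highest count comes first within each IP, and then call drop_duplicates on ip_address. The column name n and the variable names are assumptions chosen for this example.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "ip_address": ["ip_1", "ip_2", "ip_1", "ip_1", "ip_1", "ip_2", "ip_2"],
    "malware_type": ["malware_1", "malware_2", "malware_1", "malware_1",
                     "malware_2", "malware_2", "malware_3"],
})

# Count how often each (ip_address, malware_type) pair occurs
counts = (
    df.groupby(["ip_address", "malware_type"])
      .size()
      .reset_index(name="n")
)

# Put the most frequent pair first within each IP, then keep one row per IP
top = (
    counts.sort_values(["ip_address", "n"], ascending=[True, False])
          .drop_duplicates("ip_address")
          .drop(columns="n")
          .reset_index(drop=True)
)
print(top)
```

This keeps the result as a DataFrame throughout, at the cost of an extra counting step.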

solution1
4 ACCEPTED 2020-08-08 18:58:25

solution2
0 2020-08-08 19:00:59

solution3
0 2020-08-08 19:04:37
