Drop duplicates of a column where null value is present

Question

I have a dataframe df1 whose first column (col1) contains customer ids. The second column (col2) holds sales, and some of its values are missing.

I want to drop duplicate customer ids in col1, but only in the rows where the sales value is missing.

I tried writing a function saying:

def drop(i):
          if i[col2] == np.nan:
             i.drop_duplicates(subset = 'col1')
          else:
             return i['col1']

I am getting an error saying that the truth value of a Series is ambiguous.

Thank you for reading. I would appreciate a solution.

Answer 1

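The following should work, using groupby, apply, dropna, and reset_index. Note that it removes every row whose col2 is missing, which matches the question as long as each customer also has at least one row with a sales value.

It assumes your data looks something like this: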

input:

   col1  col2
0  1001   2.0
1  1001   NaN
2  1002   4.0
3  1002   NaN

code:

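import pandas as pd
import numpy as np

# Dummy data: each customer id appears twice, once with a missing sale
data = {
    'col1': [1001, 1001, 1002, 1002],
    'col2': [2, np.nan, 4, np.nan],
}

df = pd.DataFrame(data)

# Solution: within each customer group, drop the rows whose col2 is NaN,
# then discard the group index so the rows are numbered from 0 again
result = df.groupby('col1').apply(lambda group: group.dropna(subset=['col2'])).reset_index(drop=True)
print(result)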

output:

   col1  col2
0  1001   2.0
1  1002   4.0
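
If some customers have only a single row and that row's sale is missing, you may want to keep it. The sketch below is one way to do that with a boolean mask instead of groupby. The extra customer 1003 is made-up data added for illustration. It also shows why the attempt in the question fails: a comparison with == np.nan never matches because NaN is not equal to itself, and using a whole Series in an if statement is what raises the "truth value of a Series is ambiguous" error. isna() avoids both problems.

```python
import numpy as np
import pandas as pd

# Hypothetical data: customer 1003 has a single row whose sale is missing
df = pd.DataFrame({
    'col1': [1001, 1001, 1002, 1002, 1003],
    'col2': [2, np.nan, 4, np.nan, np.nan],
})

# True for every row whose customer id occurs more than once
is_duplicated = df['col1'].duplicated(keep=False)

# True for every row with a missing sale (element-wise, so no ambiguous truth value)
is_missing = df['col2'].isna()

# Drop only the rows that are both duplicated and missing
result = df[~(is_duplicated & is_missing)].reset_index(drop=True)
print(result)
```

With this data, 1001 and 1002 keep their rows with a sales value and 1003 keeps its only row. One caveat: if a customer has several rows and all of them are missing, this mask removes all of them, so that case would need separate handling.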

solution1
1 2021-02-26 14:56:23
