I'm familiar with how to drop duplicate rows using the keep parameter options 'first', 'last', and False. Nothing too complicated there, and there are plenty of examples around.
However, what I'm looking for is a way to find the duplicates but invert the usual behavior: if there are duplicates, keep all of them and drop only the first occurrence.
So instead of "drop if duplicate, keep the first", I want "keep if duplicate, drop the first".
Example:
Given this dataframe, and looking at duplicates in the cost column:
ID name type cost
0 0 a bb 1
1 1 a cc 2 <--- there are duplicates, so drop this row
2 1_0 a dd 2
3 2 a ee 3 <--- there are duplicates, so drop this row
4 2_0 a ff 3
5 2_1 a gg 3
6 2_2 a hh 3
If there are duplicates in the cost column, just drop the first occurrence, but keep the rest.
So my output would be:
ID name type cost
0 0 a bb 1
2 1_0 a dd 2
4 2_0 a ff 3
5 2_1 a gg 3
6 2_2 a hh 3
Here's the sample dataframe:
import pandas as pd
df = pd.DataFrame([
    ['0', 'a', 'bb', '1'],
    ['1', 'a', 'cc', '2'],
    ['1_0', 'a', 'dd', '2'],
    ['2', 'a', 'ee', '3'],
    ['2_0', 'a', 'ff', '3'],
    ['2_1', 'a', 'gg', '3'],
    ['2_2', 'a', 'hh', '3']], columns=['ID', 'name', 'type', 'cost'])
You can chain 2 masks created by DataFrame.duplicated with the bitwise OR operator and filter by boolean indexing:
df = df[df.duplicated('cost') | ~df.duplicated('cost', keep=False)]
print(df)
ID name type cost
0 0 a bb 1
2 1_0 a dd 2
4 2_0 a ff 3
5 2_1 a gg 3
6 2_2 a hh 3
Detail:
print(df.assign(mask1=df.duplicated('cost'), mask2=~df.duplicated('cost', keep=False)))
ID name type cost mask1 mask2
0 0 a bb 1 False True
1 1 a cc 2 False False
2 1_0 a dd 2 True False
3 2 a ee 3 False False
4 2_0 a ff 3 True False
5 2_1 a gg 3 True False
6 2_2 a hh 3 True False
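An equivalent way to build the same selection is with groupby helpers instead of duplicated; this is a sketch using cumcount and transform (standard pandas, the mask variable names are mine):

```python
import pandas as pd

df = pd.DataFrame([
    ['0', 'a', 'bb', '1'],
    ['1', 'a', 'cc', '2'],
    ['1_0', 'a', 'dd', '2'],
    ['2', 'a', 'ee', '3'],
    ['2_0', 'a', 'ff', '3'],
    ['2_1', 'a', 'gg', '3'],
    ['2_2', 'a', 'hh', '3']], columns=['ID', 'name', 'type', 'cost'])

# cumcount numbers the rows inside each cost group starting at 0,
# so cumcount > 0 means "not the first occurrence of this cost"
is_later_dupe = df.groupby('cost').cumcount() > 0

# a group of size 1 means this cost value is not duplicated at all
is_unique = df.groupby('cost')['cost'].transform('size') == 1

out = df[is_later_dupe | is_unique]
print(out)
```

Both masks are boolean Series aligned on df's index, so the bitwise OR selects exactly the same rows as the duplicated-based version.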
You can also do this with the XOR (^) operator. df.cost.duplicated(keep=False) marks every row whose cost appears more than once, while df.cost.duplicated() marks only the second and later occurrences, so their XOR is True exactly for the first occurrence of each duplicated cost. Negating with the NOT (~) operator keeps every other row, i.e. the rows where both masks agree:
df[~(df.cost.duplicated(keep=False) ^ df.cost.duplicated())]
Output
ID name type cost
0 0 a bb 1
2 1_0 a dd 2
4 2_0 a ff 3
5 2_1 a gg 3
6 2_2 a hh 3
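Since duplicated() can only be True where duplicated(keep=False) is True, the negated XOR reduces to exactly the OR-of-masks expression from the other answer; a quick sketch checking that on the sample data:

```python
import pandas as pd

df = pd.DataFrame([
    ['0', 'a', 'bb', '1'],
    ['1', 'a', 'cc', '2'],
    ['1_0', 'a', 'dd', '2'],
    ['2', 'a', 'ee', '3'],
    ['2_0', 'a', 'ff', '3'],
    ['2_1', 'a', 'gg', '3'],
    ['2_2', 'a', 'hh', '3']], columns=['ID', 'name', 'type', 'cost'])

# ~XOR keeps rows where both flags agree:
# later duplicates (both True) and unique costs (both False)
xor_mask = ~(df.cost.duplicated(keep=False) ^ df.cost.duplicated())
or_mask = df.duplicated('cost') | ~df.duplicated('cost', keep=False)

same = bool((xor_mask == or_mask).all())
print(same)
print(df[xor_mask])
```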
You can use groupby and pass a lambda function to grab the records after the first dupe, if a dupe exists:
>>> df.groupby('cost').apply(lambda group: group.iloc[1:] if len(group) > 1 else group).reset_index(drop=True)
ID cost name type
0 0 1 a bb
1 1_0 2 a dd
2 2_0 3 a ff
3 2_1 3 a gg
4 2_2 3 a hh
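One caveat: with reset_index the original row labels are lost, and recent pandas versions may warn when apply operates on the grouping column. A variant of the same idea that keeps the original index (group_keys=False is standard pandas; the rest is as in the answer):

```python
import pandas as pd

df = pd.DataFrame([
    ['0', 'a', 'bb', '1'],
    ['1', 'a', 'cc', '2'],
    ['1_0', 'a', 'dd', '2'],
    ['2', 'a', 'ee', '3'],
    ['2_0', 'a', 'ff', '3'],
    ['2_1', 'a', 'gg', '3'],
    ['2_2', 'a', 'hh', '3']], columns=['ID', 'name', 'type', 'cost'])

# group_keys=False stitches the per-group slices back together
# under the original index instead of adding a 'cost' index level
out = df.groupby('cost', group_keys=False).apply(
    lambda g: g.iloc[1:] if len(g) > 1 else g)
print(out)
```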
You can use the following code:
# Import pandas library
import pandas as pd

# initialize a list of lists so we can create duplicate rows
data = [['tom', 10], ['nick', 15], ['juli', 14], ['nick', 15], ['julia', 140],
        ['tom', 10], ['tom', 10], ['tom', 10]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

# The logic begins here: collect every row that appears
# more than once, keeping a single copy of each
colnames = list(df.columns)
listdf = df.values.tolist()
temp = []
for row in listdf:
    if listdf.count(row) > 1 and row not in temp:
        temp.append(row)

df = pd.DataFrame(temp, columns=colnames)
print("dataframe with only duplicates")
print(df)
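Note that the loop above keeps one copy of each row that appears more than once, i.e. it deduplicates on whole rows rather than on a single column. For reference, a sketch of the same result with pandas built-ins (duplicated and drop_duplicates; the variable name dupes is mine):

```python
import pandas as pd

data = [['tom', 10], ['nick', 15], ['juli', 14], ['nick', 15], ['julia', 140],
        ['tom', 10], ['tom', 10], ['tom', 10]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# keep=False marks every member of a fully duplicated row group,
# then drop_duplicates keeps a single representative of each
dupes = df[df.duplicated(keep=False)].drop_duplicates().reset_index(drop=True)
print(dupes)
```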