Remove duplicates from a pandas DataFrame
I am trying to use the DataFrame.drop_duplicates parameters, but without luck: the duplicates are not being removed.
I want to remove duplicates based on the column "inc_id". If duplicates are found in that column, only the last row should be kept.
My df is:
    inc_id  inc_cr_date
0  1049670          121
1  1049670           55
2  1049667          121
3  1049640           89
4  1049666           12
5  1049666           25
Output should be:
    inc_id  inc_cr_date
0  1049670           55
1  1049667          121
2  1049640           89
3  1049666           25
Code is:
df = df.drop_duplicates(subset='inc_id', keep="last")
Any idea what I am missing here? Thanks.
I think you are just looking to drop the original index:
In [11]: df.drop_duplicates(subset='inc_id', keep="last").reset_index(drop=True)
Out[11]:
    inc_id  inc_cr_date
0  1049670           55
1  1049667          121
2  1049640           89
3  1049666           25
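
To make the point about the index concrete, here is a minimal sketch that rebuilds the question's frame by hand (the names deduped and result are only illustrative). It suggests that drop_duplicates does remove the rows, but the survivors keep their original index labels, which can make it look as if nothing changed until reset_index(drop=True) renumbers them.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
df = pd.DataFrame({
    "inc_id": [1049670, 1049670, 1049667, 1049640, 1049666, 1049666],
    "inc_cr_date": [121, 55, 121, 89, 12, 25],
})

# Keep only the last row for each inc_id; original index labels are preserved.
deduped = df.drop_duplicates(subset="inc_id", keep="last")
print(deduped.index.tolist())  # [1, 2, 3, 5]

# Renumber the rows from 0 and discard the old labels.
result = deduped.reset_index(drop=True)
print(result)
```

If the printed index still shows gaps such as 1, 2, 3, 5, the duplicates are already gone and only the labels need resetting.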
For a DataFrame df, duplicate rows can be dropped using this code.
import pandas as pd

df = pd.read_csv('./data/data-set.csv')
print(df['text'])

def clean_data(dataframe):
    # Drop duplicate rows in place, keeping the first occurrence of each 'text' value
    dataframe.drop_duplicates(subset='text', inplace=True)

clean_data(df)
print(df['text'])
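
As an alternative to inplace=True, a helper can return a new frame and leave the caller's data untouched. The sketch below assumes a small made-up 'text' column instead of the CSV file, and the function name drop_duplicate_text is hypothetical.

```python
import pandas as pd

def drop_duplicate_text(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without duplicate 'text' values, renumbered from 0."""
    # keep defaults to "first", so the earliest occurrence of each value survives.
    return dataframe.drop_duplicates(subset="text").reset_index(drop=True)

# Made-up sample data standing in for the CSV file.
sample = pd.DataFrame({"text": ["a", "b", "a", "c", "b"]})
cleaned = drop_duplicate_text(sample)

print(cleaned["text"].tolist())  # ['a', 'b', 'c']
print(len(sample))               # 5, the original frame is unchanged
```

Returning a new frame avoids the common slip of writing df = df.drop_duplicates(..., inplace=True), which assigns None because the in-place call returns nothing.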
df.drop_duplicates(subset='inc_id', keep="last").reset_index(drop=True)