
Can't remove duplicates from DataFrame with drop_duplicates

I am using a pandas DataFrame in Python.

The DataFrame I will be referring to was created as follows:

from pandas import DataFrame, merge

search = DataFrame([[262, 'ny', '20'], [515, 'paris', '19'], [669, 'ldn', '10'], [669, 'ldn', 10], [669, 'ldn', 5]], columns=['subscriber_id', 'location', 'radius'])

title = DataFrame([[262, 'director'], [515, 'artist'], [669, 'scientist']], columns=['subscriber_id', 'title'])

The title and search DataFrames are then merged.

mergedTable = merge(title, search, on='subscriber_id', how='outer')

This produces the following DataFrame:

   subscriber_id      title location radius
0            262   director       ny     20
1            515     artist    paris     19
2            669  scientist      ldn     10
3            669  scientist      ldn     10
4            669  scientist      ldn      5

As we can see, the merge worked, so a subscriber now has one row per search.

I do not want to drop subscribers that have multiple rows with different values, but I do want to remove rows that are exact duplicates.

This is the desired final result:

   subscriber_id      title location radius
0            262   director       ny     20
1            515     artist    paris     19
2            669  scientist      ldn     10
4            669  scientist      ldn      5

Row 3, a duplicate of row 2, is removed.

I have been researching this, and it seems that drop_duplicates() should work, i.e.

mergedTable.drop_duplicates()

But this doesn't work: no rows are removed. Any tips or solutions?

Your radius column has dtype object because some of its values are strings, e.g. [669,'ldn','10'], and '10' != 10. Since the string '10' and the integer 10 are different values, rows 2 and 3 are not considered duplicates. Converting to integer will do the trick:

>>> mergedTable.radius = mergedTable.radius.astype(int)
>>> mergedTable.drop_duplicates()
   subscriber_id      title location  radius
0            262   director       ny      20
1            515     artist    paris      19
2            669  scientist      ldn      10
4            669  scientist      ldn       5
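
For anyone who wants to confirm the diagnosis before converting, here is a minimal sketch. It rebuilds the frames from the question, inspects the radius column, and then converts it. The names merged and deduped are illustrative, and pd.to_numeric is used here as an alternative to astype(int); it is not what the answer above uses.

```python
import pandas as pd

# Rebuild the two frames from the question.
search = pd.DataFrame(
    [[262, 'ny', '20'], [515, 'paris', '19'], [669, 'ldn', '10'],
     [669, 'ldn', 10], [669, 'ldn', 5]],
    columns=['subscriber_id', 'location', 'radius'])
title = pd.DataFrame(
    [[262, 'director'], [515, 'artist'], [669, 'scientist']],
    columns=['subscriber_id', 'title'])
merged = pd.merge(title, search, on='subscriber_id', how='outer')

# A column holding both str and int values gets the generic object dtype.
print(merged['radius'].dtype)
# Count the Python types actually stored in the column.
print(merged['radius'].map(type).value_counts())

# pd.to_numeric parses the numeric strings and leaves real numbers as they are.
merged['radius'] = pd.to_numeric(merged['radius'])

# drop_duplicates returns a new DataFrame, so keep the result.
deduped = merged.drop_duplicates()
print(deduped)
```

Note that drop_duplicates does not modify the frame in place by default, so the result has to be assigned (or inplace=True passed) for the change to persist.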
