
Pandas de-duplication and return list of indexes that were duplicates

I have a pandas dataframe with 500k rows, structured like this, where the document column contains strings:

   document_id                                           document
0            0                               Here is our forecast
1            1  Traveling to have a business meeting takes the...
2            2                      test successful. way to go!!!
3            3  Randy, Can you send me a schedule of the salar...
4            4                  Let's shoot for Tuesday at 11:45.

When I de-dupe the dataframe based on the contents of the document column using df.drop_duplicates(subset='document'), I end up with half the number of documents.
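As a rough illustration of that step (a minimal sketch, with a small made-up frame standing in for the 500k-row one):

import pandas as pd

# hypothetical stand-in for the original 500k-row dataframe
df = pd.DataFrame({
    'document_id': [0, 1, 2, 3, 4],
    'document': ['a', 'b', 'a', 'c', 'b'],
})

unique_docs = df.drop_duplicates(subset='document')  # keeps the first occurrence of each document
print(len(df), len(unique_docs))  # 5 3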

Now that I have my original dataframe and a second dataframe with the unique set of document values, I would like to compare the two to get a list of document_ids that are duplicates.
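(Continuing the sketch above, one way to make that comparison is to check which rows of the original frame are missing from the de-duplicated one; unique_docs is the hypothetical name used there:)

# document_ids of rows that drop_duplicates removed
dropped_ids = df.loc[~df.index.isin(unique_docs.index), 'document_id'].tolist()
print(dropped_ids)  # [2, 4]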

For example, if the associated documents for document_id 4, 93, and 275 are all 'Let's shoot for Tuesday at 11:45.', then how do I get a dataframe with the document in one column, and a list of the associated duplicate document_ids in another column?

     document_ids                                           document    
        ...
4    [4, 93, 275]                  Let's shoot for Tuesday at 11:45.

I know that I could use a for loop, comparing each document with every other document in the dataframe and saving all matches, but I am trying to avoid iterating over 500k rows multiple times. What instead is the most pythonic way of going about this?

I would like to compare the two to get a list of document_ids that are duplicates.

You should be able to do this using your "initial" DataFrame with .duplicated(keep=False). Here's an example:

In [1]: import pandas as pd                                                                                                                                   

In [2]: df = pd.DataFrame({ 
   ...:     'document_id': range(10), 
   ...:     'document': list('abcabcdedb') # note: 'e' is not duplicated
   ...: })

In [3]: dupes = df.document.duplicated(keep=False)                                                                                                            
In [4]: df.loc[dupes].groupby('document')['document_id'].apply(list).reset_index()                                                                           
Out[4]: 
  document document_id
0        a      [0, 3]
1        b   [1, 4, 9]
2        c      [2, 5]
3        d      [6, 8]
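If you want the column named document_ids to match the desired output above, a small follow-up (continuing the same session; the rename is purely cosmetic) could be:

In [5]: out = df.loc[dupes].groupby('document')['document_id'].apply(list).reset_index()

In [6]: out.rename(columns={'document_id': 'document_ids'})
Out[6]: 
  document document_ids
0        a       [0, 3]
1        b    [1, 4, 9]
2        c       [2, 5]
3        d       [6, 8]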
