Pandas de-duplication and return list of indexes that were duplicates
I have a pandas dataframe with 500k rows, structured like this, where the document column contains strings:
document_id document
0 0 Here is our forecast
1 1 Traveling to have a business meeting takes the...
2 2 test successful. way to go!!!
3 3 Randy, Can you send me a schedule of the salar...
4 4 Let's shoot for Tuesday at 11:45.
When I de-dupe the dataframe based on the contents of the document column using df.drop_duplicates(subset='document'), I end up with half the number of documents.
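As a minimal sketch of what that de-dupe step does (using hypothetical toy data in place of the real 500k-row dataframe), drop_duplicates keeps only the first row for each distinct document value:

```python
import pandas as pd

# Hypothetical toy data standing in for the 500k-row dataframe
df = pd.DataFrame({
    'document_id': range(5),
    'document': ['x', 'y', 'x', 'z', 'y'],
})

# drop_duplicates keeps the first occurrence of each document value,
# so the duplicate rows (document_id 2 and 4 here) are dropped
unique_df = df.drop_duplicates(subset='document')
```

Note that this tells you which rows survived, but not which document_id's were collapsed together, which is the question below.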
Now that I have my original dataframe and a second dataframe with the unique set of document values, I would like to compare the two to get a list of document_id's that are duplicates.
For example, if the associated document for document_id 4, 93, and 275 is 'Let's shoot for Tuesday at 11:45.', then how do I get a dataframe with document in one column, and a list of the associated duplicate document_id's in another column?
document_ids document
...
4 [4, 93, 275] Let's shoot for Tuesday at 11:45.
I know that I could use a for loop and compare each document to every other document in the dataframe, saving all matches, but I am trying to avoid iterating over 500k rows multiple times. What instead is the most pythonic way of going about this?
I would like to compare the two to get a list of document_id's that are duplicates.
You should be able to do this using your "initial" DataFrame with .duplicated(keep=False). Here's an example:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({
...: 'document_id': range(10),
   ...:     'document': list('abcabcdedb')  # note: 'e' is not duplicated
...: })
In [3]: dupes = df.document.duplicated(keep=False)
In [4]: df.loc[dupes].groupby('document')['document_id'].apply(list).reset_index()
Out[4]:
document document_id
0 a [0, 3]
1 b [1, 4, 9]
2 c [2, 5]
3 d [6, 8]
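As a small variant (my own sketch, not part of the answer above), you can skip the boolean mask entirely: group the full dataframe by document, collect the document_id's into lists, and keep only the groups with more than one id:

```python
import pandas as pd

df = pd.DataFrame({
    'document_id': range(10),
    'document': list('abcabcdedb'),  # 'e' appears only once
})

# Collect all document_ids per document value...
out = (df.groupby('document')['document_id']
         .apply(list)
         .reset_index(name='document_ids'))

# ...then keep only documents that occur more than once
dupes_only = out[out['document_ids'].str.len() > 1].reset_index(drop=True)
```

Both approaches make a single pass over the data via groupby; filtering with duplicated(keep=False) first just avoids building list groups for documents that occur only once.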