
Finding non-unique rows in Pandas Dataframe

Say I have a pandas dataframe like this:

Doctor  Patient  Days
Aaron   Jeff     23
Aaron   Josh     46
Aaron   Josh     71
Jess    Manny    55
Jess    Manny    85
Jess    Manny    46

I want to extract dataframes where a combination of a doctor and a patient occurs more than once. I will be doing further work on the extracted dataframes.

So, for instance, in this example, the dataframe

Doctor  Patient  Days
Aaron   Josh     46
Aaron   Josh     71

would be extracted, and the dataframe

Doctor  Patient  Days
Jess    Manny    55
Jess    Manny    85
Jess    Manny    46

would be extracted.

In accordance with my condition, the dataframe

Doctor  Patient  Days
Aaron   Jeff     23

will not be extracted, because the combination of Aaron and Jeff occurs only once.

Now, I have a dataframe that has 400,000 rows, and the code I have written so far is, I think, inefficient at producing the dataframes that I want. Here is the code:

    doctors = list(df_1.Doctor.unique()) # df_1 being the dataframe with 400K rows
    for doctor in doctors:
        df_2 = df_1[df_1['Doctor'] == doctor] # extract one sub-dataframe per doctor
        patients = list(df_2.Patient.unique())
        for patient in patients:
            df_3 = df_2[df_2['Patient'] == patient] # extract one sub-sub-dataframe per doctor and patient
            if len(df_3) >= 2:
                # do something

As you can see, this is already verging on O(n^2) runtime (I say "verging" because there are not 400K unique values in each column). Is there a way to minimize the runtime? If so, how can my code be improved?

Thanks!

Umesh

You may check with groupby:

d = {x : y  for x, y in df.groupby(['Doctor','Patient']) if len(y) > 1}
d
Out[36]: 
{('Aaron', 'Josh'):   Doctor Patient  Days
 1  Aaron    Josh    46
 2  Aaron    Josh    71, ('Jess', 'Manny'):   Doctor Patient  Days
 3   Jess   Manny    55
 4   Jess   Manny    85
 5   Jess   Manny    46}
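If a single filtered dataframe is enough rather than a dict of per-pair frames, the same grouping idea can be expressed with GroupBy.filter, which drops the singleton pairs in one pass. A sketch on the sample data from the question:

```python
import pandas as pd

# Sample data matching the question
df = pd.DataFrame({
    "Doctor":  ["Aaron", "Aaron", "Aaron", "Jess", "Jess", "Jess"],
    "Patient": ["Jeff", "Josh", "Josh", "Manny", "Manny", "Manny"],
    "Days":    [23, 46, 71, 55, 85, 46],
})

# Keep only rows whose (Doctor, Patient) pair appears more than once
repeated = df.groupby(["Doctor", "Patient"]).filter(lambda g: len(g) > 1)
print(repeated)
```

Note that filter calls the lambda once per group, so for very many distinct pairs the vectorized duplicated/transform approaches below tend to be faster.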

You can use pd.DataFrame.duplicated like so: df.loc[df.duplicated()].

This selects rows where all values are duplicated; to check only specific columns, set the subset parameter. Note that with the default keep='first', the first occurrence of each pair is not marked, so pass keep=False to get every row of a repeated pair (the column names here match the question's data):

rows = df.loc[df.duplicated(subset=['Doctor', 'Patient'], keep=False)]
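A runnable sketch of the duplicated approach on the sample data from the question; keep=False marks every member of a repeated (Doctor, Patient) pair, not just the later occurrences:

```python
import pandas as pd

# Sample data matching the question
df = pd.DataFrame({
    "Doctor":  ["Aaron", "Aaron", "Aaron", "Jess", "Jess", "Jess"],
    "Patient": ["Jeff", "Josh", "Josh", "Manny", "Manny", "Manny"],
    "Days":    [23, 46, 71, 55, 85, 46],
})

# Boolean mask: True for every row of a (Doctor, Patient) pair seen more than once
mask = df.duplicated(subset=["Doctor", "Patient"], keep=False)
rows = df.loc[mask]
print(rows)
```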

Here is one way to do it:

df2 = (df.groupby(['Doctor','Patient'])['Days'].count() > 1).reset_index()
df2 = df2.drop(df2[df2['Days']==False].index)
df.merge(df2, on=['Doctor','Patient'], suffixes=('','_y')).drop(columns='Days_y')
    Doctor  Patient     Days
0   Aaron   Josh        46
1   Aaron   Josh        71
2   Jess    Manny       55
3   Jess    Manny       85
4   Jess    Manny       46
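The count-then-merge steps above can also be collapsed into a single boolean mask with transform('size'), which broadcasts each group's size back onto the original rows. A sketch, assuming the same column names as the question:

```python
import pandas as pd

# Sample data matching the question
df = pd.DataFrame({
    "Doctor":  ["Aaron", "Aaron", "Aaron", "Jess", "Jess", "Jess"],
    "Patient": ["Jeff", "Josh", "Josh", "Manny", "Manny", "Manny"],
    "Days":    [23, 46, 71, 55, 85, 46],
})

# transform('size') gives every row the size of its (Doctor, Patient) group
sizes = df.groupby(["Doctor", "Patient"])["Days"].transform("size")
result = df[sizes > 1]
print(result)
```

This avoids the intermediate df2 and the merge entirely, and keeps the original index.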
