Finding non-unique rows in Pandas DataFrame
Say I have a pandas dataframe like this:
Doctor | Patient | Days
---|---|---
Aaron | Jeff | 23
Aaron | Josh | 46
Aaron | Josh | 71
Jess | Manny | 55
Jess | Manny | 85
Jess | Manny | 46
I want to extract dataframes where a combination of a doctor and a patient occurs more than once. I will be doing further work on the extracted dataframes.
So, for instance, in this example, the dataframe
Doctor | Patient | Days
---|---|---
Aaron | Josh | 46
Aaron | Josh | 71
would be extracted, AND the dataframe
Doctor | Patient | Days
---|---|---
Jess | Manny | 55
Jess | Manny | 85
Jess | Manny | 46
would be extracted.
In accordance with my condition, the dataframe
Doctor | Patient | Days
---|---|---
Aaron | Jeff | 23
will not be extracted, because the combination of Aaron and Jeff occurs only once.
Now, I have a dataframe that has 400,000 rows, and the code I have written so far is, I think, inefficient at extracting the dataframes that I want. Here is the code:
doctors = list(df_1.Doctor.unique())  # df_1 being the dataframe with 400K rows
for doctor in doctors:
    df_2 = df_1[df_1['Doctor'] == doctor]  # extract one sub-dataframe per doctor
    patients = list(df_2.Patient.unique())
    for patient in patients:
        df_3 = df_2[df_2['Patient'] == patient]  # extract one sub-sub-dataframe per doctor and patient
        if len(df_3) >= 2:
            pass  # do something
As you can see, this is already verging on O(n^2) runtime (I say verging because there are not 400K unique values in each column). Is there a way to minimize the runtime? If so, how can my code be improved?
Thanks!
Umesh乌梅什
You may check with groupby:
d = {x : y for x, y in df.groupby(['Doctor','Patient']) if len(y) > 1}
d
Out[36]:
{('Aaron', 'Josh'): Doctor Patient Days
1 Aaron Josh 46
2 Aaron Josh 71, ('Jess', 'Manny'): Doctor Patient Days
3 Jess Manny 55
4 Jess Manny 85
5 Jess Manny 46}
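A quick self-contained check of this groupby approach (a sketch, rebuilding the question's sample data by hand):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Doctor':  ['Aaron', 'Aaron', 'Aaron', 'Jess', 'Jess', 'Jess'],
    'Patient': ['Jeff', 'Josh', 'Josh', 'Manny', 'Manny', 'Manny'],
    'Days':    [23, 46, 71, 55, 85, 46],
})

# One sub-dataframe per (Doctor, Patient) pair that occurs more than once
d = {pair: group for pair, group in df.groupby(['Doctor', 'Patient']) if len(group) > 1}

print(sorted(d))  # [('Aaron', 'Josh'), ('Jess', 'Manny')]
```

This does a single pass over the grouped data instead of slicing the full dataframe once per doctor and once per patient, and each value in the dict is already the sub-dataframe you would do further work on.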
You can use pd.DataFrame.duplicated, like so: df.loc[df.duplicated()]. This selects rows where all values are duplicated; to check only specific columns, set the subset parameter (note the question's columns are capitalized):

rows = df.loc[df.duplicated(subset=['Doctor', 'Patient'])]

Note that with the default keep='first', the first occurrence of each repeated pair is not marked as a duplicate; pass keep=False if you want every row of each repeated pair.
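To illustrate the difference between the two keep settings (a sketch, assuming the question's sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'Doctor':  ['Aaron', 'Aaron', 'Aaron', 'Jess', 'Jess', 'Jess'],
    'Patient': ['Jeff', 'Josh', 'Josh', 'Manny', 'Manny', 'Manny'],
    'Days':    [23, 46, 71, 55, 85, 46],
})

# Default keep='first': only the second and later occurrences are flagged
later = df.loc[df.duplicated(subset=['Doctor', 'Patient'])]

# keep=False: every row of a repeated (Doctor, Patient) pair is flagged
all_rows = df.loc[df.duplicated(subset=['Doctor', 'Patient'], keep=False)]

print(len(later))     # 3
print(len(all_rows))  # 5
```

With keep=False you get all five rows of the two repeated pairs, which matches the dataframes the question wants to extract.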
Here is one way to do it:
df2 = (df.groupby(['Doctor','Patient'])['Days'].count() > 1).reset_index()
df2 = df2.drop(df2[df2['Days']==False].index)
df.merge(df2, on=['Doctor','Patient'], suffixes=('','_y')).drop(columns='Days_y')
Doctor Patient Days
0 Aaron Josh 46
1 Aaron Josh 71
2 Jess Manny 55
3 Jess Manny 85
4 Jess Manny 46
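As a side note, the same result as the count-and-merge above can be obtained in one step with groupby(...).filter (a sketch, not part of the answer itself, assuming the question's sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'Doctor':  ['Aaron', 'Aaron', 'Aaron', 'Jess', 'Jess', 'Jess'],
    'Patient': ['Jeff', 'Josh', 'Josh', 'Manny', 'Manny', 'Manny'],
    'Days':    [23, 46, 71, 55, 85, 46],
})

# Keep only the rows belonging to (Doctor, Patient) groups with more than one row
out = df.groupby(['Doctor', 'Patient']).filter(lambda g: len(g) > 1)

print(out['Doctor'].tolist())  # ['Aaron', 'Aaron', 'Jess', 'Jess', 'Jess']
```

filter avoids building the intermediate count dataframe and the merge, at the cost of calling the predicate once per group.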