
How to use distinct and where clause together in Pandas?

I have a dataframe and a list, as given below:

import pandas as pd

op1 = pd.DataFrame({
    'subject_id': [1, 1, 2, 3, 4, 4, 5],
    'iid': [21, 22, 23, 24, 26, 26, 27],
    'los': [121, 122, 123, 124, 111, 111, 131],
    'area': ['a', 'a', 'b', 'c', 'd', 'd', 'f'],
    'date': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/6/2017', '1/6/2017', '1/8/2109'],
    'val': [5, 10, 5, 16, 26, 26, 7]
})

sub_list = [1,2,3,4]

I would like to check whether each subject_id from sub_list is present in op1. If it is, I want the distinct values of the columns los, iid, and area for that subject_id (compare subject_id 1 with subject_id 4, which has duplicate rows).

I tried the following, but couldn't get distinct records over multiple columns:

op1[op1['subject_id'].isin(sub_list)] # how to use distinct records here?

I have to apply this to a million records, so an elegant and efficient solution would be helpful.

I am looking for something like this:

select distinct subject_id, iid,los, area from op1
where subject_id in [sub_list] 

I expect my output to be as shown below:

[image: expected output]

If the intent is to return only the selected columns, do the following:

result = op1.loc[op1["subject_id"].isin(sub_list), ["subject_id", "los", "iid", "area"]].drop_duplicates()
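As a quick check, here is a minimal, self-contained sketch of that one-liner run against the sample data from the question. The boolean mask plays the role of the SQL `where`, the column list plays the role of the `select` list, and `drop_duplicates()` plays the role of `distinct`.

```python
import pandas as pd

# Sample data copied from the question.
op1 = pd.DataFrame({
    'subject_id': [1, 1, 2, 3, 4, 4, 5],
    'iid': [21, 22, 23, 24, 26, 26, 27],
    'los': [121, 122, 123, 124, 111, 111, 131],
    'area': ['a', 'a', 'b', 'c', 'd', 'd', 'f'],
    'date': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017',
             '1/6/2017', '1/6/2017', '1/8/2109'],
    'val': [5, 10, 5, 16, 26, 26, 7],
})
sub_list = [1, 2, 3, 4]

# "where subject_id in sub_list" -> boolean mask built with isin()
mask = op1['subject_id'].isin(sub_list)

# "select distinct subject_id, los, iid, area" -> column list + drop_duplicates()
result = op1.loc[mask, ['subject_id', 'los', 'iid', 'area']].drop_duplicates()

print(result)
```

On this data, subject_id 1 keeps both of its rows because they differ in iid and los, while the two identical rows for subject_id 4 collapse into one; subject_id 5 is filtered out.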

I'm not sure how fast this is, but you can try:

(op1[['subject_id','iid','los','area']]
     .drop_duplicates(['subject_id','iid','los','area'])
     .set_index('subject_id')
     .loc[sub_list]
)
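One caveat with the `set_index(...).loc[sub_list]` variant: in recent pandas versions (1.0 and later, as far as I know), `.loc` with a list raises a `KeyError` if any label in the list is missing from the index. The sketch below assumes a hypothetical id `99` that is not in op1, and shows `index.isin` as a tolerant alternative.

```python
import pandas as pd

# Sample data from the question, reduced to the columns of interest.
op1 = pd.DataFrame({
    'subject_id': [1, 1, 2, 3, 4, 4, 5],
    'iid': [21, 22, 23, 24, 26, 26, 27],
    'los': [121, 122, 123, 124, 111, 111, 131],
    'area': ['a', 'a', 'b', 'c', 'd', 'd', 'f'],
})

# 99 is a hypothetical id, added here only to show the failure mode.
sub_list = [1, 2, 3, 4, 99]

deduped = op1.drop_duplicates().set_index('subject_id')

# .loc with a list is strict: a missing label raises KeyError.
try:
    deduped.loc[sub_list]
    loc_failed = False
except KeyError:
    loc_failed = True

# index.isin() simply ignores ids that are not present.
safe = deduped[deduped.index.isin(sub_list)]

print(loc_failed)
print(safe)
```

If every id in sub_list is guaranteed to exist, as in the question's example, the original `.loc[sub_list]` form works as written.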
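
# list_columns_to_distinct is a placeholder for the list of columns to deduplicate on,
# e.g. ['subject_id', 'iid', 'los', 'area']
op1[op1['subject_id'].isin(sub_list)].drop_duplicates(subset=list_columns_to_distinct)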

It's actually a mix of the previous answers:

distCols = ["subject_id", "iid",
            "los", "area"]

op1[op1['subject_id'].isin(sub_list)].drop_duplicates(distCols)
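Note that passing a column list to `drop_duplicates` only controls which columns are compared; the result still carries every column of op1 (here also date and val). A small sketch of the difference, using the question's data; selecting `distCols` afterwards is my addition, to match the SQL `select` list.

```python
import pandas as pd

# Sample data copied from the question.
op1 = pd.DataFrame({
    'subject_id': [1, 1, 2, 3, 4, 4, 5],
    'iid': [21, 22, 23, 24, 26, 26, 27],
    'los': [121, 122, 123, 124, 111, 111, 131],
    'area': ['a', 'a', 'b', 'c', 'd', 'd', 'f'],
    'date': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017',
             '1/6/2017', '1/6/2017', '1/8/2109'],
    'val': [5, 10, 5, 16, 26, 26, 7],
})
sub_list = [1, 2, 3, 4]
distCols = ['subject_id', 'iid', 'los', 'area']

# Deduplicate on distCols only; all six columns are kept,
# and the first row of each duplicate group survives (keep='first' is the default).
all_cols = op1[op1['subject_id'].isin(sub_list)].drop_duplicates(subset=distCols)

# Restrict to the distinct columns to mirror "select distinct subject_id, iid, los, area".
only_dist = all_cols[distCols]

print(all_cols.shape)
print(only_dist.shape)
```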
