I have a DataFrame and a list, as given below:
import pandas as pd

op1 = pd.DataFrame({
'subject_id':[1,1,2,3,4,4,5],
'iid': [21,22,23,24,26,26,27],
'los':[121,122,123,124,111,111,131],
'area':['a','a','b','c','d','d','f'],
'date' : ['1/1/2017','1/2/2017','1/3/2017','1/4/2017','1/6/2017','1/6/2017','1/8/2109'],
'val' :[5,10,5,16,26,26,7]
})
sub_list = [1,2,3,4]
I would like to check whether the subject_id from sub_list is present in op1. If present, then get the distinct values from columns los, iid, area for that subject_id (look for the difference between subject_id 1 and subject_id 4, which has duplicates).
I tried the below, but couldn't work out how to get distinct records across multiple columns:
op1[op1['subject_id'].isin(sub_list)] # how to use distinct records here?
I have to apply this to a million records, so an elegant and efficient solution would be helpful.
I am looking for something like
select distinct subject_id, iid,los, area from op1
where subject_id in [sub_list]
I expect my output to be as shown below
If you only want to return the selected columns, do the following:
result = op1.loc[op1["subject_id"].isin(sub_list), ["subject_id", "los", "iid", "area"]].drop_duplicates()
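As a self-contained sketch of this filter-then-deduplicate approach, here it is run against the sample data from the question. The variable name cols is only introduced here for readability; it is not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
op1 = pd.DataFrame({
    'subject_id': [1, 1, 2, 3, 4, 4, 5],
    'iid': [21, 22, 23, 24, 26, 26, 27],
    'los': [121, 122, 123, 124, 111, 111, 131],
    'area': ['a', 'a', 'b', 'c', 'd', 'd', 'f'],
    'date': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017',
             '1/6/2017', '1/6/2017', '1/8/2109'],
    'val': [5, 10, 5, 16, 26, 26, 7],
})
sub_list = [1, 2, 3, 4]

# Columns to keep and to deduplicate on (name chosen for this sketch)
cols = ["subject_id", "los", "iid", "area"]

# Boolean mask selects the rows, the column list selects the columns,
# and drop_duplicates() removes rows that are identical across those columns
result = op1.loc[op1["subject_id"].isin(sub_list), cols].drop_duplicates()
print(result)
```

With this data, subject_id 1 keeps both of its rows because they differ in iid and los, while the two identical rows for subject_id 4 collapse into one.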
I'm not sure how fast this is, but you can try:
(op1[['subject_id','iid','los','area']]
.drop_duplicates(['subject_id','iid','los','area'])
.set_index('subject_id')
.loc[sub_list]
)
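One caveat with the label-based lookup: in recent pandas versions (1.0 and later, as far as I know), passing a list to .loc raises a KeyError if any of the labels is missing from the index. If sub_list may contain IDs that do not occur in op1, a membership test on the index avoids that. The sketch below assumes such a case; the extra ID 99 is a hypothetical value added purely for illustration.

```python
import pandas as pd

# Sample data from the question (only the columns needed here)
op1 = pd.DataFrame({
    'subject_id': [1, 1, 2, 3, 4, 4, 5],
    'iid': [21, 22, 23, 24, 26, 26, 27],
    'los': [121, 122, 123, 124, 111, 111, 131],
    'area': ['a', 'a', 'b', 'c', 'd', 'd', 'f'],
})

# 99 is a hypothetical ID that does not exist in op1
sub_list = [1, 2, 3, 4, 99]

# Deduplicate first, then index by subject_id as in the answer
distinct = (op1[['subject_id', 'iid', 'los', 'area']]
            .drop_duplicates()
            .set_index('subject_id'))

# index.isin() ignores IDs that are absent instead of raising an error
safe = distinct[distinct.index.isin(sub_list)]
print(safe)
```

Note that this returns rows in their original order rather than in the order of sub_list, which is a difference from .loc[sub_list].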
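op1[op1['subject_id'].isin(sub_list)].drop_duplicates(subset=list_columns_to_distinct)
Here list_columns_to_distinct stands for the list of columns to deduplicate on, for example ['subject_id', 'iid', 'los', 'area']. It's actually a mix of the previous answers. Note that this keeps all columns of op1 in the result, not only the ones used for deduplication.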
distCols = ["subject_id", "iid", "los", "area"]
op1[op1['subject_id'].isin(sub_list)].drop_duplicates(distCols)