How to group by and filter multiple strings with a Pandas DataFrame?
I'm a beginner at coding and I've been looking for an answer for a few days without success, so sorry in advance if this is easy or already answered somewhere.

Let's say I have a df1 with the columns series_id and lesion_name, and I would like to obtain a df2 with the columns series_id and lung_ref_seg. In df1 (left in the picture) I can have several lesion_name values for the same series_id. In df2 (right in the picture) I would like to have only one lesion_name per series_id (picture: original df1 and wanted df2).

If a series_id has a lesion_name that contains 'lung' and 'PredCorr', I take it as the first choice for lung_ref_seg. If no lesion_name contains 'lung' and 'PredCorr' but one contains 'lung' and 'From', I take that as the second choice. If neither of the first two choices exists, I take the lesion_name that contains 'Pred' and 'lung' as the third choice. (lesion_name can have NaN values and I want to keep them in lung_ref_seg.)

I've tried a lot of things (groupby, filter, str.contains, isin, lambda row...), so I'll only post the one attempt that I think is closest to the solution:
lesion_name = test['lesion_name']
series_id = test['series_id']

def LungSegRef(lesion_name):
    for rows in series_id:
        if 'PredCorr' in lesion_name and 'lung' in lesion_name:
            return lesion_name
        elif 'PredCorr' not in lesion_name and 'From' in lesion_name and 'lung' in lesion_name:
            return lesion_name
        elif 'PredCorr' not in lesion_name and 'From' not in lesion_name and 'Pred' in lesion_name and 'lung' in lesion_name:
            return lesion_name
    return ''

# Apply the function LungSegRef
test['lung_ref_seg'] = test['lesion_name'].apply(LungSegRef)
With this I don't get any errors, but the whole lung_ref_seg column contains only NaN values and I still have several rows with the same series_id. So I guess we could use groupby("series_id") somewhere, and maybe the argument of my function is wrong. Thank you very much for your help!
The rows are not duplicates. I found something that should work, but I'm struggling to tell apart values that contain Pred from values that contain PredCorr: str.contains makes no difference between rows containing only Pred and rows containing PredCorr. With this code I can't use startswith() and endswith(). I'm looking for a regex solution, but so far I haven't found a way to distinguish rows that contain only Pred and lung from rows that contain PredCorr and lung.
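One possible way to separate the two cases, as a minimal sketch: the negative lookahead `Pred(?!Corr)` matches `Pred` only when it is not immediately followed by `Corr`. The sample names below are made up for illustration, since the real lesion_name format is not shown in the question.

```python
import pandas as pd

# Hypothetical lesion names; the real naming scheme may differ.
names = pd.Series(["lung_PredCorr", "lung_Pred", "lung_From", None])

# A plain "Pred" would match both PredCorr and Pred, so test PredCorr explicitly...
has_predcorr = names.str.contains("PredCorr", na=False)
# ...and use a negative lookahead for "Pred not followed by Corr".
has_pred_only = names.str.contains(r"Pred(?!Corr)", na=False)

print(has_predcorr.tolist())   # [True, False, False, False]
print(has_pred_only.tolist())  # [False, True, False, False]
```

Here `na=False` treats missing names as non-matching; use `na=True` instead if NaN rows should pass the filter.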
def select_row2(row2):
    if row2.lesion_name.str.contains("Pred" and "lung" and "Corr" and "From",na=True).any():
        return row2[row2.lesion_name.str.contains("Corr" and "lung",na=True)]
    elif row2.lesion_name.str.contains("Pred" and "lung" and "Corr",na=True).any():
        return row2[row2.lesion_name.str.contains("Corr" and "lung",na=True)]
    elif row2.lesion_name.str.contains("lung" and "Corr" and "From",na=True).any():
        return row2[row2.lesion_name.str.contains("Corr" and "lung",na=True)]
    elif row2.lesion_name.str.contains("Pred" and "lung" and "From",na=True).any():
        return row2[row2.lesion_name.str.contains("From" and "lung",na=True)]
    elif row2.lesion_name.str.contains("Pred" and "lung",na=True).any():
        return row2[row2.lesion_name.str.contains("Pred" and "lung",na=True)]
    elif row2.lesion_name.str.contains("lung" and "Corr",na=True).any():
        return row2[row2.lesion_name.str.contains("lung" and "Corr",na=True)]
    elif row2.lesion_name.str.contains("lung" and "From",na=True).any():
        return row2[row2.lesion_name.str.contains("lung" and "From",na=True)]
    else:
        return None

test = test.groupby("series_id").apply(select_row2).reset_index(drop=True)
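A note on why the branches above do not test what they appear to: Python evaluates `"Pred" and "lung"` before `str.contains` ever sees it, and the result is just the last string, `"lung"`. To require several substrings at once, one option is to build a boolean mask per substring and combine the masks with `&`. A small sketch on made-up names:

```python
import pandas as pd

# `and` between two non-empty strings returns the second one.
pattern = "Pred" and "lung"
print(pattern)  # lung

# Hypothetical lesion names, for illustration only.
names = pd.Series(["lung_Pred", "tum_Pred", "lung_From"])

# One mask per substring, combined element-wise with &.
has_both = (
    names.str.contains("Pred", na=False)
    & names.str.contains("lung", na=False)
)
print(has_both.tolist())  # [True, False, False]
```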
I think you can approach this as a two-step process:
1. Filter the rows, as you mention. A boolean mask built with str.contains works: df = df[df['lesion_name'].str.contains('STRING TO KEEP', na=False)]. (DataFrame.filter selects by row or column label rather than by cell content, so it is not the right tool here.)
2. Drop the duplicates with df.drop_duplicates(subset=['series_id']).
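The two steps can be combined into a small sketch: give each row a priority (0 for lung + PredCorr, 1 for lung + From, 2 for lung + Pred), sort by that priority, and keep the first row per series_id. The sample data and the helper column name `priority` are made up for illustration. Rows whose lesion_name is NaN get the lowest priority here, so they survive only when a series has nothing better; that is an assumption about the desired behavior, not something the question states.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the real names may follow a different scheme.
df1 = pd.DataFrame({
    "series_id": [1, 1, 1, 2, 2, 3, 4],
    "lesion_name": ["lung_Pred", "lung_PredCorr", "lung_From",
                    "lung_Pred", "lung_From", "lung_Pred", np.nan],
})

name = df1["lesion_name"]
lung = name.str.contains("lung", na=False)

# np.select uses the first condition that is True for each row,
# so PredCorr rows never fall through to the plain "Pred" test.
conditions = [
    lung & name.str.contains("PredCorr", na=False),
    lung & name.str.contains("From", na=False),
    lung & name.str.contains("Pred", na=False),
]
priority = np.select(conditions, [0, 1, 2], default=3)  # 3 = NaN or no match

df2 = (
    df1.assign(priority=priority)
       .sort_values(["series_id", "priority"])
       .drop_duplicates(subset=["series_id"])  # keeps the best-ranked row
       .drop(columns="priority")
       .rename(columns={"lesion_name": "lung_ref_seg"})
       .reset_index(drop=True)
)
print(df2)
```

With this sample data, series 1 keeps lung_PredCorr, series 2 keeps lung_From, series 3 keeps lung_Pred, and series 4 keeps its NaN.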
I found an answer that seems to work for now!
# Remove every lesion_name that contains "tum" before building lung_ref_seg (NaN values are kept)
test = test[~test.lesion_name.str.contains("tum", na=False)]

# Pick the lesion_name rows for one series_id, prioritizing PredCorr, then From, then Pred
def LungRefLesionName(row):
    # In Python, "a" or "b" evaluates to "a" and "a" and "b" evaluates to "b",
    # so each str.contains call only ever tested one pattern. That single
    # pattern is written out here. Raw strings keep \w as a regex token.
    if row.lesion_name.str.contains(r"Pred\w+", na=True).any():
        return row[row.lesion_name.str.contains(r"Pred\w+", na=True)]
    elif row.lesion_name.str.contains("From", na=True).any():
        return row[row.lesion_name.str.contains("From", na=True)]
    elif row.lesion_name.str.contains("lung", na=True).any():
        return row[row.lesion_name.str.contains("lung", na=True)]

# Apply the function to each series_id group
test = test.groupby("series_id").apply(LungRefLesionName).reset_index(drop=True)

# Drop the columns that are no longer needed: segmentation_id, lesion_id, series_id and study_id
test = test.drop(['segmentation_id', 'lesion_id', 'series_id', 'study_id'], axis=1)

# Rename the column lesion_name to lung_ref_lesion_name
test = test.rename(columns={"lesion_name": "lung_ref_lesion_name"})
I'll have to handle lesion_name values containing 'tum' later, so I guess I'll have to change some things, but for now this code works for manipulating the strings with 'lung'!