How to group by and filter multiple strings with a Pandas DataFrame?

I'm a beginner at coding. I've been looking for an answer for a few days without success, so apologies in advance if this is easy or already answered somewhere.

Let's say I have a df1 with the columns series_id and lesion_name, and I would like to obtain a df2 with the columns series_id and lung_ref_seg. df1 can have several lesion_name values for the same series_id (left in the picture). In df2 (right in the picture) I would like only one lesion_name per series_id. [picture: original df1 and df2 wanted]

If a series_id has a lesion_name containing both 'lung' and 'PredCorr', I take that as the first choice for lung_ref_seg. If no lesion_name contains 'lung' and 'PredCorr' but one contains 'lung' and 'From', I take that as the second choice. If neither of the first two choices is available, I take the lesion_name containing 'Pred' and 'lung' as the third choice. (lesion_name can have NaN values and I want to keep them in lung_ref_seg.)

I've tried a lot of things (groupby, filter, str.contains, isin, lambda row...), so I'll just post the one attempt that I think is closest to the solution:

lesion_name = test['lesion_name']
series_id = test['series_id']

def LungSegRef(lesion_name):
    for rows in series_id:
        if 'PredCorr' in lesion_name and 'lung' in lesion_name:
            return lesion_name
        elif 'PredCorr' not in lesion_name and 'From' in lesion_name and 'lung' in lesion_name:
            return lesion_name
        elif 'PredCorr' not in lesion_name and 'From' not in lesion_name and 'Pred' in lesion_name and 'lung' in lesion_name:
            return lesion_name
    return ''


# Apply the function RefLesionName 
test['lung_ref_seg'] = test['lesion_name'].apply(LungSegRef)

With this I don't get any errors, but the whole lung_ref_seg column contains only NaN values and I still have multiple rows with the same series_id. So I guess groupby("series_id") should be used somewhere, and maybe the argument of my function is wrong. Thank you very much for your help!

The rows are not duplicates. I found something that should work, but I'm struggling to distinguish values that contain Pred from those that contain PredCorr: str.contains makes no difference between rows containing only Pred and rows containing PredCorr. With this code I can't use startswith() and endswith(). I'm looking for a regex solution, but so far I haven't found a way to tell rows that contain only Pred and lung apart from rows that contain PredCorr and lung.

def select_row2(row2):
    if row2.lesion_name.str.contains("Pred" and "lung" and "Corr" and "From",na=True).any():
        return row2[row2.lesion_name.str.contains("Corr" and "lung",na=True)]
       
    elif row2.lesion_name.str.contains("Pred" and "lung" and "Corr",na=True).any():
        return row2[row2.lesion_name.str.contains("Corr" and "lung",na=True)]
    
    elif row2.lesion_name.str.contains("lung" and "Corr" and "From",na=True).any():
        return row2[row2.lesion_name.str.contains("Corr" and "lung",na=True)]
    
    elif row2.lesion_name.str.contains("Pred" and "lung" and "From",na=True).any():
        return row2[row2.lesion_name.str.contains("From" and "lung",na=True)]
    
    elif row2.lesion_name.str.contains("Pred" and "lung",na=True).any():
        return row2[row2.lesion_name.str.contains("Pred" and "lung",na=True)]
    
    elif row2.lesion_name.str.contains("lung" and "Corr",na=True).any():
        return row2[row2.lesion_name.str.contains("lung" and "Corr",na=True)]
    
    elif row2.lesion_name.str.contains("lung" and "From",na=True).any():
        return row2[row2.lesion_name.str.contains("lung" and "From",na=True)]
    
    else:
        return None 

test = test.groupby("series_id").apply(select_row2).reset_index(drop=True) 

I think you can approach this as a two-step process:

  1. First, filter down to the data that you want to keep (it seems you want specific values containing specific strings, but I'm a bit confused by your post here).
  2. Second, drop duplicates from the series_id column. This leaves you with just one value for each series_id.

Filtering can be done with boolean indexing and str.contains, for example df = df[df['lesion_name'].str.contains('STRING TO KEEP', na=False)]. (Note that df.filter, which you mention, selects rows or columns by label rather than by content.)
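As a minimal sketch of filtering on several substrings at once (the lesion names below are invented for illustration and are not taken from the original data): Python evaluates an expression such as "Pred" and "lung" to a single string before str.contains ever sees it, so each substring needs its own mask, combined element-wise with &. A negative lookahead is one possible way to match Pred without also matching PredCorr.

```python
import numpy as np
import pandas as pd

# Hypothetical lesion names, made up for this example
names = pd.Series(["lung_PredCorr", "lung_Pred", "lung_From", "tum_Pred", np.nan])

# `and` between two non-empty strings simply returns the second string
combined = "Pred" and "lung"  # this is just "lung"

# One boolean mask per substring; na=False treats NaN as "no match"
has_lung = names.str.contains("lung", na=False)
has_pred_corr = names.str.contains("PredCorr", na=False)
# Regex: "Pred" that is not immediately followed by "Corr"
has_pred_only = names.str.contains(r"Pred(?!Corr)", na=False)

# Combine masks element-wise with & instead of the Python keyword `and`
lung_pred_corr = names[has_lung & has_pred_corr]
lung_pred_only = names[has_lung & has_pred_only]
```

Here lung_pred_corr holds only the name containing both lung and PredCorr, and lung_pred_only holds only the name containing lung and a plain Pred.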

Dropping duplicates can be done with df.drop_duplicates(subset=['series_id']).
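The two steps could be sketched as follows, under some assumptions: the sample data, the helper name priority and the temporary column choice are all hypothetical and only serve the illustration. Each lesion_name gets a rank following the priorities described in the question, the rows are sorted by that rank, and drop_duplicates then keeps the best-ranked row per series_id.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for this example
df1 = pd.DataFrame({
    "series_id": [1, 1, 1, 2, 2, 3, 4],
    "lesion_name": ["lung_Pred", "lung_PredCorr", "tum_Pred",
                    "lung_From", "lung_Pred", "lung_Pred", np.nan],
})

def priority(name):
    """Return 0, 1 or 2 for the first, second or third choice; 3 otherwise (NaN included)."""
    if not isinstance(name, str) or "lung" not in name:
        return 3
    if "PredCorr" in name:
        return 0
    if "From" in name:
        return 1
    if "Pred" in name:  # PredCorr was already handled above
        return 2
    return 3

ranked = df1.assign(choice=df1["lesion_name"].map(priority))
# Names that match none of the three choices become NaN in lung_ref_seg
ranked["lung_ref_seg"] = ranked["lesion_name"].where(ranked["choice"] < 3)

df2 = (
    ranked.sort_values(["series_id", "choice"])
          .drop_duplicates(subset=["series_id"])  # keeps the best-ranked row per series_id
          [["series_id", "lung_ref_seg"]]
          .reset_index(drop=True)
)
```

Because the checks run in priority order, the last one can test for a plain "Pred": any name containing "PredCorr" has already been ranked as the first choice. A series_id with no usable name stays in df2 with NaN in lung_ref_seg.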

I found an answer that seems to work for now!

# Remove every lesion_name containing the string "tum" before creating the lung_ref_seg column (NaN values are kept)

test = test[~test.lesion_name.str.contains("tum",na=False)]

# Pick one lesion_name per series_id: PredCorr first, then From, then Pred as the last choice

def LungRefLesionName(group):
    names = group.lesion_name
    # "a" or "b" / "a" and "b" evaluate to a single string in Python,
    # so each condition gets its own mask and the masks are combined with &
    lung = names.str.contains("lung", na=False)
    choices = [
        lung & names.str.contains("PredCorr", na=False),  # first choice
        lung & names.str.contains("From", na=False),      # second choice
        lung & names.str.contains("Pred", na=False),      # third choice, only reached without PredCorr
    ]
    for mask in choices:
        if mask.any():
            return group[mask]
    # No choice matched: keep the rows where lesion_name is NaN
    return group[names.isna()]

# Apply the function
test = test.groupby("series_id").apply(LungRefLesionName).reset_index(drop=True)


# Drop columns that we don't need anymore : segmentation_id, lesion_id, series_id and study_id

test = test.drop(['segmentation_id', 'lesion_id', 'series_id', 'study_id'], axis = 1)

# Rename column lesion_name to lung_ref_lesion_name

test = test.rename(columns={"lesion_name": "lung_ref_lesion_name"})

I'll have to handle lesion_name values containing 'tum' later, so I guess I'll have to change some things, but for now this code works for manipulating strings with 'lung'!
