
Regex to extract substring from pandas DataFrame column

I have the following column in a DataFrame:

col1
['SNOMEDCT_US:32113001', 'UMLS:C0265660']
['UMLS:C2674738', 'UMLS:C2674739']
['UMLS:C1290857', 'SNOMEDCT_US:118930001', 'UMLS:C123455']

I would like to extract the values after UMLS: and store them in another column. I tried the following line of code, but I am not getting the expected output.

df['col1'].str.extract(r'\['.*UMLS:(.*)]')

The expected output is:

col1                                                            col2
['SNOMEDCT_US:32113001', 'UMLS:C0265660']                       C0265660
['UMLS:C2674738', 'UMLS:C2674739']                              C2674738, C2674739
['UMLS:C1290857', 'SNOMEDCT_US:118930001', 'UMLS:C123455']      C1290857, C123455

You can use DataFrame.explode to turn the rows of lists into rows of individual strings. Then use Series.str.extract to match the desired regular expression. Finally, use DataFrame.groupby and agg to turn col1 back into its original form, with col2 added as desired:

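# One row per list element; the original row index is repeated.
df = df.explode("col1")
# Capture the code after "UMLS:"; rows without a match become NaN.
df["col2"] = df["col1"].str.extract(r"UMLS:(.+)", expand=False)
# Regroup by the original index and join the non-NaN matches.
df = df.groupby(level=0).agg({
    "col1": list,
    "col2": lambda x: ", ".join(x.dropna())
})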

This outputs:

                                                col1                col2
0              [SNOMEDCT_US:32113001, UMLS:C0265660]            C0265660
1                     [UMLS:C2674738, UMLS:C2674739]  C2674738, C2674739
2  [UMLS:C1290857, SNOMEDCT_US:118930001, UMLS:C1...   C1290857, C123455
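If the cells in col1 are actually strings that only look like lists (the quotes and brackets in the question suggest this may be the case), explode will leave them untouched. Here is a minimal self-contained sketch of the same steps, assuming string cells and parsing them with ast.literal_eval first. The sample DataFrame and the names exploded and result are only for illustration:

```python
import ast

import pandas as pd

df = pd.DataFrame({
    "col1": [
        "['SNOMEDCT_US:32113001', 'UMLS:C0265660']",
        "['UMLS:C2674738', 'UMLS:C2674739']",
        "['UMLS:C1290857', 'SNOMEDCT_US:118930001', 'UMLS:C123455']",
    ]
})

# Assumption: each cell is a string that looks like a Python list.
# Skip this step if the cells already hold real lists.
df["col1"] = df["col1"].apply(ast.literal_eval)

# One row per list element, keeping the original index.
exploded = df.explode("col1")
# Capture the code after "UMLS:"; non-matching rows become NaN.
exploded["col2"] = exploded["col1"].str.extract(r"UMLS:(.+)", expand=False)
# Regroup by the original index and join the non-NaN matches.
result = exploded.groupby(level=0).agg(
    {"col1": list, "col2": lambda s: ", ".join(s.dropna())}
)
print(result["col2"].tolist())
```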

I used a different regular expression, which I tested at https://regex101.com/:

UMLS:(\w*)

With the following command, I got a new column with the data formatted as you wanted:

import re
df['new'] = df['input'].apply(lambda x: re.findall(r"UMLS:(\w*)", x)).apply(lambda x: ','.join(map(str, x)))

The first .apply() call is based on this answer. The re.findall function returns a list of matches, for example ['C2674738', 'C2674739'].

Since you want a single string containing all the matches found, the second .apply() call (based on another answer) converts the list into a comma-delimited string.

I hope there is a more elegant answer :-)
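One possibly tidier variant, as a sketch only: pandas has vectorized Series.str.findall and Series.str.join, so the two .apply() calls can be written without lambdas. This assumes, as above, that the column holds plain strings and is named input; the sample data is made up from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "input": [
        "['SNOMEDCT_US:32113001', 'UMLS:C0265660']",
        "['UMLS:C2674738', 'UMLS:C2674739']",
        "['UMLS:C1290857', 'SNOMEDCT_US:118930001', 'UMLS:C123455']",
    ]
})

# str.findall returns, for each row, a list of every capture-group match.
# str.join then joins each of those lists into a single string.
df["new"] = df["input"].str.findall(r"UMLS:(\w+)").str.join(", ")
print(df["new"].tolist())
```

Rows with no UMLS entry end up as an empty string rather than NaN, because findall returns an empty list for them.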
