Regex to extract substring from pandas DataFrame column
I have the following column in a DataFrame.
col1
['SNOMEDCT_US:32113001', 'UMLS:C0265660']
['UMLS:C2674738', 'UMLS:C2674739']
['UMLS:C1290857', 'SNOMEDCT_US:118930001', 'UMLS:C123455']
I would like to extract the value after UMLS: and store it in another column. I am trying the following line of code, but I am not getting the expected output.
df['col1'].str.extract(r'\['.*UMLS:(.*)]')
The expected output is:
col1 col2
['SNOMEDCT_US:32113001', 'UMLS:C0265660'] C0265660
['UMLS:C2674738', 'UMLS:C2674739'] C2674738, C2674739
['UMLS:C1290857', 'SNOMEDCT_US:118930001', 'UMLS:C123455'] C1290857, C123455
You can use DataFrame.explode to turn the rows of lists into rows of individual strings. Then, you can use Series.str.extract to match the desired regular expression. Finally, you can use DataFrame.groupby and DataFrame.agg to turn col1 back into its original form, with col2 as desired:
df = df.explode("col1")
df["col2"] = df["col1"].str.extract(r"UMLS:(.+)")
df = df.groupby(level=0).agg({
    "col1": list,
    # item == item is False only for NaN, so rows without a match are skipped
    "col2": lambda x: ", ".join(item for item in x if item == item)
})
This outputs:
col1 col2
0 [SNOMEDCT_US:32113001, UMLS:C0265660] C0265660
1 [UMLS:C2674738, UMLS:C2674739] C2674738, C2674739
2 [UMLS:C1290857, SNOMEDCT_US:118930001, UMLS:C1... C1290857, C123455
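For anyone who wants to run the steps above end to end, here is a minimal sketch. It assumes col1 holds real Python lists (not their string representations), rebuilds the sample data from the question, and uses the names exploded and result purely for illustration. It also passes expand=False so that str.extract returns a Series, and uses dropna() in place of the item == item check; both are stylistic substitutions, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question; assumes col1 holds real Python lists.
df = pd.DataFrame({
    "col1": [
        ["SNOMEDCT_US:32113001", "UMLS:C0265660"],
        ["UMLS:C2674738", "UMLS:C2674739"],
        ["UMLS:C1290857", "SNOMEDCT_US:118930001", "UMLS:C123455"],
    ]
})

# One row per list element; the original row label is repeated in the index.
exploded = df.explode("col1")

# expand=False returns a Series; elements without "UMLS:" become NaN.
exploded["col2"] = exploded["col1"].str.extract(r"UMLS:(.+)", expand=False)

# Regroup by the original row label, dropping NaN before joining.
result = exploded.groupby(level=0).agg({
    "col1": list,
    "col2": lambda s: ", ".join(s.dropna()),
})

print(result["col2"].tolist())
```

If col1 actually contains strings such as "['UMLS:C2674738', ...]" (for example after reading a CSV), explode will not split them, and the values would need to be parsed into lists first.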
I used a different regex, which I tested at https://regex101.com/:
UMLS:(\w*)
With the following command, I got a new column with the data formatted as you desired:
import re
df['new'] = df['input'].apply(lambda x: re.findall(r"UMLS:(\w*)", x)).apply(lambda x: ','.join(map(str, x)))
The first .apply() call is based on this answer. The findall function returns a list ([C2674738, C2674739]).
Since you want a string containing every match that is found, the second .apply() call (based on this answer) converts the list into a comma-delimited string.
I hope there is a more elegant answer :-)
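One possibly tidier variant is to stay within the pandas string accessor: Series.str.findall collects every match per row, and Series.str.join concatenates each resulting list. The sketch below is only an illustration under the same assumption as the re.findall approach, namely that the column holds strings rather than lists; the column names input and new follow the snippet above, and the sample rows are copied from the question. It uses \w+ instead of \w* and ", " as the separator to match the expected output in the question.

```python
import pandas as pd

# Assumes each cell is a string (e.g. as read from a CSV), not a Python list.
df = pd.DataFrame({
    "input": [
        "['SNOMEDCT_US:32113001', 'UMLS:C0265660']",
        "['UMLS:C2674738', 'UMLS:C2674739']",
        "['UMLS:C1290857', 'SNOMEDCT_US:118930001', 'UMLS:C123455']",
    ]
})

# findall returns a list of captured groups per row; join merges each list.
df["new"] = df["input"].str.findall(r"UMLS:(\w+)").str.join(", ")

print(df["new"].tolist())
```

Rows with no UMLS entry would end up with an empty string here, since findall returns an empty list for them.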