
Extract string before colon or parenthesis with regex in Python

I'm trying to extract the string muscle pain from the following strings. I need a single regular expression that works for all three cases.

string1 = 'A1 muscle pain: immunotherapy'
string2 = 'A2B_45 muscle pain: topical medicine e.g. ....'
string3 = 'A2_45 muscle pain (pain): topical medicine e.g. ....'

The following code works for string1 and string2, but it does not work for string3: what I get is always muscle pain (pain). Can anyone help me with that? I have tried many different expressions but could not figure out how to do it.

re.match(r"^[A-Z]+\d*[A-Z]*_?\d*\s(.*)[:\(]", string3).group(1)

You can shorten the expression to:

^A\S+\s([^:(]*)(?=:|\s\()
  • ^A Matches a literal A at the beginning of the string (^ asserts the start position).
  • \S+ One or more non-whitespace characters.
  • \s A single whitespace character.
  • ([^:(]*) Capture group. Matches and captures anything other than a colon : or an opening parenthesis ( .
  • (?=:|\s\() Positive lookahead for : or for whitespace followed by ( . This is what keeps the trailing space out of the captured group.

Try it live here.


Python snippet:

import re
string1 = 'A1 muscle pain: immunotherapy'
string2 = 'A2B_45 muscle pain: topical medicine e.g. ....'
string3 = 'A2_45 muscle pain (pain): topical medicine e.g. ....'

print(re.match(r'^A\S+\s([^:(]*)(?=:|\s\()',string3).group(1))
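
The snippet above only prints the result for string3. As a minimal sketch, the same pattern can be checked against all three strings from the question. The helper name extract and the None fallback for non-matching input are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern from the answer above, compiled once for reuse.
PATTERN = re.compile(r'^A\S+\s([^:(]*)(?=:|\s\()')

# The three sample strings from the question.
samples = [
    'A1 muscle pain: immunotherapy',
    'A2B_45 muscle pain: topical medicine e.g. ....',
    'A2_45 muscle pain (pain): topical medicine e.g. ....',
]

def extract(text):
    """Return the first capture group, or None if the pattern does not match.

    Hypothetical helper: it avoids the AttributeError that .group(1)
    would raise when re.match returns None.
    """
    match = PATTERN.match(text)
    return match.group(1) if match else None

results = [extract(s) for s in samples]
print(results)
```

Note that this pattern assumes every string starts with a capital A, as the samples do; input with a different prefix would need a different leading part.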

Try this pattern: ^[\dA-Z_]+ ([^\(:]+)

It starts with [\dA-Z_]+ at the beginning (note the anchor ^ ), followed by a space. Then the capturing group begins and runs until one of the unwanted characters is met: [^\(:] . You can add more "unwanted" characters to that class to make the regex stop at other delimiters.

The first capturing group is what you want.
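
A quick check of this pattern against the question's three strings, as a sketch: for the third string the capture stops at the opening parenthesis, so it keeps the space before it. Calling str.strip() afterwards is one possible way to drop that space; it is an addition here, not something the answer itself prescribes.

```python
import re

# Pattern from this answer: prefix of digits, capitals and underscores,
# one space, then everything up to the first '(' or ':'.
pattern = re.compile(r'^[\dA-Z_]+ ([^\(:]+)')

# The three sample strings from the question.
samples = [
    'A1 muscle pain: immunotherapy',
    'A2B_45 muscle pain: topical medicine e.g. ....',
    'A2_45 muscle pain (pain): topical medicine e.g. ....',
]

# Raw captures: the third one ends with a space, because the class
# [^\(:] also matches the space that precedes '('.
raw = [pattern.match(s).group(1) for s in samples]

# Assumed post-processing step: strip surrounding whitespace.
cleaned = [r.strip() for r in raw]

print(raw)
print(cleaned)
```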

Demo

To drop the trailing space that gets captured for the third string, you could try this pattern instead: ^[\dA-Z_]+ ([\w ]+)(?=(:| \()) . See demo.
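
A sketch of how the lookahead variant behaves on the question's strings. One caveat worth knowing: because the capture group is [\w ]+, it only accepts word characters and spaces, so text containing a hyphen or other punctuation will not match. The hyphenated sample below is a made-up string for illustration, not from the question.

```python
import re

# Lookahead variant from this answer. Group 1 is the wanted text;
# group 2 is the delimiter captured inside the lookahead and is ignored here.
pattern = re.compile(r'^[\dA-Z_]+ ([\w ]+)(?=(:| \())')

# The three sample strings from the question.
samples = [
    'A1 muscle pain: immunotherapy',
    'A2B_45 muscle pain: topical medicine e.g. ....',
    'A2_45 muscle pain (pain): topical medicine e.g. ....',
]

captured = [pattern.match(s).group(1) for s in samples]
print(captured)

# Hypothetical input with a hyphen: '-' is not in [\w ], so the lookahead
# can never be satisfied and match() returns None.
hyphenated = pattern.match('A3 lower-back pain: rest')
print(hyphenated)
```

If the real data can contain such characters, the negated-class patterns from the other answers are likely the more forgiving choice.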
