Extract string before colon or parenthesis with regex in Python
I'm trying to extract the string muscle pain from the following strings. I need a regular expression that works for all three cases.
string1 = 'A1 muscle pain: immunotherapy'
string2 = 'A2B_45 muscle pain: topical medicine e.g. ....'
string3 = 'A2_45 muscle pain (pain): topical medicine e.g. ....'
The following code works for string1 and string2, but it does not work for string3. What I always get is muscle pain (pain). Can anyone help me with that? I have tried many different expressions but could not figure out how to do it.
re.match(r"^[A-Z]+\d*[A-Z]*_?\d*\s(.*)[:\(]", string3).group(1)
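The behaviour described above can be reproduced with a short sketch. The greedy (.*) consumes the rest of the string and backtracks only as far as the last : or ( it can find, which in string3 is the colon after (pain). The lazy variant below is a hypothetical alternative added for illustration and is not part of the original post.

```python
import re

string3 = 'A2_45 muscle pain (pain): topical medicine e.g. ....'

# Pattern from the question: the greedy (.*) runs to the end of the string
# and backtracks only to the LAST ':' or '(' it can find.
greedy = re.match(r"^[A-Z]+\d*[A-Z]*_?\d*\s(.*)[:\(]", string3).group(1)

# Hypothetical lazy variant (an assumption, not from the post): (.*?) stops at
# the FIRST ':' or '(' and \s* absorbs the space before the parenthesis.
lazy = re.match(r"^[A-Z]+\d*[A-Z]*_?\d*\s(.*?)\s*[:\(]", string3).group(1)

print(repr(greedy))  # 'muscle pain (pain)'
print(repr(lazy))    # 'muscle pain'
```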
You can shorten the expression to:
^A\S+\s([^:(]*)(?=:|\s\()
^A
Asserts the position at the beginning of the string (^), then matches a literal A.
\S+
One or more non-whitespace characters.
\s
A single whitespace character.
([^:(]*)
Capture group. Matches and captures anything other than a colon or an opening parenthesis.
(?=:|\s\()
Positive lookahead for : or for whitespace followed by (. Because the lookahead must succeed, the capture group backtracks and gives up the trailing space in front of " (".
Python snippet:
import re
string1 = 'A1 muscle pain: immunotherapy'
string2 = 'A2B_45 muscle pain: topical medicine e.g. ....'
string3 = 'A2_45 muscle pain (pain): topical medicine e.g. ....'
print(re.match(r'^A\S+\s([^:(]*)(?=:|\s\()',string3).group(1))
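As a quick check, the same pattern can be run against all three strings from the question. This is a minimal sketch; the helper name extract_label and the results list are introduced here for illustration only.

```python
import re

# Pattern from the answer above, compiled once for reuse.
PATTERN = re.compile(r'^A\S+\s([^:(]*)(?=:|\s\()')

strings = [
    'A1 muscle pain: immunotherapy',
    'A2B_45 muscle pain: topical medicine e.g. ....',
    'A2_45 muscle pain (pain): topical medicine e.g. ....',
]

def extract_label(text):
    """Return the captured text, or None when the pattern does not match."""
    match = PATTERN.match(text)
    return match.group(1) if match else None

results = [extract_label(s) for s in strings]
print(results)  # ['muscle pain', 'muscle pain', 'muscle pain']
```

Returning None instead of calling .group(1) directly avoids an AttributeError on input that does not match, for example a string that does not start with A.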
Try this pattern: ^[\dA-Z_]+ ([^\(:]+)
It starts with [\dA-Z_]+ at the beginning (note the anchor ^), followed by a space. The capturing group then runs until one of the unwanted characters is met: [^\(:]. You can add more "unwanted" characters to that class to make the regex stop at other characters as well.
The first capturing group is what you want.
In the third case the captured text ends with a space. To remove it, you could try this pattern: ^[\dA-Z_]+ ([\w ]+)(?=(:| \())
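The difference between the two patterns in this answer shows up only on string3. The sketch below compares them; the variable names are illustrative, and calling .strip() on the first result is an alternative assumed here rather than something the answer proposes.

```python
import re

string3 = 'A2_45 muscle pain (pain): topical medicine e.g. ....'

# First pattern: captures everything up to the first '(' or ':',
# so the space before '(' is included.
simple = re.match(r'^[\dA-Z_]+ ([^\(:]+)', string3).group(1)

# Second pattern: the lookahead forces the capture to end right before
# ':' or ' (', so the trailing space is dropped.
lookahead = re.match(r'^[\dA-Z_]+ ([\w ]+)(?=(:| \())', string3).group(1)

print(repr(simple))          # 'muscle pain '
print(repr(lookahead))       # 'muscle pain'
print(repr(simple.strip()))  # 'muscle pain'
```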