
Regex to match between certain characters

I have strings like this:

"1. yada yada yada (This is a string; "This is a thing")
 2. blah blah blah (This is also a string)"

I want to return:

['this is a string', 'this is also a string']

So it should match everything between '(' and ';', or between '(' and ')'.

This is what I have so far in Python. It matches the sections I want, but I can't figure out how to cut them down to return only what I really want inside them:

import re
pattern = re.compile(r'\([a-zA-Z ;"]+\)|\([a-zA-Z ]+\)')
re.findall(pattern, text)  # text is a placeholder name for the string shown above

It returns this:

['(This is a string; "This is a thing")', '(This is also a string)']

EDIT ADDED FOR MORE INFO:

I realized there are more parentheses above the numbered text sections, and I want to omit those:

"some text and stuff (some more info)
 1. yada yada yada (This is a string; "This is a thing")
 2. blah blah blah (This is also a string)"

I don't want to match "(some more info)", but I am not sure how to include only the text after the numbers (e.g. 1. lskdfjlsdjfds(string I want)).

You can use

\(([^);]+)

The regex demo is available here.

Note the capturing group I set with unescaped parentheses: re.findall returns the value captured by this subpattern, not the whole match.

It matches

  • \( - a literal (
  • ([^);]+) - matches and captures one or more characters other than ) or ;

Python demo:

import re
p = re.compile(r'\(([^);]+)')
test_str = "1. yada yada yada (This is a string; \"This is a thing\")\n2. blah blah blah (This is also a string)"
print(p.findall(test_str)) # => ['This is a string', 'This is also a string']
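
To make the point about the capturing group concrete, here is a minimal sketch that runs the same pattern with and without the group. The variable names (whole, captured, pairs) are illustrative only and are not part of the original answer.

```python
import re

test_str = ('1. yada yada yada (This is a string; "This is a thing")\n'
            '2. blah blah blah (This is also a string)')

# No capturing group: findall returns each whole match, opening parenthesis included.
whole = re.findall(r'\([^);]+', test_str)

# One capturing group: findall returns only the captured text.
captured = re.findall(r'\(([^);]+)', test_str)

# finditer gives access to both: group(0) is the whole match, group(1) the capture.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r'\(([^);]+)', test_str)]

print(whole)     # ['(This is a string', '(This is also a string']
print(captured)  # ['This is a string', 'This is also a string']
```

If the whole match is ever needed alongside the captured text, re.finditer is probably the more convenient call.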

I would suggest

^[^\(]*\(([^;\)]+)

Splitting it into parts:

# ^         - start of string
# [^\(]*    - everything that's not an opening bracket
# \(        - opening bracket
# ([^;\)]+) - capture everything that's not semicolon or closing bracket

Unless, of course, you wish to impose (or drop) some requirements on the "blah blah blah" part.

You can drop the first two parts, but then it will match some things it probably shouldn't... or maybe it should. It all depends on what your objectives are.

P.S. I missed that you want to find all instances, so the multiline flag needs to be set:
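
As a side note, the part-by-part breakdown above can be kept inside the pattern itself by using the re.VERBOSE flag, which ignores whitespace outside character classes and allows # comments. This is only a sketch of one way to write it; the sample line and the name m are made up for illustration.

```python
import re

pattern = re.compile(r'''
    ^          # start of the string
    [^(]*      # everything that is not an opening parenthesis
    \(         # the opening parenthesis itself
    ([^;)]+)   # capture everything up to a semicolon or a closing parenthesis
''', re.VERBOSE)

m = pattern.search('1. yada yada yada (This is a string; "This is a thing")')
print(m.group(1))  # This is a string
```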

pattern = re.compile(r'^[^\(]*\(([^;\)]+)', re.MULTILINE)
matches = pattern.findall(string_to_search)

It is important to check for the beginning of the line, because your input can be:

"""1. yada yada yada (This is a string; "This is a (thing)")
2. blah blah blah (This is also a string)"""
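
The difference can be sketched on a combined input. The text below is an assumption: it merges the nested "(thing)" example with the unnumbered "(some more info)" line from the question's edit. The third pattern, which also requires a leading "N." prefix, is a hypothetical variant aimed at that edit and does not come from either answer.

```python
import re

# Assumed sample: the question's edited input plus the nested "(thing)" case.
text = '''some text and stuff (some more info)
1. yada yada yada (This is a string; "This is a (thing)")
2. blah blah blah (This is also a string)'''

# Unanchored: every "(" starts a match, so the nested "(thing" is picked up too.
unanchored = re.findall(r'\(([^;)]+)', text)

# Anchored with MULTILINE: only the first "(" on each line is used.
anchored = re.findall(r'^[^(]*\(([^;)]+)', text, re.MULTILINE)

# Hypothetical variant: also require a leading number and dot, skipping unnumbered lines.
numbered = re.findall(r'^\s*\d+\.[^(\n]*\(([^;)]+)', text, re.MULTILINE)

print(unanchored)  # ['some more info', 'This is a string', 'thing', 'This is also a string']
print(anchored)    # ['some more info', 'This is a string', 'This is also a string']
print(numbered)    # ['This is a string', 'This is also a string']
```

Whether the numbered variant is appropriate depends on how strictly the real input follows the "N. text (wanted part)" layout.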
