Using regex in python to find a string

I'm trying to find a substring of a string s, starting with {{Infobox and ending with }}. I tried doing this with a regular expression, but it doesn't get any results. I think the fault is in my regular expression, but since I'm quite new to regex, I hope someone can help with this. String s is for example:

import re

s = '{{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}'

# raw string so the backslash escapes reach the regex engine unchanged
result = re.search(r'(.*)\{\{Infobox (.*)\}\}(.*)', s)
if result:
    print(result.group(2))

You can use lazy dot matching since your delimiters are not one-symbol delimiters, and capture what you need into group 1:

import re
p = re.compile(r'\{\{Infobox\s*(.*?)}}')
test_str = "{{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}"
match = p.search(test_str)
if match:
    print(match.group(1))

See IDEONE demo
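To make the difference between greedy and lazy matching visible, here is a small sketch run against the sample string from the question. The variable names greedy and lazy are only illustrative; the patterns are the one above with and without the ? after .* :

```python
import re

s = ('{{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| '
     'afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}')

# Greedy: .* runs to the end of the string and backtracks to the LAST }}
greedy = re.search(r'\{\{Infobox\s*(.*)\}\}', s)

# Lazy: .*? stops at the FIRST }} after {{Infobox
lazy = re.search(r'\{\{Infobox\s*(.*?)\}\}', s)

print(greedy.group(1))  # includes the trailing "}}{{blabla"
print(lazy.group(1))    # only the Infobox content
```

With the greedy version the capture spills into the next template, which is what happens with the (.*) groups in the question's pattern.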

If you use a negated character class, any { or } inside the Infobox will prevent the whole substring from matching.

Also, since you do not seem to need the text before and after the substring, you do not need to match (or capture) it at all, so I removed those groups.
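One more thing that may matter, assuming the real text spans several lines (the sample in the question is a single line, so this is a guess): by default . does not match a newline, so the lazy pattern finds nothing unless re.DOTALL is passed. A minimal sketch with a made-up multi-line input:

```python
import re

# Hypothetical multi-line version of the sample, for illustration only
text = ("{{blabla}}\n"
        "{{Infobox persoon\n"
        "| naam=Albert Speer\n"
        "| afbeelding=Albert Speer Neurenberg.JPG\n"
        "}}\n"
        "{{blabla}}")

pattern = r'\{\{Infobox\s*(.*?)\}\}'

no_flag = re.search(pattern, text)               # . cannot cross the line breaks
with_flag = re.search(pattern, text, re.DOTALL)  # . now also matches "\n"

print(no_flag)
if with_flag:
    print(with_flag.group(1))
```

Without the flag the search returns None here, which would look like "no results".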

Code:

import re
s = '{{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}'

result = re.search(r'(.*){{Infobox ([^}]*?)}}(.*)', s)
if result:
    print(result.group(2))

Output:

persoon Tweede Wereldoorlog| naam=Albert Speer| afbeelding=Albert Speer Neurenberg.JPG

NOTE: The above regex matches only up to the first } after {{Infobox.

Important note:

This will only work for cases like the given sample input.

It will not work if the input has a } in between, i.e. {{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| }afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}. For cases like that, stribizhev's answer is the best solution.
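The limitation can be checked with a short sketch. It runs the negated-class pattern and the lazy pattern from stribizhev's answer against the input with the stray } shown above; the names negated and lazy are just for illustration:

```python
import re

# Sample input with a single } inside the Infobox
s = ('{{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| '
     '}afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}')

# [^}] can never step over the stray }, so the closing }} is never reached
negated = re.search(r'\{\{Infobox ([^}]*?)\}\}', s)

# .*? may consume the stray } and stops at the first real }}
lazy = re.search(r'\{\{Infobox\s*(.*?)\}\}', s)

print(negated)
if lazy:
    print(lazy.group(1))
```

Here the negated-class search returns None, while the lazy pattern still captures the full Infobox content.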

import re

s = '{{blabla}}{{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| afbeelding=Albert Speer Neurenberg.JPG}}{{blabla}}'

# match the two chars before "Infobox", then everything that is not '}', then the next two chars
mo = re.search(r'(..Infobox[^}]*..)', s)
if mo:
    print(mo.group(1))

# {{Infobox persoon Tweede Wereldoorlog| naam=Albert Speer| afbeelding=Albert Speer Neurenberg.JPG}}
