How do I extract with regex all the text (numbers, letters, symbols) after the second capital letter?
They won. Elles gagnèrent.
They won. Ils ont gagné.
They won. Elles ont gagné.
Tom came. Tom est venu.
Tom died. Tom est mort.
Tom knew. Tom savait.
Tom left. Tom est parti.
Tom left. Tom partit.
Tom lied. Tom a menti.
Tom lies. Tom ment.
Tom lost. Tom a perdu.
Tom paid. Tom a payé.
I'm having some trouble putting together a regex pattern that extracts all the text after the second capital letter (including it).
For example:
They won. Elles gagnèrent.
In this case it should extract:
Elles gagnèrent.
This is my code, but it is not working well:
import re
line = "They won. Elles gagnèrent." #for example this case
match = re.search(r"\s¿?(?:A|Á|B|C|D|E|É|F|G|H|I|Í|J|K|LL|L|M|N|Ñ|O|Ó|P|Q|R|S|T|U|Ú|V|W|X|Y|Z)\s((?:\w\s)+)?" , line)
n_sense = match.group()
print(repr(n_sense)) #should print "Elles gagnèrent."
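You may try the following code.
import re

# `file` is the path to the input text file
with open(file, "r") as r:
    for line in r:
        # Remove everything before the second (ASCII) capital letter
        line = re.sub('^[^A-Z]*[A-Z][^A-Z]*', '', line)
        print(line, end="")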
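To try the substitution without a file, here is a minimal sketch that applies the same pattern to strings in memory. The names `PREFIX`, `strip_to_second_capital` and `lines` are made up for illustration. Keep in mind that `[A-Z]` only covers ASCII capitals, and that a line with a single capital letter ends up truncated right after that capital.

```python
import re

# Same pattern as above: everything up to (not including) the second ASCII capital.
PREFIX = re.compile(r'^[^A-Z]*[A-Z][^A-Z]*')

def strip_to_second_capital(line):
    """Hypothetical helper: drop the text before the second ASCII capital letter."""
    return PREFIX.sub('', line)

# Sample data taken from the question
lines = ["They won. Elles gagnèrent.", "Tom came. Tom est venu."]
for line in lines:
    print(strip_to_second_capital(line))
```

Compiling the pattern once is only a convenience here; `re.sub` with the raw string behaves the same way.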
Here is the regex:
[^A-Z]*[A-Z][^A-Z]*[A-Z]([^\n]+)
The parentheses wrap the text you want, which is called a group. You can easily look up how groups work in Python.
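As a rough illustration of how a group is read back in Python, the sketch below runs the pattern (with its `A-Z` ranges written out) through `re.search` and prints group 1. One thing it shows is that the group starts right after the second capital; the second pattern is a variant, assumed here to be closer to what the question asks for, that moves the parenthesis so the capital is included.

```python
import re

line = "They won. Elles gagnèrent."

# Pattern from the answer: group 1 begins right AFTER the second capital letter.
after = re.search(r'[^A-Z]*[A-Z][^A-Z]*[A-Z]([^\n]+)', line)
print(after.group(1))       # lles gagnèrent.

# Variant (an assumption about the intent): include the second capital in the group.
including = re.search(r'[^A-Z]*[A-Z][^A-Z]*([A-Z][^\n]*)', line)
print(including.group(1))   # Elles gagnèrent.
```

`group(0)` (or `group()` with no argument) would return the whole match instead, including the first sentence.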
Better still, here is a tool for testing patterns: https://regex101.com/
You can search for the match as you describe it:
[A-Z].*?([A-Z].*)
That's an uppercase letter, followed by zero or more of anything, followed by another uppercase letter followed by anything, capturing the last group:
import unicodedata
import re

# Note the accented capitals (Â, É) on the second-to-last line
s = '''They won. Elles gagnèrent.
They won. Ils ont gagné.
They won. Elles ont gagné.
Tom came. Tom est venu.
Tom died. Tom est mort.
Tom knew. Tom savait.
Tom left. Tom est parti.
Tom left. Tom partit.
Tom lied. Tom a menti.
Tom lies. Tom ment.
Âom lost. Étienne a perdu.
Tom paid. Tom a payé.'''

# Decompose accented letters into a base letter plus combining marks
s = unicodedata.normalize('NFD', s)
re.findall(r'[A-Z].*?([A-Z].*)', s, re.UNICODE)
Which will give you:
['Elles gagnèrent.',
'Ils ont gagné.',
'Elles ont gagné.',
'Tom est venu.',
'Tom est mort.',
'Tom savait.',
'Tom est parti.',
'Tom partit.',
'Tom a menti.',
'Tom ment.',
'Étienne a perdu.',
'Tom a payé.']
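To see why the `unicodedata.normalize('NFD', s)` call matters for the accented line, here is a small sketch on a single word. The variable names are made up for illustration, and the word is written with an escape so that it is guaranteed to start in composed form.

```python
import re
import unicodedata

word = "\u00c9tienne"   # composed form: É is the single code point U+00C9

# É lies outside the ASCII range A-Z, so the class does not match it.
print(re.match(r'[A-Z]', word))          # None

# NFD splits É into a plain 'E' followed by U+0301 (combining acute accent).
decomposed = unicodedata.normalize('NFD', word)
print(re.match(r'[A-Z]', decomposed))    # matches the plain 'E'

# The matched text stays decomposed; recompose with NFC if that matters downstream.
print(unicodedata.normalize('NFC', decomposed) == word)   # True
```

The strings returned by `re.findall` on normalized input are therefore in decomposed form too, even though they print the same.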
If all those spaces are part of the actual text, it may be easier to match on them or to split on them. Note that re.UNICODE is already the default for str patterns in Python 3 and does not change what [A-Z] matches. Accented uppercase letters like the É in Étienne match only because the text is normalized to NFD first, which separates É into a plain E plus a combining accent, so make sure the unicode is normalized before matching.
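The splitting idea can be sketched as follows. This assumes, and it is only an assumption since the spacing is not visible here, that the two sentences on each line are separated by a run of two or more whitespace characters; the variable names are illustrative.

```python
import re

# Assumption: source and translation are separated by 2+ whitespace characters.
line = "They won.      Elles gagnèrent."

# Split once on the first wide gap; single spaces inside a sentence are left alone.
parts = re.split(r'\s{2,}', line.strip(), maxsplit=1)
translation = parts[1] if len(parts) > 1 else None
print(translation)   # Elles gagnèrent.
```

This sidesteps capital letters and accents entirely, but it only works if the wide gap really is present in the data.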