
How do I extract with regex all the text (numbers, letters, symbols) after the second capital letter?

They won.             Elles gagnèrent.
They won.    Ils ont gagné.
They won.        Elles ont gagné.
Tom came.    Tom est venu.
Tom died.       Tom est mort.
Tom knew. Tom savait.
Tom left.    Tom est parti.
Tom left.       Tom partit.
Tom lied. Tom a menti.
Tom lies.    Tom ment.
Tom lost.            Tom a perdu.
Tom paid.    Tom a payé.

I'm having some trouble putting together a regex pattern that extracts all the text from the second capital letter onward (including that letter).

For example:

They won.             Elles gagnèrent.

In this case it should extract:

Elles gagnèrent.

This is my code, but it is not working well:

import re

line = "They won.             Elles gagnèrent." #for example this case

match = re.search(r"\s¿?(?:A|Á|B|C|D|E|É|F|G|H|I|Í|J|K|LL|L|M|N|Ñ|O|Ó|P|Q|R|S|T|U|Ú|V|W|X|Y|Z)\s((?:\w\s)+)?" , line)

n_sense = match.group()

print(repr(n_sense)) #should print "Elles gagnèrent."

You may try the following code.

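import re

file = "sentences.txt"  # hypothetical path; use your own file
with open(file, "r", encoding="utf-8") as r:
    for line in r:
        # Remove everything before the second ASCII capital letter.
        line = re.sub('^[^A-Z]*[A-Z][^A-Z]*', '', line)
        print(line, end="")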
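
To try the substitution without a file, the same pattern can be applied to the sample line from the question. This is only a minimal sketch: the helper name `strip_before_second_capital` is made up for illustration, and `[A-Z]` covers unaccented ASCII capitals only.

```python
import re

# Same pattern as the answer above: everything up to, but not including,
# the second ASCII capital letter.
PREFIX = re.compile(r'^[^A-Z]*[A-Z][^A-Z]*')

def strip_before_second_capital(line):
    """Hypothetical helper: drop the text before the second ASCII capital."""
    return PREFIX.sub('', line, count=1)

print(strip_before_second_capital("They won.             Elles gagnèrent."))
```

Note that a line containing only one capital letter is matched in full by this pattern, so it comes back empty.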

Here is the regex: [^A-Z]*[A-Z][^A-Z]*[A-Z]([^\n]+)

The parentheses wrap the text you want; this is called a group. You will easily find out how groups work in Python.
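
As a rough illustration of how the group is read back in Python, the sketch below uses `re.search` and `match.group(1)`. The names `AFTER`, `FROM` and `first_group` are illustrative and not part of the answer. As written, the regex captures the text after the second capital; the variant with the parenthesis moved one token to the left is an adjustment that keeps the capital, which is what the question asks for.

```python
import re

# The regex exactly as given above: group 1 starts AFTER the second capital.
AFTER = re.compile(r'[^A-Z]*[A-Z][^A-Z]*[A-Z]([^\n]+)')
# Illustrative variant (not from the answer): the parenthesis also wraps the
# second capital, so group 1 starts AT it.
FROM = re.compile(r'[^A-Z]*[A-Z][^A-Z]*([A-Z][^\n]+)')

def first_group(pattern, line):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = pattern.search(line)
    return match.group(1) if match else None

line = "Tom left.       Tom partit."
print(first_group(AFTER, line))  # om partit.
print(first_group(FROM, line))   # Tom partit.
```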

Better still, here is a tool for testing it: https://regex101.com/

You can search for the match as you describe it:

[A-Z].*?([A-Z].*)

That's an uppercase letter, followed by as few characters as possible, followed by another uppercase letter and the rest of the line, capturing that last part as a group:

import unicodedata
import re

# Note the accented capitals (Â, É) in the second-to-last line.
s = '''They won.             Elles gagnèrent.
They won.    Ils ont gagné.
They won.        Elles ont gagné.
Tom came.    Tom est venu.
Tom died.       Tom est mort.
Tom knew. Tom savait.
Tom left.    Tom est parti.
Tom left.       Tom partit.
Tom lied. Tom a menti.
Tom lies.    Tom ment.
Âom lost.            Étienne a perdu.
Tom paid.    Tom a payé.'''

# NFD splits each accented letter into a base letter plus a combining mark,
# so [A-Z] can match the base letter of Â or É.
s = unicodedata.normalize('NFD', s)
re.findall(r'[A-Z].*?([A-Z].*)', s, re.UNICODE)

Which will give you:

['Elles gagnèrent.',
 'Ils ont gagné.',
 'Elles ont gagné.',
 'Tom est venu.',
 'Tom est mort.',
 'Tom savait.',
 'Tom est parti.',
 'Tom partit.',
 'Tom a menti.',
 'Tom ment.',
 'Étienne a perdu.',
 'Tom a payé.']

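If all those spaces are part of the actual text, it may be easier to match on them or to split on them. Accented capitals such as the É in Étienne are matched only because the string is normalized to NFD first: normalization decomposes É into a plain E followed by a combining accent, and [A-Z] then matches the E. The re.UNICODE flag does not change what [A-Z] matches, and in Python 3 it is already the default for str patterns. The returned strings stay in decomposed form; unicodedata.normalize('NFC', ...) recomposes them if needed.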
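
Two alternatives hinted at above can be sketched without any normalization. Both function names are hypothetical. The first relies on `str.isupper()`, which is Unicode-aware. The second assumes each line is a pair of sentences separated by whitespace that follows `.`, `!` or `?`, which is an assumption about the data rather than something the question states.

```python
import re

def from_second_upper(line):
    """Return the text starting at the second uppercase character, or None."""
    seen = 0
    for index, char in enumerate(line):
        if char.isupper():  # Unicode-aware: also True for 'É' and 'Â'
            seen += 1
            if seen == 2:
                return line[index:]
    return None

def split_on_gap(line):
    """Assumption: the translation follows the first ., ! or ? plus whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', line.strip(), maxsplit=1)
    return parts[1] if len(parts) == 2 else None

print(from_second_upper("Âom lost.            Étienne a perdu."))
print(split_on_gap("They won.    Ils ont gagné."))
```

The split variant does not depend on counting capitals, so it would also cope with a first sentence that contains more than one capital letter, provided the punctuation assumption holds.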
