Regex in Python: How to match a word pattern, if not preceded by another word of variable length?

I would like to reconstruct full names from photo captions using regex in Python, by appending the last name back to the first name in patterns like "FirstName1 and FirstName2 LastName". We can rely on names starting with a capital letter.

For example,

'John and Albert McDonald' becomes 'John McDonald' and 'Albert McDonald'

'Stephen Stewart, John and Albert Diamond' becomes 'John Diamond' and 'Albert Diamond'

I need to avoid matching patterns like 'Jay Smith and Albert Diamond', which would generate a non-existent name 'Smith Diamond'.

The photo captions may or may not have more words before this pattern, for example: 'It was a great day hanging out with John and Stephen Diamond.'

This is the code I have so far:

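import re

s = 'John and Albert McDonald'
# Group 1: first name before "and"; group 2: "FirstName LastName" after it.
so = re.search(r'([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?)', s)
if so:
    # Borrow the last name from group 2 for the first person.
    print(so.group(1) + ' ' + so.group(2).split()[1])
    print(so.group(2))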

This returns 'John McDonald' and 'Albert McDonald', but 'Jay Smith and Albert Diamond' results in the non-existent name 'Smith Diamond'.

One idea would be to check whether the pattern is preceded by a capitalized word, something like (?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand\s([A-Z][a-z\-]+\s[A-Z][a-z\-]+(?:[A-Z][a-z]+)?) but unfortunately negative lookbehind only works if we know the exact length of the preceding word, which I don't.
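To illustrate the limitation described above: Python's standard `re` module refuses to compile a lookbehind whose body can match strings of different lengths. The following is only a minimal sketch; the variable names are made up for illustration.

```python
import re

# Lookbehind whose body contains "+", so its width is not fixed.
variable_width = r'(?<![A-Z][a-z\-]+)\s([A-Z][a-z\-]+)\sand'

try:
    re.compile(variable_width)
    compiled = True
except re.error as exc:
    # The standard-library engine rejects variable-width lookbehind.
    compiled = False
    print('re.error:', exc)

# A lookbehind of fixed width (exactly one character here) is accepted.
fixed_width = re.compile(r'(?<![a-z])and')
```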

Could you please let me know how I can correct my regex expression? Or is there a better way to do what I want? Thanks!

As you can rely on names starting with a capital letter, you could do something like:

((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)

Swapping out your current pattern for this one should work with your current Python code. You do need to strip() the captured results though, because group 1 includes the whitespace that follows the name.
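A minimal sketch of that swap, assuming the question's approach of borrowing the second word of group 2 as the last name. The function name `split_shared_last_name` is hypothetical and only used for illustration.

```python
import re

# Pattern from the answer: group 1 holds the capitalized word(s) right
# before "and"; group 2 holds the capitalized word(s) right after it.
PATTERN = re.compile(r'((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)')


def split_shared_last_name(caption):
    """Return (first_full_name, second_full_name), or None if no match."""
    so = PATTERN.search(caption)
    if not so:
        return None
    first = so.group(1).strip()   # e.g. 'John' (trailing space removed)
    second = so.group(2).strip()  # e.g. 'Albert McDonald'
    parts = second.split()
    if len(parts) < 2:
        return None               # nothing after the second first name
    # Same indexing as the question's code: second word is the last name.
    return first + ' ' + parts[1], second


print(split_shared_last_name('John and Albert McDonald'))
```

Returning None when group 2 contains a single word is a choice made here to avoid an IndexError; the question's code does not guard against that case.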

For your examples, the pattern combined with your current code yields:

| Input | First print | Second print |
| --- | --- | --- |
| John and Albert McDonald | John McDonald | Albert McDonald |
| Stephen Stewart, John and Albert Diamond | John Diamond | Albert Diamond |
| It was a great day hanging out with John and Stephen Diamond. | John Diamond | Stephen Diamond |
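The question also asks to avoid producing 'Smith Diamond' for 'Jay Smith and Albert Diamond'. With the pattern above, group 1 captures 'Jay Smith' in that case, so one possible way to handle it is to check how many words group 1 contains. This goes beyond what the answer states and is only a sketch: the function name `full_names` is hypothetical, and it assumes the last word of group 2 is the shared last name.

```python
import re

PATTERN = re.compile(r'((?:[A-Z]\w+\s+)+)and\s+((?:[A-Z]\w+(?:\s+|\b))+)')


def full_names(caption):
    """Return the two full names around "and", or [] if none are found."""
    so = PATTERN.search(caption)
    if not so:
        return []
    before = so.group(1).split()  # capitalized words before "and"
    after = so.group(2).split()   # capitalized words after "and"
    if len(after) < 2:
        return []                 # no last name available to share
    if len(before) == 1:
        # Lone first name: borrow the last name (assumed to be the last word).
        return [before[0] + ' ' + after[-1], ' '.join(after)]
    # More than one word before "and": treat it as an already complete name.
    return [' '.join(before), ' '.join(after)]


print(full_names('Jay Smith and Albert Diamond'))
print(full_names('John and Albert McDonald'))
```

Because group 1 can span several capitalized words, no lookbehind is needed: the word count tells the two cases apart.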
