Python and Regular Expression Substring

I am trying to do this:

p = re.compile(ur'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', re.MULTILINE)
test_str = u"Russ Middleton and Lisa Murro\nRon Iervolino, Trish and Russ Middleton, and Lisa Middleton \nRon Iervolino, Kelly  and Tom Murro\nRon Iervolino, Trish and Russ Middleton and Lisa Middleton "
subst = u"$1$2 $3"
result = re.sub(p, subst, test_str)

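The goal is a pattern that both matches all the names and fills in the last name where necessary (e.g. "Trish and Russ Middleton" becomes "Trish Middleton and Russ Middleton"). In the end, I am looking for the names that appear together on the same line.

Someone was kind enough to help me with the regular expression, and I thought I knew how to write the rest in Python (although I am new to Python). I could not get it to work, so I fell back on the code generated by Regex101 (the code shown above). However, the result I get is: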

u'$1$2 $3 and $1$2 $3\n$1$2 $3, $1$2 $3 and $1$2 $3, and $1$2 $3 \n$1$2 $3, $1$2 $3  and $1$2 $3\n$1$2 $3, $1$2 $3 and $1$2 $3 and $1$2 $3 '

What am I missing about Python and regular expressions?

You are not using the right syntax for subst. Try this instead:

subst = r'\1\2 \3'
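To make the syntax difference concrete, here is a minimal sketch on a made-up two-word string (the sample text and variable names are illustrative, not from the question). It shows that `re.sub` treats `$1` as literal text, while `\1` and `\g<1>` refer to captured groups.

```python
import re

text = "hello world"          # illustrative sample, not from the question
pattern = r"(\w+) (\w+)"

# "$2 $1" has no special meaning to re.sub, so it is inserted literally.
dollar_result = re.sub(pattern, "$2 $1", text)

# Backslash-number and \g<number> both refer to captured groups.
backslash_result = re.sub(pattern, r"\2 \1", text)
g_result = re.sub(pattern, r"\g<2> \g<1>", text)

print(dollar_result)
print(backslash_result)
print(g_result)
```

The `\g<n>` form is handy when a group reference is immediately followed by a digit, since `\10` would otherwise be read as group 10.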

However, you now have the problem that no match has all three groups participating.

Specifically:

>>> for x in p.finditer(test_str): print(x.groups())
... 
('Russ Middleton', None, None)
('Lisa Murro', None, None)
('Ron Iervolino', None, None)
(None, 'Trish', 'Middleton')
('Russ Middleton', None, None)
('Lisa Middleton', None, None)
('Ron Iervolino', None, None)
(None, 'Kelly', 'Murro')
('Tom Murro', None, None)
('Ron Iervolino', None, None)
(None, 'Trish', 'Middleton')
('Russ Middleton', None, None)
('Lisa Middleton', None, None)

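Wherever you see a None here, trying to interpolate the corresponding group (\1, etc.) in the substitution is an error. (Since Python 3.5, an unmatched group in the replacement is substituted with an empty string instead of raising an error.)

A function can be more flexible: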
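As a small illustration of groups that do not participate in a match, the sketch below uses a toy two-branch pattern (an assumption for demonstration, not the pattern from the question). Each match fills only one of the two groups, and the other is reported as None. The `sub` call at the end assumes Python 3.5 or later, where an unmatched group becomes an empty string; older versions raise an "unmatched group" error instead.

```python
import re

# Toy pattern: each match fills either group 1 or group 2, never both.
pattern = re.compile(r"(a)|(b)")

for mo in pattern.finditer("ab"):
    # The group from the branch that was not taken is None.
    print(mo.groups())

# Assumes Python 3.5+: unmatched groups are replaced with an empty string.
bracketed = pattern.sub(r"[\1\2]", "ab")
print(bracketed)
```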

>>> def mysub(mo):
...   return '{}{} {}'.format(
...     mo.group(1) or '',
...     mo.group(2) or '',
...     mo.group(3) or '')
... 
>>> result = re.sub(p, mysub, test_str)
>>> result
'Russ Middleton  and Lisa Murro \nRon Iervolino , Trish Middleton and Russ Middleton , and Lisa Middleton  \nRon Iervolino , Kelly Murro  and Tom Murro \nRon Iervolino , Trish Middleton and Russ Middleton  and Lisa Middleton  '

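Here I have coded mysub to do what I suspect you thought a substitution string with group numbers would do for you: use an empty string wherever a group did not match (i.e. wherever the corresponding mo.group(...) is None).

I suggest a simple solution.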
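The output of mysub contains stray spaces (for example "Russ Middleton  and") because the format string always emits the separating space. One possible variation, sketched below, joins only the groups that actually matched. The names `NAME_PATTERN` and `join_groups` are illustrative and not part of the original answer; the pattern is the one from the question, written with an `r` prefix so it also runs on Python 3.

```python
import re

# Pattern from the question (r'' instead of the Python 2-only ur'' prefix).
NAME_PATTERN = re.compile(
    r"([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))"
)

def join_groups(mo):
    # Keep only the groups that took part in the match,
    # so no extra space is added for missing ones.
    return " ".join(g for g in mo.groups() if g)

sample = "Ron Iervolino, Trish and Russ Middleton"
cleaned = NAME_PATTERN.sub(join_groups, sample)
print(cleaned)
```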

import re
string = """Russ Middleton and Lisa Murro
Ron Iervolino, Trish and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly  and Tom Murro
Ron Iervolino, Trish and Russ Middleton and Lisa Middleton """
m = re.sub(r'(?<=,\s)([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', r'\1 \2', string)
print(m)

Output:

Russ Middleton and Lisa Murro
Ron Iervolino, Trish Middleton and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly Murro  and Tom Murro
Ron Iervolino, Trish Middleton and Russ Middleton and Lisa Middleton

Or, using the third-party regex module, which supports variable-length lookbehind:

import regex
string = """Russ Middleton and Lisa Murro
Ron Iervolino, Trish and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly  and Tom Murro
Ron Iervolino, Trish and Russ Middleton and Lisa Middleton 
Trish and Russ Middleton"""
m = regex.sub(r'(?<!\b[A-Z]\w+\s)([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', r'\1 \2', string)
print(m)

Output:

Russ Middleton and Lisa Murro
Ron Iervolino, Trish Middleton and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly Murro  and Tom Murro
Ron Iervolino, Trish Middleton and Russ Middleton and Lisa Middleton 
Trish Middleton and Russ Middleton
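If installing the regex module is not an option, the start-of-line case can probably be covered with the standard re module as well. The sketch below is an assumption on my part, not something from the answer above: it accepts either the fixed-width lookbehind `(?<=,\s)` or a line start (`^` with re.MULTILINE) before the first name. The variable names are illustrative.

```python
import re

text = """Russ Middleton and Lisa Murro
Ron Iervolino, Trish and Russ Middleton, and Lisa Middleton
Ron Iervolino, Kelly  and Tom Murro
Trish and Russ Middleton"""

# A lone first name must follow ", " or start a line; the lookahead
# captures the surname of the person named after "and".
lone_first_name = re.compile(
    r"(?:(?<=,\s)|^)([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))",
    re.MULTILINE,
)

filled = lone_first_name.sub(r"\1 \2", text)
print(filled)
```

This is narrower than the regex version, since it only recognizes a lone first name after a comma or at the start of a line, which is all the sample data requires.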

Alex: I see what you are saying about the groups. That had not occurred to me. Thanks!

I took a different approach, and it seems to be working. Any thoughts?

p = re.compile(ur'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', re.MULTILINE)
temp_result = p.findall(s)
joiner = " ".join
out = [joiner(words).strip() for words in temp_result]

Here is some input data:

test_data = ['John Smith, Barri Lieberman, Nancy Drew','Carter Bays and Craig Thomas','John Smith and Carter Bays',
                     'Jena Silverman, John Silverman, Tess Silverman, and Dara Silverman', 'Tess and Dara Silverman',
                     'Nancy Drew, John Smith, and Daniel Murphy', 'Jonny Podell']

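I put the code above in a function so that I can call it on each item in the list. Calling it on the list above gives the following output (from the function):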

['John Smith', 'Barri Lieberman', 'Nancy Drew']
['Carter Bays', 'Craig Thomas']
['John Smith', 'Carter Bays']
['Jena Silverman', 'John Silverman', 'Tess Silverman', 'Dara Silverman']
['Tess Silverman', 'Dara Silverman']
['Nancy Drew', 'John Smith', 'Daniel Murphy']
['Jonny Podell']
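For reference, the function described above might look roughly like the sketch below. The name `extract_names` is hypothetical (the post does not show the actual function), and the pattern uses an `r` prefix because the `ur` prefix is only valid on Python 2.

```python
import re

# Same pattern as above, with r'' so that it also compiles on Python 3.
NAME_PATTERN = re.compile(
    r"([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))"
)

def extract_names(line):
    # findall returns one 3-tuple per match, with '' for groups that did
    # not match; joining and stripping leaves just the full name.
    return [" ".join(groups).strip() for groups in NAME_PATTERN.findall(line)]

test_data = [
    "John Smith, Barri Lieberman, Nancy Drew",
    "Carter Bays and Craig Thomas",
    "Tess and Dara Silverman",
    "Jonny Podell",
]

for line in test_data:
    print(extract_names(line))
```

Unlike match objects, which report unmatched groups as None, findall reports them as empty strings, which is why a plain join followed by strip is enough here.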
