
Python and Regular Expression Substring

I'm attempting to do this:

import re

p = re.compile(r'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', re.MULTILINE)
test_str = u"Russ Middleton and Lisa Murro\nRon Iervolino, Trish and Russ Middleton, and Lisa Middleton \nRon Iervolino, Kelly  and Tom Murro\nRon Iervolino, Trish and Russ Middleton and Lisa Middleton "
subst = u"$1$2 $3"
result = re.sub(p, subst, test_str)

The goal is to get something that both matches all the names and fills in last names when necessary (e.g., Trish and Russ Middleton becomes Trish Middleton and Russ Middleton). In the end, I'm looking for the names that appear together in a single line.

Someone else was kind enough to help me with the regex, and I thought I knew how to write it programmatically in Python (although I'm new to Python). Not being able to get it working, I resorted to using the code generated by Regex101 (the code shown above). However, all I get in result is:

u'$1$2 $3 and $1$2 $3\n$1$2 $3, $1$2 $3 and $1$2 $3, and $1$2 $3 \n$1$2 $3, $1$2 $3  and $1$2 $3\n$1$2 $3, $1$2 $3 and $1$2 $3 and $1$2 $3 '

What am I missing with Python and regular expressions?

You're not using the right syntax for subst. Python's re module writes group references in a replacement string as \1, \2 and so on, not $1. Try this instead:

subst = r'\1\2 \3'
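As a quick illustration of that syntax, here is a small sketch on a made-up two-word input (the pattern and the text are not from the question). It shows the numbered form, the equivalent \g<n> form, and what happens when the replacement uses $n.

```python
import re

# Hypothetical sample input, chosen only to show the replacement syntax.
pattern = re.compile(r'(\w+) (\w+)')
text = 'hello world'

swapped = pattern.sub(r'\2 \1', text)          # numbered backreferences
swapped_g = pattern.sub(r'\g<2> \g<1>', text)  # equivalent \g<n> form
literal = pattern.sub('$2 $1', text)           # "$n" is copied through literally

print(swapped)    # world hello
print(swapped_g)  # world hello
print(literal)    # $2 $1
```

The last line is the same effect seen in the question: re treats "$2 $1" as ordinary text.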

However, now you have the problem that there aren't three matched groups in each match.

Specifically:

>>> for x in p.finditer(test_str): print(x.groups())
... 
('Russ Middleton', None, None)
('Lisa Murro', None, None)
('Ron Iervolino', None, None)
(None, 'Trish', 'Middleton')
('Russ Middleton', None, None)
('Lisa Middleton', None, None)
('Ron Iervolino', None, None)
(None, 'Kelly', 'Murro')
('Tom Murro', None, None)
('Ron Iervolino', None, None)
(None, 'Trish', 'Middleton')
('Russ Middleton', None, None)
('Lisa Middleton', None, None)

Whenever you see a None here, trying to interpolate the corresponding group (\1, etc.) in a substitution string raises an error on Python versions before 3.5; from 3.5 on, an unmatched group is replaced with an empty string.

A function can be more flexible:

>>> def mysub(mo):
...   return '{}{} {}'.format(
...     mo.group(1) or '',
...     mo.group(2) or '',
...     mo.group(3) or '')
... 
>>> result = re.sub(p, mysub, test_str)
>>> result
'Russ Middleton  and Lisa Murro \nRon Iervolino , Trish Middleton and Russ Middleton , and Lisa Middleton  \nRon Iervolino , Kelly Murro  and Tom Murro \nRon Iervolino , Trish Middleton and Russ Middleton  and Lisa Middleton  '

Here, I've coded mysub to do what I suspect you thought a substitution string with group numbers would do for you: use an empty string where a group did not match (i.e., the corresponding mo.group(...) is None).

I suggest a simple solution.
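A variant of the same idea, as a sketch only: joining just the groups that matched avoids the stray spaces visible in the result above. The function name fill_last_name and the shortened sample line are made up for illustration.

```python
import re

p = re.compile(
    r'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))',
    re.MULTILINE)

def fill_last_name(mo):
    # Keep only the groups that took part in this match, then join them
    # with a single space (hypothetical helper, not from the thread).
    return ' '.join(g for g in mo.groups() if g)

line = 'Ron Iervolino, Trish and Russ Middleton'
fixed = p.sub(fill_last_name, line)
print(fixed)  # Ron Iervolino, Trish Middleton and Russ Middleton
```

Text between the matches (commas, "and", whitespace) is left untouched, so only the names themselves are rewritten.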

import re
string = """Russ Middleton and Lisa Murro
Ron Iervolino, Trish and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly  and Tom Murro
Ron Iervolino, Trish and Russ Middleton and Lisa Middleton """
m = re.sub(r'(?<=,\s)([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', r'\1 \2', string)
print(m)

Output:

Russ Middleton and Lisa Murro
Ron Iervolino, Trish Middleton and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly Murro  and Tom Murro
Ron Iervolino, Trish Middleton and Russ Middleton and Lisa Middleton


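Or, with the third-party regex module, which (unlike re) accepts the variable-width lookbehind used here: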

import regex
string = """Russ Middleton and Lisa Murro
Ron Iervolino, Trish and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly  and Tom Murro
Ron Iervolino, Trish and Russ Middleton and Lisa Middleton 
Trish and Russ Middleton"""
m = regex.sub(r'(?<!\b[A-Z]\w+\s)([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', r'\1 \2', string)
print(m)

Output:

Russ Middleton and Lisa Murro
Ron Iervolino, Trish Middleton and Russ Middleton, and Lisa Middleton 
Ron Iervolino, Kelly Murro  and Tom Murro
Ron Iervolino, Trish Middleton and Russ Middleton and Lisa Middleton 
Trish Middleton and Russ Middleton

Alex: I see what you're saying about the groups. That didn't occur to me. Thanks!

I took a fresh(ish) approach. This appears to be working. Any thoughts on it?

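p = re.compile(r'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))', re.MULTILINE)
temp_result = p.findall(s)  # s is one input string; findall gives one tuple of groups per match
joiner = " ".join
out = [joiner(words).strip() for words in temp_result]  # strip the padding left by empty groups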

Here is some input data:

test_data = ['John Smith, Barri Lieberman, Nancy Drew','Carter Bays and Craig Thomas','John Smith and Carter Bays',
                     'Jena Silverman, John Silverman, Tess Silverman, and Dara Silverman', 'Tess and Dara Silverman',
                     'Nancy Drew, John Smith, and Daniel Murphy', 'Jonny Podell']

I put the code above in a function so I could call it on every item in the list. Calling it on each item in the list above, I get this output from the function:

['John Smith', 'Barri Lieberman', 'Nancy Drew']
['Carter Bays', 'Craig Thomas']
['John Smith', 'Carter Bays']
['Jena Silverman', 'John Silverman', 'Tess Silverman', 'Dara Silverman']
['Tess Silverman', 'Dara Silverman']
['Nancy Drew', 'John Smith', 'Daniel Murphy']
['Jonny Podell']
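The wrapper function itself is not shown in the post; a minimal sketch of what it might look like follows. The names extract_names and NAME_RE are assumptions, and the pattern uses the r prefix rather than ur so the snippet also runs on Python 3.

```python
import re

NAME_RE = re.compile(
    r'([A-Z]\w+\s+[A-Z]\w+)|([A-Z]\w+)(?=\s+and\s+[A-Z]\w+\s+([A-Z]\w+))',
    re.MULTILINE)

def extract_names(s):
    # findall returns one (full_name, first_name, last_name) tuple per match;
    # groups that did not take part in the match come back as empty strings,
    # so joining and stripping leaves a clean "First Last" string.
    return [" ".join(words).strip() for words in NAME_RE.findall(s)]

# A few of the items from the input data above.
sample = ['Carter Bays and Craig Thomas', 'Tess and Dara Silverman', 'Jonny Podell']
for item in sample:
    print(extract_names(item))
```

Using findall instead of sub sidesteps the unmatched-group issue entirely, because the names are collected into a list rather than interpolated into a replacement string.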
