
Ignore strings which do not completely match regex?

I want to return all recipients of an email using regex. For example:

Date: Wed, 6 Dec 2000 02:03:00 -0800 (PST)
From: donald.herrick@enron.com
To: brianherrick@email.msn.com, herriceu2@tdprs.state.tx.us, 
    robertherrick@bankunited.com, kristi.demaiolo@enron.com, 
    suresh.raghavan@enron.com, harry.arora@enron.com
Subject: FW: If Santa Answered his mail...
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Donald W Herrick
X-To: brianherrick@email.msn.com, HERRICEU2@tdprs.state.tx.us, RobertHerrick@bankunited.com, Kristi Demaiolo, Suresh Raghavan, Harry Arora
X-cc: 
X-bcc: 

Should return (from the "To: " line) brianherrick@email.msn.com, herriceu2@tdprs.state.tx.us, robertherrick@bankunited.com, kristi.demaiolo@enron.com, suresh.raghavan@enron.com, harry.arora@enron.com

but not (from the "X-To: " line) brianherrick@email.msn.com, HERRICEU2@tdprs.state.tx.us, RobertHerrick@bankunited.com.

My current regex is re.findall(r'[To:\s][\w\.-]+@[\w\.-]+', text), which returns the addresses from the "To: ", "X-To: " and "From: " lines.

My questions:

  1. Why is the email address on the "From: " line also returned? It doesn't match the [To:\s] part of the regex?!
  2. How can I ensure that only email addresses which follow "To: " are returned? (That is, how do I exclude email addresses following "X-To: "?) I think lookahead assertions can be used for this, but I am not sure how.

As an addendum to @MartijnPieters's answer, regex may not be the right tool for the job. To parse an email message, it is recommended to use email.parser:

>>> import pprint
>>> from email.parser import Parser
>>> headers = Parser().parsestr(email_str)
>>> # Split on commas (not whitespace) so no trailing commas are left behind.
>>> pprint.pprint([addr.strip() for addr in headers['to'].split(',')])
['brianherrick@email.msn.com',
 'herriceu2@tdprs.state.tx.us',
 'robertherrick@bankunited.com',
 'kristi.demaiolo@enron.com',
 'suresh.raghavan@enron.com',
 'harry.arora@enron.com']

You've misunderstood what a character class does; your pattern matches anywhere a string contains a T, o, : or whitespace character directly before an address.

That's because [To:\s] is a character class: any one character in the set will match. This is why your From: line matches; the space between : and d suffices here.
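To make this concrete, here is a small sketch using two header lines from the sample message. It only illustrates the point above: the class consumes a single character, so the space in front of each address is enough for a match.

```python
import re

# The pattern from the question: [To:\s] is a character class, so it
# matches exactly ONE character that is 'T', 'o', ':' or whitespace.
pattern = re.compile(r'[To:\s][\w\.-]+@[\w\.-]+')

# Two header lines taken from the sample message.
from_line = 'From: donald.herrick@enron.com'
x_to_line = 'X-To: brianherrick@email.msn.com'

# In both lines the single space before the address satisfies the class,
# so the header name in front of it is never checked.
from_matches = pattern.findall(from_line)
x_to_matches = pattern.findall(x_to_line)

print(from_matches)   # [' donald.herrick@enron.com']
print(x_to_matches)   # [' brianherrick@email.msn.com']
```

Note the leading space in each result: that space is the one character the class matched.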

If you need to validate the whole header name, anchor your match to the start of lines with ^, and replace the character class with the literal text To: followed by whitespace:

r'^To:\s+[\w\.-]+@[\w\.-]+'

Now the To: part only matches at the start of a line, provided you use the re.MULTILINE flag:

>>> import re
>>> text = '''\
... Date: Wed, 6 Dec 2000 02:03:00 -0800 (PST)
... From: donald.herrick@enron.com
... To: brianherrick@email.msn.com, herriceu2@tdprs.state.tx.us, 
...     robertherrick@bankunited.com, kristi.demaiolo@enron.com, 
...     suresh.raghavan@enron.com, harry.arora@enron.com
... Subject: FW: If Santa Answered his mail...
... Mime-Version: 1.0
... Content-Type: text/plain; charset=us-ascii
... Content-Transfer-Encoding: 7bit
... X-From: Donald W Herrick
... X-To: brianherrick@email.msn.com, HERRICEU2@tdprs.state.tx.us, RobertHerrick@bankunited.com, Kristi Demaiolo, Suresh Raghavan, Harry Arora
... X-cc: 
... X-bcc: 
... '''
>>> re.findall(r'^To:\s+[\w\.-]+@[\w\.-]+', text)
[]
>>> re.findall(r'^To:\s+[\w\.-]+@[\w\.-]+', text, flags=re.M)
['To: brianherrick@email.msn.com']

This can only ever match the first email address, and only if it doesn't include anything like a full name (Brian Herrick <brianherrick@email.msn.com>, for example).

You'd have to match the whole header instead:

re.findall(r'^To:\s+((?:.*(?:\n[ \t]+)?)*)', text, flags=re.M)

This matches the To: header followed by any number of header continuation lines (lines starting with whitespace):

>>> re.findall(r'^To:\s+((?:.*(?:\n[ \t]+)?)*)', text, flags=re.M)
['brianherrick@email.msn.com, herriceu2@tdprs.state.tx.us, \n    robertherrick@bankunited.com, kristi.demaiolo@enron.com, \n    suresh.raghavan@enron.com, harry.arora@enron.com']

and you'd have to extract the email addresses from that captured value in a separate step.
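A minimal sketch of that two-step approach, assuming the crude address pattern from the question is good enough for your data (it will not cope with display names or quoted local parts). The shortened sample text and the names header_re, address_re and recipients are illustrative only.

```python
import re

# Shortened version of the sample message, with a folded To: header.
text = (
    "From: donald.herrick@enron.com\n"
    "To: brianherrick@email.msn.com, herriceu2@tdprs.state.tx.us, \n"
    "    robertherrick@bankunited.com\n"
    "Subject: FW: If Santa Answered his mail...\n"
    "X-To: brianherrick@email.msn.com, HERRICEU2@tdprs.state.tx.us\n"
)

# Step 1: capture the whole To: header, including continuation lines.
header_re = re.compile(r'^To:\s+((?:.*(?:\n[ \t]+)?)*)', flags=re.M)
# Step 2: a deliberately crude address pattern (same as in the question).
address_re = re.compile(r'[\w\.-]+@[\w\.-]+')

recipients = []
for header_value in header_re.findall(text):
    recipients.extend(address_re.findall(header_value))

print(recipients)
```

Because the first pattern is anchored with ^ and re.M, the X-To: line never reaches the second step.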

Personally, I'd look into the email package instead; it makes grabbing headers much easier:

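import email
import email.utils  # import the submodule explicitly rather than relying on side effects

message = email.message_from_string(text)
to_headers = message.get_all('to')  # matched case-insensitively; X-To is a different header
addresses = email.utils.getaddresses(to_headers)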

Demo:

>>> import email
>>> m = email.message_from_string(text)
>>> m.get_all('to')
['brianherrick@email.msn.com, herriceu2@tdprs.state.tx.us, \n    robertherrick@bankunited.com, kristi.demaiolo@enron.com, \n    suresh.raghavan@enron.com, harry.arora@enron.com']
>>> email.utils.getaddresses(m.get_all('to'))
[('', 'brianherrick@email.msn.com'), ('', 'herriceu2@tdprs.state.tx.us'), ('', 'robertherrick@bankunited.com'), ('', 'kristi.demaiolo@enron.com'), ('', 'suresh.raghavan@enron.com'), ('', 'harry.arora@enron.com')]

Now you have all the email addresses.

The email.utils.getaddresses() function can also be applied to the result of the regular expression:

>>> email.utils.getaddresses(re.findall(r'^To:\s+((?:.*(?:\n[ \t]+)?)*)', text, flags=re.M))
[('', 'brianherrick@email.msn.com'), ('', 'herriceu2@tdprs.state.tx.us'), ('', 'robertherrick@bankunited.com'), ('', 'kristi.demaiolo@enron.com'), ('', 'suresh.raghavan@enron.com'), ('', 'harry.arora@enron.com')]
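The full-name case mentioned earlier is where getaddresses() earns its keep. The following sketch uses a made-up message in which one recipient has a display name; the variable names raw, pairs and addresses are illustrative only.

```python
import email
import email.utils

# Hypothetical message: one recipient carries a display name, and the
# To: header is folded across two lines.
raw = (
    "From: donald.herrick@enron.com\n"
    "To: Brian Herrick <brianherrick@email.msn.com>,\n"
    "    kristi.demaiolo@enron.com\n"
    "X-To: brianherrick@email.msn.com\n"
    "\n"
    "body\n"
)

message = email.message_from_string(raw)

# get_all() returns a list of header values; pass [] as the default so a
# message without a To: header yields an empty list instead of None.
pairs = email.utils.getaddresses(message.get_all('to', []))

# Each pair is (display name, address); keep only the address part.
addresses = [addr for _name, addr in pairs]

print(pairs)
print(addresses)
```

The display name is split off for you, so no extra pattern is needed to handle the angle brackets.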

regex module: infinite lookbehind and other features

If you want to use regex, I suggest you use the outstanding third-party regex module instead of re. This regex will return all matches:

(?<=(?<!X\S+)To:\s*(?:[^@\s]+@[^\,\s]+,\s*)*?)[^@\s]+@[^\,\s]+

Sample Code

I tested this in Python 3.4.

import regex
subject = """Date: Wed, 6 Dec 2000 02:03:00 -0800 (PST)
From: donald.herrick@enron.com
To: brianherrick@email.msn.com, herriceu2@tdprs.state.tx.us, 
    robertherrick@bankunited.com, kristi.demaiolo@enron.com, 
    suresh.raghavan@enron.com, harry.arora@enron.com
Subject: FW: If Santa Answered his mail...
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Donald W Herrick
X-To: brianherrick@email.msn.com, HERRICEU2@tdprs.state.tx.us, RobertHerrick@bankunited.com, Kristi Demaiolo, Suresh Raghavan, Harry Arora
X-cc: 
X-bcc: """
# Raw string, so that \S and \s reach the regex engine unchanged.
pattern = r"(?<=(?<!X\S+)To:\s*(?:[^@\s]+@[^\,\s]+,\s*)*?)[^@\s]+@[^\,\s]+"

for match in regex.finditer(pattern, subject):
    print(match.group())

Output

brianherrick@email.msn.com
herriceu2@tdprs.state.tx.us
robertherrick@bankunited.com
kristi.demaiolo@enron.com
suresh.raghavan@enron.com
harry.arora@enron.com

Explanation

  • We have one big lookbehind, then a very basic email matcher: [^@\s]+@[^\,\s]+ which matches any chars that are not an @ sign or whitespace char, then an @ sign, then any chars that are not a comma or whitespace char (the end-of-email delimiters in your input)
  • That email matcher can be replaced by a more sophisticated email regex if need be
  • Now to the big lookbehind, the part that starts with (?<=(?
  • The first part, (?<!X\S+)To:\s*, matches To: as long as it is not preceded by Xsomething, as asserted by the negative lookbehind (?<!X\S+)
  • The non-capture group (?:[^@\s]+@[^\,\s]+,\s*)*? matches as few repetitions as needed (the lazy *? quantifier) of the expression [^@\s]+@[^\,\s]+,\s* to allow what follows the lookbehind to match. This is an "email skipper" that lets us gradually skip more and more emails with every match
  • [^@\s]+@[^\,\s]+,\s* is simply a crude email followed by a comma and optional whitespace chars (the \s matches not only spaces but also newlines, carriage returns, tabs etc.)
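The reason the regex module is needed here is that the standard-library re module only accepts fixed-width lookbehinds. The sketch below shows this, and shows what a fixed-width variant can and cannot do; the example.com addresses and the names fixed, sample and first_only are made up for illustration.

```python
import re

# The variable-width lookbehind (?<!X\S+) relies on the third-party regex
# module; the standard-library re module refuses to compile it.
try:
    re.compile(r'(?<!X\S+)To:')
    stdlib_accepts_variable_lookbehind = True
except re.error:
    stdlib_accepts_variable_lookbehind = False

# A fixed-width lookbehind such as (?<!X-) is accepted by re, but without
# the "email skipper" inside a lookbehind it only reaches the first address.
fixed = re.compile(r'(?<!X-)To:\s*([^@\s]+@[^,\s]+)')
sample = "To: a@example.com, b@example.com\nX-To: c@example.com\n"
first_only = fixed.findall(sample)

print(stdlib_accepts_variable_lookbehind)  # False
print(first_only)                          # ['a@example.com']
```

So with plain re you would still need a second step to collect the remaining addresses from the header.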
