Python extract email address from a HUGE string
I have been using this: (I know, there are probably more efficient ways...)
Given this in an email message:
Submitted data:
First Name: MyName
Your Email Address: email@domain.com
TAG:
I coded this:
intStart = (bodystring.rfind('First ')) + 12
intEnd = (bodystring.rfind('Your Email'))
receiver_name = bodystring[intStart:intEnd]
intStart = (bodystring.rfind('Your Email Address: ')) + 20
intEnd = (bodystring.rfind('TAG:'))
receiver_email = bodystring[intStart:intEnd]
... and got what I needed. This worked because I had the 'TAG' label.
Now I am given this:
Submitted data:
First name: MyName
Last name:
Email: email@domain.com
I'm having a brain block on getting the email address when there is no next word after it. There is whitespace. Can someone nudge me in the right direction? I suspect I can dig out the email address after the occurrence of 'Email:' using regex...
You can, in fact, use a regex to extract e-mails.
To find a single e-mail in a text, you can use re.search().group().
In case you want to find multiple e-mails, you can use re.findall().
An example:
import re
text = "First name: MyName Last name: Email: email@domain.com "
email = re.search(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", text)
print(email.group())
emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", text)
print(emails)
This would give the output:
email@domain.com
['email@domain.com']
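One caveat with the snippet above: re.search() returns None when nothing matches, so calling .group() on the result directly would raise AttributeError, and the character classes only cover lowercase letters. A minimal sketch that guards against both, assuming a hypothetical helper named first_email (not part of the original answer) and the same loose pattern:

```python
import re

# Same loose pattern as above; re.IGNORECASE also accepts uppercase letters.
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.IGNORECASE)


def first_email(text):
    """Return the first email-like substring in text, or None if there is none."""
    match = EMAIL_RE.search(text)
    return match.group() if match else None


print(first_email("First name: MyName Last name: Email: Email@Domain.com "))
print(first_email("no address here"))
```

The pattern is deliberately permissive; it is meant for pulling an address out of form text, not for validating one.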
Searching for strings is often better done with splitting, and occasionally regular expressions. So first split the lines:
bodylines = bodystring.splitlines()
Split the resulting lines on the : delimiter (make a generator):
# Split once, and skip lines without ':' so each chunk unpacks into (key, val)
chunks = (line.split(':', 1) for line in bodylines if ':' in line)
Now grab the first one that has "email" on the left and @ on the right:
address = next(val.strip() for key, val in chunks if 'email' in key.lower() and '@' in val)
If you want all the emails across multiple lines, replace next with a list comprehension:
addresses = [val.strip() for key, val in chunks if 'email' in key.lower() and '@' in val]
This can be done in one line with no imports (if you replace chunks with its definition, not that I recommend it). Regular expressions are a much heavier tool that lets you specify much more general patterns, but they are often slower as a result. If you can get away with simple and effective tools, do it: don't bring in the sledgehammer until you need it!
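Putting the splitting steps together, here is a sketch under a couple of assumptions: the helper name emails_by_label is hypothetical, and str.partition is swapped in for split(':') so that blank lines, or lines with more than one colon, cannot break the key/value unpacking.

```python
def emails_by_label(bodystring):
    """Collect values from 'label: value' lines whose label mentions email."""
    addresses = []
    for line in bodystring.splitlines():
        # partition always returns three parts, even when ':' is absent.
        key, sep, val = line.partition(':')
        if sep and 'email' in key.lower() and '@' in val:
            addresses.append(val.strip())
    return addresses


body = (
    "Submitted data:\n"
    "First name: MyName\n"
    "Last name:\n"
    "\n"
    "Email: email@domain.com   \n"
)
print(emails_by_label(body))
```

Taking the first element of the returned list (after checking that it is non-empty) gives the single-address behaviour of the next() version.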
If the email should come after the word Email followed by a :, you could match the field name part and capture the email in a group with an email-like pattern.
\bEmail[^:]*:\s*([^\s@]+@[^\s@]+)
The pattern in parts:
\bEmail - A word boundary to prevent a partial match, then match Email
[^:]*:\s* - Match optional chars other than :, then match : and optional whitespace chars
( - Capture group 1
[^\s@]+@[^\s@]+ - Match a single @ between 1 or more non-whitespace chars, excluding the @ itself
) - Close group 1
Example with re.findall that returns the values of the capture groups:
import re
regex = r"\bEmail[^:]*:\s*([^\s@]+@[^\s@]+)"
s = ("Submitted data:\n"
"First Name: MyName\n"
"Your Email Address: email@domain.com\n"
"TAG:\n\n"
"Submitted data:\n"
"First name: MyName\n"
"Last name:\n"
"Email: email@domain.com")
print(re.findall(regex, s))
Output
['email@domain.com', 'email@domain.com']
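Since the original question also needed the name, the same label-then-capture idea can be sketched for both fields at once. This is only an illustration: the helper parse_submission, the "First name" label pattern, and the use of re.IGNORECASE are assumptions based on the two sample messages, not something the answer above prescribes.

```python
import re

# Label, optional extra words, a colon, then the value.
NAME_RE = re.compile(r"\bFirst name[^:\n]*:[ \t]*(.*)", re.IGNORECASE)
EMAIL_RE = re.compile(r"\bEmail[^:\n]*:\s*([^\s@]+@[^\s@]+)", re.IGNORECASE)


def parse_submission(body):
    """Return (name, email) from a form body; either may be None."""
    name = NAME_RE.search(body)
    email = EMAIL_RE.search(body)
    return (
        name.group(1).strip() if name else None,
        email.group(1) if email else None,
    )


old_format = "Submitted data:\nFirst Name: MyName\nYour Email Address: email@domain.com\nTAG:"
new_format = "Submitted data:\nFirst name: MyName\nLast name:\nEmail: email@domain.com"
print(parse_submission(old_format))
print(parse_submission(new_format))
```

Because neither pattern depends on a trailing 'TAG:' marker, the same function handles both message layouts from the question.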