Python: extract an email address from a huge string

Question

I have been using this (I know, there are probably more efficient ways...).

Given this in an email message:

Submitted data:
First Name: MyName
Your Email Address: email@domain.com
TAG:

I coded this:

intStart = (bodystring.rfind('First ')) + 12
intEnd = (bodystring.rfind('Your Email'))
receiver_name = bodystring[intStart:intEnd]

intStart = (bodystring.rfind('Your Email Address: ')) + 20
intEnd = (bodystring.rfind('TAG:'))
receiver_email = bodystring[intStart:intEnd]

... and got what I needed. This worked because I had the 'TAG' label.

Now I am given this:

Submitted data:
First name: MyName
Last name:
Email: email@domain.com

I'm having a brain block on getting the email address when no word follows it. There is whitespace. Can someone nudge me in the right direction? I suspect I can dig out the email address after the occurrence of 'Email:' using regex...

Answer 1

You can, in fact, use a regular expression to extract email addresses.

To find a single email address in a text, use re.search().group().
To find multiple email addresses, use re.findall().

An example:

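    import re

    text = "First name: MyName Last name: Email: email@domain.com "

    # Matches lowercase addresses only; pass re.IGNORECASE to accept uppercase too.
    pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

    # re.search returns a match object for the first occurrence
    email = re.search(pattern, text)
    print(email.group())

    # re.findall returns every occurrence as a list of strings
    emails = re.findall(pattern, text)
    print(emails)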

This gives the output:

email@domain.com
['email@domain.com']
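
One caveat with the first form: re.search returns None when nothing matches, so calling .group() on the result raises AttributeError. A minimal sketch of a guarded version follows. The helper name first_email and the re.IGNORECASE flag are additions for illustration and are not part of the answer above.

```python
import re

# Same pattern as in the answer, compiled once. IGNORECASE is an added
# assumption so that addresses containing uppercase letters also match.
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.IGNORECASE)

def first_email(text):
    """Return the first email-like substring in text, or None if there is none."""
    match = EMAIL_RE.search(text)
    return match.group() if match else None

print(first_email("First name: MyName Last name: Email: email@domain.com "))
print(first_email("no address here"))
```

The second call prints None instead of raising, which is usually easier to handle when the input comes from arbitrary email bodies.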

Answer 2

Searching for strings is often better done with splitting, and occasionally regular expressions. So first split the lines:

bodylines = bodystring.splitlines()

Split each line on its first : delimiter, skipping lines that contain no : so that every chunk unpacks into exactly two parts (make a generator):

chunks = (line.split(':', 1) for line in bodylines if ':' in line)

Now grab the first one that has "email" on the left and @ on the right:

address = next(val.strip() for key, val in chunks if 'email' in key.lower() and '@' in val)

If you want all the emails across multiple lines, use a list comprehension instead of next (recreate chunks first if you have already called next on it, because a generator can only be consumed once):

addresses = [val.strip() for key, val in chunks if 'email' in key.lower() and '@' in val]

This can be done in one line with no imports (if you replace chunks with its definition, not that I recommend it). Regular expressions are a much heavier tool: they let you specify far more general patterns, but are often slower as a result. If you can get away with simple and effective tools, do it: don't bring in the sledgehammer until you need it!
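
The steps above can be collected into one small function. This is only a sketch: the name find_emails is hypothetical, and it uses str.partition instead of str.split so that blank lines, lines without a colon, and values that themselves contain a colon are all handled without raising.

```python
def find_emails(bodystring):
    """Return the values of all 'label: value' lines whose label mentions email."""
    addresses = []
    for line in bodystring.splitlines():
        # partition splits on the first ':' only and always returns three
        # parts; sep is '' when the line has no colon at all.
        key, sep, val = line.partition(':')
        if sep and 'email' in key.lower() and '@' in val:
            addresses.append(val.strip())
    return addresses

body = "Submitted data:\nFirst name: MyName\nLast name:\n\nEmail: email@domain.com\n"
print(find_emails(body))
```

Taking the first element of the returned list (after checking it is non-empty) gives the single-address behaviour of the next(...) call.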

Answer 3

If the email should come after the word Email followed by a : , you could match that label and capture the address in a group using an email-like pattern.

\bEmail[^:]*:\s*([^\s@]+@[^\s@]+)

\bEmail  A word boundary to prevent a partial match, then the literal text Email
[^:]*:\s*  Match optional characters other than : , then match : and optional whitespace characters
(  Open capture group 1
  [^\s@]+@[^\s@]+  Match a single @ between one or more non-whitespace characters on each side, excluding the @ itself
)  Close group 1
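
The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is a sketch of that layout; the name EMAIL_FIELD is an illustrative choice, and the pattern is otherwise unchanged.

```python
import re

EMAIL_FIELD = re.compile(r"""
    \bEmail        # word boundary, then the literal text Email
    [^:]*:         # any characters other than a colon, then the colon itself
    \s*            # optional whitespace
    (              # open capture group 1
      [^\s@]+      #   one or more characters that are neither whitespace nor @
      @            #   a single @
      [^\s@]+      #   one or more characters that are neither whitespace nor @
    )              # close capture group 1
""", re.VERBOSE)

match = EMAIL_FIELD.search("Last name:\nEmail: email@domain.com")
print(match.group(1))
```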


Example with re.findall, which returns the values of the capture groups:

import re
 
regex = r"\bEmail[^:]*:\s*([^\s@]+@[^\s@]+)"
 
s = ("Submitted data:\n"
    "First Name: MyName\n"
    "Your Email Address: email@domain.com\n"
    "TAG:\n\n"
    "Submitted data:\n"
    "First name: MyName\n"
    "Last name:\n"
    "Email: email@domain.com")
 
print(re.findall(regex, s))

Output

['email@domain.com', 'email@domain.com']
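
For the case in the question, where one message yields one name and one address, a sketch along the following lines may be closer to what is needed. The function name parse_submission and the name pattern are assumptions made for illustration; only the email pattern comes from the answer above.

```python
import re

def parse_submission(body):
    """Return (name, email); either is None if the field is missing or empty."""
    # Assumed pattern for the name line: 'First name:' in any letter case,
    # followed by a non-empty value on the same line.
    name = re.search(r"^First name:[ \t]*(\S.*?)[ \t]*$", body,
                     re.IGNORECASE | re.MULTILINE)
    # Email pattern from the answer; group 1 holds the address.
    email = re.search(r"\bEmail[^:]*:\s*([^\s@]+@[^\s@]+)", body)
    return (name.group(1) if name else None,
            email.group(1) if email else None)

new_format = "Submitted data:\nFirst name: MyName\nLast name:\nEmail: email@domain.com"
receiver_name, receiver_email = parse_submission(new_format)
print(receiver_name, receiver_email)
```

Because neither pattern depends on what follows the address, the same function handles the older format with the TAG: line as well.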

Answer 1
1 2021-05-06 17:24:30

Answer 2
1 2021-05-06 17:49:33

Answer 3
1 2021-05-06 19:12:30
