
Python extract email address from a HUGE string

I have been using this: (I know, there are probably more efficient ways...)

Given this in an email message:

Submitted data:
First Name: MyName
Your Email Address: email@domain.com
TAG:

I coded this:

intStart = (bodystring.rfind('First ')) + 12
intEnd = (bodystring.rfind('Your Email'))
receiver_name = bodystring[intStart:intEnd]

intStart = (bodystring.rfind('Your Email Address: ')) + 20
intEnd = (bodystring.rfind('TAG:'))
receiver_email = bodystring[intStart:intEnd]

... and got what I needed. This worked because I had the 'TAG' label.

Now I am given this:

Submitted data:
First name: MyName
Last name:
Email: email@domain.com

I'm having a brain block on getting the email address without a next word. There is whitespace. Can someone nudge me in the right direction? I suspect I can dig out the email address after the occurrence of 'Email:' using regex...

You can, in fact, make use of RegEx to extract e-mails.

  • To find a single e-mail in a text, you can make use of re.search().group()

  • In case you want to find multiple e-mails, you can make use of re.findall()

An example:

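    import re
    text = "First name: MyName Last name: Email: email@domain.com "
    
    # The character classes are lowercase-only, so match case-insensitively
    pattern = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.IGNORECASE)
    
    # search() returns the first match, or None when nothing matches
    email = pattern.search(text)
    if email:
        print(email.group())
    
    # findall() returns every match as a list of strings
    emails = pattern.findall(text)
    print(emails)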

This gives the output:

email@domain.com
['email@domain.com']
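
For a very large string, a lazy variant may be worth considering. The following is a minimal sketch, assuming the same pattern as above: the helper name `iter_emails` and the sample text are hypothetical, and `re.IGNORECASE` is passed as an assumption so that upper-case addresses match too. `re.finditer` yields match objects one at a time instead of building the whole result list up front.

```python
import re

# Same pattern as above, compiled once; IGNORECASE is an added assumption
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.IGNORECASE)

def iter_emails(text):
    """Yield (start_offset, address) for each email-like match in text."""
    for match in EMAIL_RE.finditer(text):
        yield match.start(), match.group()

sample = "First name: MyName Last name: Email: email@domain.com "
for offset, address in iter_emails(sample):
    print(offset, address)
```

The offset is included only to show where each match sits in the input; drop it if the address alone is needed.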

Searching for strings is often better done with splitting, and occasionally regular expressions. So first split the lines:

bodylines = bodystring.splitlines()

Split each line on the first : delimiter, skipping lines that have no colon so that every chunk unpacks into exactly a key and a value (this makes a generator):

chunks = (line.split(':', 1) for line in bodylines if ':' in line)

Now grab the first one that has "email" on the left and @ on the right:

address = next(val.strip() for key, val in chunks if 'email' in key.lower() and '@' in val)

If you want all the emails across multiple lines, replace next with a list comprehension. Note that chunks is a generator and can only be consumed once, so recreate it first if next has already been called on it:

addresses = [val.strip() for key, val in chunks if 'email' in key.lower() and '@' in val]

This can be done in one line with no imports (if you replace chunks with its definition, not that I recommend it). Regular expressions are a much heavier tool: they let you specify much more general patterns, but they can also be slower as a result. If you can get away with simple and effective tools, do it: don't bring in the sledgehammer until you need it!
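
The splitting steps above can be combined into a small helper. This is a minimal sketch rather than the answer's exact code: the function name `find_emails` and the sample `bodystring` are hypothetical, and `str.partition` is swapped in for `split(':')` so that lines with no colon, or with more than one, do not break the unpacking.

```python
def find_emails(bodystring):
    """Return values from 'key: value' lines whose key mentions email."""
    addresses = []
    for line in bodystring.splitlines():
        # partition always returns a 3-tuple, even when ':' is absent
        key, sep, val = line.partition(':')
        if sep and 'email' in key.lower() and '@' in val:
            addresses.append(val.strip())
    return addresses

bodystring = (
    "Submitted data:\n"
    "First name: MyName\n"
    "Last name:\n"
    "Email: email@domain.com\n"
)
addresses = find_emails(bodystring)
print(addresses)
# next() with a default avoids StopIteration when nothing matches
print(next(iter(addresses), None))
```

Returning a list keeps both use cases available: take the first element for a single address, or use the whole list when the message contains several.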

If the email address should come after the word Email followed by a : , you could match that label and capture the address in a group using an email-like pattern.

\bEmail[^:]*:\s*([^\s@]+@[^\s@]+)
  • \bEmail A word boundary to prevent a partial match, then match Email
  • [^:]*:\s* Match optional chars other than : , then match : and optional whitespace chars
  • ( Capture group 1
    • [^\s@]+@[^\s@]+ Match a single @ between 1 or more non-whitespace chars on each side, excluding the @ itself
  • ) Close group 1


Example with re.findall that returns the values of the capture groups:

import re
 
regex = r"\bEmail[^:]*:\s*([^\s@]+@[^\s@]+)"
 
s = ("Submitted data:\n"
    "First Name: MyName\n"
    "Your Email Address: email@domain.com\n"
    "TAG:\n\n"
    "Submitted data:\n"
    "First name: MyName\n"
    "Last name:\n"
    "Email: email@domain.com")
 
print(re.findall(regex, s))

Output

['email@domain.com', 'email@domain.com']
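
When only the first address is needed, `re.search` with the same pattern may be enough. A minimal sketch, assuming the regex above: the helper name `first_email` is hypothetical, and `re.IGNORECASE` is an added assumption so that labels such as `email:` also match.

```python
import re

# Pattern from the answer; IGNORECASE is an addition, not part of the original
EMAIL_AFTER_LABEL = re.compile(r"\bEmail[^:]*:\s*([^\s@]+@[^\s@]+)", re.IGNORECASE)

def first_email(text):
    """Return the first address following an 'Email...:' label, or None."""
    match = EMAIL_AFTER_LABEL.search(text)
    # group(1) is the capture group holding just the address
    return match.group(1) if match else None

s = ("Submitted data:\n"
     "First name: MyName\n"
     "Last name:\n"
     "Email: email@domain.com")

print(first_email(s))
print(first_email("no address here"))
```

Checking the match before calling `group(1)` matters here, because `re.search` returns None when the label or the address is missing.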
