Unable to exclude unwanted file extensions while grabbing emails using regex

I've written a Python script that uses a regular expression to grab email addresses from certain websites. I've used selenium because a few of the sites are dynamic. The script works fine as long as the pages contain no file names that resemble an email address, such as himalayan-institute-logo@2x.png.

How can I exclude matches ending with .png or .jpg while grabbing emails?

Regex pattern I've used:

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+

Script I'm trying with:

import re
from selenium import webdriver

URLS = (
    'https://www.himalayaninstitute.org/about/',
    'http://www.innovaprint.com.sg/',
    'http://www.cityscape.com.sg/?page_id=37',
    'http://www.yogaville.org',
    )

def get_email(driver,link):
    driver.get(link)
    email = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+',driver.page_source)
    if email: 
        print(link,email[0])
    else: 
        print(link)

if __name__ == '__main__':
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(options=chromeOptions)
    for url in URLS:
        get_email(driver,url)
    driver.quit()

Output I'm getting:

https://www.himalayaninstitute.org/about/ himalayan-institute-logo@2x.png
http://www.innovaprint.com.sg/ info@innovacoms.com
http://www.cityscape.com.sg/?page_id=37 info@cityscape.com.sg
http://www.yogaville.org Yantra-@500.png

The last part [a-zA-Z0-9-.]+ is a broad match which does not take the position of the dot into account. For example, it would also match a run consisting only of dots (....).
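To see how broad that last part is, here is a minimal sketch that runs the question's pattern over three strings. The first two come from the question's output; the third, a@b...., is a made-up string used only to illustrate the dot problem.

```python
import re

# Pattern from the question, unchanged.
BROAD = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

# The first two strings appear in the question; 'a@b....' is hypothetical.
samples = [
    'info@cityscape.com.sg',
    'himalayan-institute-logo@2x.png',
    'a@b....',
]

for text in samples:
    match = BROAD.search(text)
    # All three strings are matched in full, including the image file name
    # and the string that ends in a run of dots.
    print(text, '->', match.group(0) if match else None)
```

Each of the three strings is matched in its entirety, which is why image file names leak into the results.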

One possibility is to keep the first part of your pattern, [a-zA-Z0-9_.+-]+@, to match up to and including the @ sign.

Then use a negative lookahead to assert that what follows is not .png or .jpg, and match a domain in which every dot sits between runs of at least one non-dot character.

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*(?!\.(?:png|jpg))\.[a-zA-Z0-9]+

Explanation

  • [a-zA-Z0-9_.+-]+@ Match 1+ of the allowed characters, followed by @
  • [a-zA-Z0-9]+ Match 1+ times any character listed in the character class
  • (?: Non-capturing group
    • \.[a-zA-Z0-9]+ Match a dot followed by 1+ times what is listed in the character class
  • )* Close the non-capturing group and repeat it 0+ times
  • (?! Negative lookahead, assert that what follows is not
    • \.(?:png|jpg) Match .png or .jpg
  • )\.[a-zA-Z0-9]+ Close the lookahead, then match a dot followed by 1+ times what is listed in the character class
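
As a quick check, the pattern can be tried against a small piece of markup. This is only a sketch: the page string below is a hypothetical fragment assembled from the file names and the address shown in the question's output, and first_email is an illustrative helper name, not part of the original script.

```python
import re

# Pattern from the answer: the negative lookahead rejects a final
# .png or .jpg part.
EMAIL = re.compile(
    r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*'
    r'(?!\.(?:png|jpg))\.[a-zA-Z0-9]+'
)


def first_email(text):
    """Return the first email-like match in text, or None (illustrative helper)."""
    match = EMAIL.search(text)
    return match.group(0) if match else None


# Hypothetical page fragment; the real page source will differ.
page = (
    '<img src="himalayan-institute-logo@2x.png"> '
    '<img src="Yantra-@500.png"> '
    '<a href="mailto:info@cityscape.com.sg">contact</a>'
)

print(first_email(page))  # info@cityscape.com.sg
```

In the question's script, the same pattern could be passed to re.findall in place of the original one. Note that, as written, the domain part no longer allows hyphens, so an address such as the made-up info@my-site.com would not match; adding - to the domain character classes would be one way to restore that.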

