
Unable to exclude unwanted file extensions while grabbing emails using regex

I've written a Python script that uses a regular expression to grab email addresses from certain websites. I've used Selenium because a few of the sites are dynamic. The script works fine as long as the page contains no file names that resemble email addresses, such as himalayan-institute-logo@2x.png.

How can I exclude matches ending with .png or .jpg while grabbing emails?

Regex pattern I've made use of:

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+

Script I'm trying with:

import re
from selenium import webdriver

URLS = (
    'https://www.himalayaninstitute.org/about/',
    'http://www.innovaprint.com.sg/',
    'http://www.cityscape.com.sg/?page_id=37',
    'http://www.yogaville.org',
    )

def get_email(driver, link):
    driver.get(link)
    # Collect every email-like string in the rendered page source.
    email = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', driver.page_source)
    if email:
        print(link, email[0])
    else:
        print(link)

if __name__ == '__main__':
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(options=chromeOptions)
    for url in URLS:
        get_email(driver,url)
    driver.quit()

Output I'm getting:

https://www.himalayaninstitute.org/about/ himalayan-institute-logo@2x.png
http://www.innovaprint.com.sg/ info@innovacoms.com
http://www.cityscape.com.sg/?page_id=37 info@cityscape.com.sg
http://www.yogaville.org Yantra-@500.png

The last part, [a-zA-Z0-9-.]+, is a broad match that does not take the position of the dot into account. It could, for example, also match a run of dots such as .....
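
To make that concrete, here is a minimal sketch that runs the pattern from the question over a few strings. The sample strings are made up for illustration (only the logo file name comes from the question's output), and the names BROAD, samples and broad_matches are my own.

```python
import re

# The pattern from the question, unchanged.
BROAD = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

# Hypothetical inputs: a real-looking address, an image file name,
# and a string whose "domain" is nothing but dots.
samples = [
    'info@cityscape.com.sg',
    'himalayan-institute-logo@2x.png',
    'a@b.....',
]

# findall returns every non-overlapping match in each string.
broad_matches = [BROAD.findall(s) for s in samples]

for sample, found in zip(samples, broad_matches):
    print(sample, '->', found)
```

All three strings are matched in full, which is why the image file names show up in your output.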

One possibility is to keep the first part of your pattern, [a-zA-Z0-9_.+-]+@, to match everything up to and including the @ sign.

Then use a negative lookahead to assert that what is on the right is not .png or .jpg, and match a pattern where each dot is followed by at least 1 character that is not a dot.

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*(?!\.(?:png|jpg))\.[a-zA-Z0-9]+

Explanation

  • [a-zA-Z0-9_.+-]+@ Match 1+ times any of the allowed characters, followed by @
  • [a-zA-Z0-9]+ Match 1+ times any character listed in the character class
  • (?: Non-capturing group
    • \.[a-zA-Z0-9]+ Match a dot followed by 1+ times what is listed in the character class
  • )* Close the non-capturing group and repeat it 0+ times
  • (?! Negative lookahead, assert that what follows is not
    • \.(?:png|jpg) Match .png or .jpg
  • )\.[a-zA-Z0-9]+ Close the lookahead and match a dot followed by 1+ times what is listed in the character class
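
As a rough sketch of how the pattern above could be used, the snippet below wraps it in a small helper and tries it on made-up input. The names EMAIL, first_email and html_with_logo are hypothetical and not part of your script; in your code the text would be driver.page_source instead.

```python
import re

# The pattern from this answer, split over two lines for readability.
EMAIL = re.compile(
    r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*'
    r'(?!\.(?:png|jpg))\.[a-zA-Z0-9]+'
)


def first_email(text):
    """Return the first match of EMAIL in text, or None if there is none."""
    match = EMAIL.search(text)
    return match.group(0) if match else None


# Hypothetical page fragment: an image file name followed by a real address.
html_with_logo = (
    '<img src="/img/himalayan-institute-logo@2x.png"> '
    '<a href="mailto:info@cityscape.com.sg">mail</a>'
)

print(first_email(html_with_logo))
print(first_email('Yantra-@500.png'))
```

The image name is skipped and the address after it is returned, while the second call finds nothing. One caveat, as far as I can tell: the domain part of this pattern no longer allows a hyphen, so an address such as info@my-site.com would not be matched either. If you need hyphenated domains, the character classes after the @ would have to be extended.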

