I've written a Python script that uses a regular expression to grab email addresses from certain websites. I've used Selenium because a few of the sites are dynamic. The script works fine as long as the pages contain no file names that resemble email addresses, such as himalayan-institute-logo@2x.png.
How can I exclude matches ending in .png or .jpg while grabbing emails?
Regex pattern I've made use of:
[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
Script I'm trying with:
import re
from selenium import webdriver

URLS = (
    'https://www.himalayaninstitute.org/about/',
    'http://www.innovaprint.com.sg/',
    'http://www.cityscape.com.sg/?page_id=37',
    'http://www.yogaville.org',
)

def get_email(driver,link):
    driver.get(link)
    email = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+',driver.page_source)
    if email:
        print(link,email[0])
    else:
        print(link)

if __name__ == '__main__':
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(chrome_options=chromeOptions)
    for url in URLS:
        get_email(driver,url)
    driver.quit()
Output I'm having:
https://www.himalayaninstitute.org/about/ himalayan-institute-logo@2x.png
http://www.innovaprint.com.sg/ info@innovacoms.com
http://www.cityscape.com.sg/?page_id=37 info@cityscape.com.sg
http://www.yogaville.org Yantra-@500.png
The last part, [a-zA-Z0-9-.]+, is a broad match that does not take the position of the dot into account. It could, for example, also match a run of dots such as .....
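To make the point about the broad final character class concrete, here is a small sketch that runs the pattern from the question against a few strings. The sample strings are made up for illustration and are not taken from the actual pages.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL_PATTERN = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

# Hypothetical sample strings, chosen only for illustration.
samples = [
    'info@cityscape.com.sg',            # a real-looking address
    'himalayan-institute-logo@2x.png',  # an image file name
    'x@y.....',                         # a run of dots after the domain
]

# Collect what the original pattern finds in each sample.
original_matches = {s: ORIGINAL_PATTERN.findall(s) for s in samples}

for text, found in original_matches.items():
    print(text, '->', found)
```

All three samples are matched in full, which is why the image file names show up in the output.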
One possibility is to keep the first part of your pattern, [a-zA-Z0-9_.+-]+@, to match up to and including the @ sign.
Then use a negative lookahead to assert that what is on the right does not end with .png or .jpg, and match a pattern in which every dot sits between runs of at least one non-dot character:
[a-zA-Z0-9_.+-]+@[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*(?!\.(?:png|jpg))\.[a-zA-Z0-9]+
Explanation
[a-zA-Z0-9_.+-]+@
  Match 1+ allowed characters followed by @
[a-zA-Z0-9]+
  Match 1+ times any character listed in the character class
(?:
  Non-capturing group
\.[a-zA-Z0-9]+
  Match a dot followed by 1+ times what is listed in the character class
)*
  Close the non-capturing group and repeat it 0+ times
(?!
  Negative lookahead, assert that what follows is not
\.(?:png|jpg)
  Match .png or .jpg
)
  Close the lookahead
\.[a-zA-Z0-9]+
  Match a dot followed by 1+ times what is listed in the character class
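As a quick check of the revised pattern, the sketch below applies it to a made-up HTML fragment. The helper name find_emails and the sample markup are assumptions for illustration only; in the script from the question, driver.page_source would be passed in instead.

```python
import re

# Revised pattern from the answer: dot-separated domain labels, plus a
# negative lookahead that rejects a final label beginning with png or jpg.
EMAIL_PATTERN = re.compile(
    r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9]+(?:\.[a-zA-Z0-9]+)*(?!\.(?:png|jpg))\.[a-zA-Z0-9]+'
)

def find_emails(text):
    """Return every substring of text matched by the revised pattern.

    Hypothetical helper, not part of the original script.
    """
    return EMAIL_PATTERN.findall(text)

# Made-up page fragment containing two image names and one address.
sample_html = (
    '<img src="himalayan-institute-logo@2x.png"> '
    '<a href="mailto:info@cityscape.com.sg">contact</a> '
    '<img src="Yantra-@500.png">'
)

emails = find_emails(sample_html)
print(emails)
```

Note that, unlike the pattern in the question, the revised domain part no longer allows hyphens, so an address such as info@my-site.com would not be matched; add - to the domain character classes if that matters for your sites.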