
Python script to extract emails from a .txt file

I am currently attempting to run a script that extracts all the emails from a .txt file. When running the script, I get an invalid syntax error. Perhaps someone can help...

import re
in_file = open("C:\\Users\\Testing1_Emails.txt","rt")


for line in in_file:
    if re.match(r'[\w\.-]+@[\w\.-]+')
        print line

You have to write:

if re.match(r'[\w\.-]+@[\w\.-]+', line):

(Add the second argument, line, and the trailing colon.)
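For reference, here is a minimal sketch of the whole corrected loop, assuming Python 3 (where print is a function rather than a statement). The function name print_matching_lines and the returned list are hypothetical additions for illustration; the pattern is the one from the question.

```python
import re

# Pattern taken from the question: word characters, dots or hyphens around an "@"
EMAIL_RE = r'[\w\.-]+@[\w\.-]+'

def print_matching_lines(path):
    """Print and return the lines of the file that start with an email-like token."""
    matched = []
    with open(path, "rt") as in_file:      # the with block closes the file automatically
        for line in in_file:
            if re.match(EMAIL_RE, line):   # second argument and trailing colon added
                text = line.rstrip("\n")
                print(text)
                matched.append(text)
    return matched
```

Because re.match only tests the beginning of each line, this reports only lines that start with an address.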

The issue lies here:

for line in in_file:
    if re.match(r'[\w\.-]+@[\w\.-]+')
        print line

In the if re.match(r'[\w\.-]+@[\w\.-]+') statement, the line does not end with a colon (:), which is what triggers the syntax error.

The match method requires two arguments: the pattern and the string to test.

See: https://docs.python.org/2/library/re.html#re.match

re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).
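The difference matters for this question, since an address rarely sits at the very start of a line. A small sketch, using a made-up sample line and the pattern from the question:

```python
import re

pattern = r'[\w\.-]+@[\w\.-]+'
line = "Contact: jane.doe@example.com for details"   # hypothetical sample line

# match() only tries at position 0, where "Contact:" is not an address
print(re.match(pattern, line))            # None

# search() scans the whole string for the first match
print(re.search(pattern, line).group())   # jane.doe@example.com
```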

Most email addresses consist of letters, digits, dots (.) and underscores (_), and all of them contain an "@". We can use this information to write a regex pattern.

import re
pat = re.compile(r'[a-zA-Z0-9\._]+@[a-zA-Z\.]+')  # regex pattern; the trailing + lets the domain be more than one character

[a-z]+ will match one or more lowercase letters
[0-9]+ will match one or more digits
[.] will match a literal '.'
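As a rough illustration of how these character classes combine, the sketch below runs a compiled pattern over a made-up sentence. The sample text is hypothetical, and the pattern assumes the domain part may be longer than one character (note the trailing +).

```python
import re

# Local part: letters, digits, dot, underscore; domain: letters and dots
pat = re.compile(r'[a-zA-Z0-9._]+@[a-zA-Z.]+')

text = "Send to john_smith99@mail.example or to admin@site.org today"
found = pat.findall(text)
print(found)   # ['john_smith99@mail.example', 'admin@site.org']
```

One limitation of this simple pattern: because the domain class includes '.', a sentence-ending period directly after an address would be captured as part of it.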

To check that your pattern matches your search strings, you can test it at https://regexr.com/

Example:

import re

# Write the sample data, then reopen the file for reading
with open("my_file.txt", "w") as f:
    f.write('walkup@cs.washington.edu, geb@cs.pitt.edu, walkup@cs.washington.edu \n')
with open("my_file.txt") as f:
    mails = re.findall(r"[a-z]+@[a-z\.]+", f.read())
print(sorted(set(mails)))  # set() drops the duplicate; sorted() gives a stable order

Out: ['geb@cs.pitt.edu', 'walkup@cs.washington.edu']

Note: re.findall() compiles the pattern internally, so calling re.compile() first is optional.
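A quick way to see this is to compare the module-level call with an explicitly compiled pattern; both should return the same list. The variable names below are only for illustration.

```python
import re

text = "walkup@cs.washington.edu, geb@cs.pitt.edu, walkup@cs.washington.edu"
pattern = r"[a-z]+@[a-z.]+"

via_module = re.findall(pattern, text)              # pattern string passed directly
via_compiled = re.compile(pattern).findall(text)    # pattern compiled explicitly first

print(via_module == via_compiled)   # True
```

Compiling explicitly is mainly worthwhile when the same pattern is reused many times, for example once per line of a large file.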
