
Regex for grepping emails in a file

I would like to validate emails from text files in a directory using bash.

My regex:

grep -Eoh \
         "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,8}\b" * \
         | sort -u > mail_list

This regex satisfies all my requirements, but it cannot exclude addresses such as:

^%&blah@gmail.com

and

with.dot@sale..department.company-name.com

(with two or more consecutive dots).

These kinds of addresses should be excluded.

How can I modify this regex to exclude these types of emails?
I can use only one expression for this task.
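
The behavior described above can be reproduced with a short sketch. The sample lines are taken from the question, but the helper name extract_original and the use of stdin instead of files are assumptions made for illustration. It also assumes a grep that understands \b in -E mode, as GNU grep does.

```shell
#!/usr/bin/env bash
# Reproduce the problem with the original pattern on sample input.
# Assumption: grep supports \b in -E mode (GNU grep does).
original_regex='\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,8}\b'

# Hypothetical helper: read text on stdin, print each unique match.
extract_original() {
    grep -Eo "$original_regex" | LC_ALL=C sort -u
}

printf '%s\n' \
    'abc@test.com' \
    '^%&blah@gmail.com' \
    'with.dot@sale..department.company-name.com' \
    | extract_original
```

With GNU grep, the second sample is reported as blah@gmail.com (the \b lets the match start right after the leading symbols), and the double-dot address is reported in full.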

The email address ^%&blah@gmail.com is actually a valid email address.

You can do this in Perl using the Email::Valid module (this assumes that each entry is on a new line):

perl -MEmail::Valid -ne 'print if Email::Valid->address($_)' file1 file2

file1

not email
abc@test.com

file2

not email
def@test.com
^%&blah@gmail.com
with.dot@sale..department.company-name.com

output

abc@test.com
def@test.com
^%&blah@gmail.com

Try this regex:

'\b[A-Za-z0-9]+[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'

I added an alphanumeric group to the front to force emails to begin with at least one letter or number, after which they may also contain symbols.

After the @ sign, I added a group that matches one or more letters, digits, or hyphens, followed by exactly one period. This group can be repeated multiple times, so it can match long.domain.name.com but not two periods in a row.

Finally, the regex ends with the top-level domain as you had it, for example 'com'.
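
A quick way to check this pattern against a few sample lines is sketched below. The helper name extract_revised and the sample addresses user@long.domain.name.com and a@b.com are made up for illustration, and GNU grep is assumed.

```shell
#!/usr/bin/env bash
# Try the revised -E pattern on sample input.
# Assumption: grep supports \b in -E mode (GNU grep does).
revised_regex='\b[A-Za-z0-9]+[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'

# Hypothetical helper: read text on stdin, print each unique match.
extract_revised() {
    grep -Eo "$revised_regex" | LC_ALL=C sort -u
}

printf '%s\n' \
    'abc@test.com' \
    'user@long.domain.name.com' \
    'with.dot@sale..department.company-name.com' \
    '^%&blah@gmail.com' \
    'a@b.com' \
    | extract_revised
```

Under these assumptions the double-dot address is no longer matched, while ^%&blah@gmail.com is still picked up as blah@gmail.com. Because the pattern needs one alphanumeric character followed by at least one more allowed character, a one-character local part such as a@b.com is also skipped.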


Update:

Since \b matches a word boundary, and the symbols ^%& are not considered part of the word 'blah', the regex above will still match blah@gmail.com even though it is preceded by undesired characters. To avoid this, use a negative lookbehind. This requires using grep -P instead of -E:

grep -P '(?<![%&^])\b[A-Za-z0-9]+[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'

The (?<![%&^]) tells the regex engine to continue matching only if the current position is not immediately preceded by one of the characters %, & or ^.
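
The effect of the lookbehind can be checked with a small sketch. It assumes GNU grep built with PCRE support (the -P option), and the helper name extract_strict and the address user@long.domain.name.com are hypothetical.

```shell
#!/usr/bin/env bash
# Try the lookbehind pattern on sample input.
# Assumption: GNU grep with PCRE support, so that -P is available.
strict_regex='(?<![%&^])\b[A-Za-z0-9]+[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'

# Hypothetical helper: read text on stdin, print each unique match.
extract_strict() {
    grep -Po "$strict_regex" | LC_ALL=C sort -u
}

printf '%s\n' \
    'abc@test.com' \
    'user@long.domain.name.com' \
    'with.dot@sale..department.company-name.com' \
    '^%&blah@gmail.com' \
    | extract_strict
```

Here ^%&blah@gmail.com produces no output, because the only position where \b allows a match to start is directly after the &. The lookbehind rejects only the three listed characters, so any other leading symbols would have to be added to the bracket expression.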
