Regular expression for processing emails in files

Question

I would like to validate emails from text files in a directory using bash.

My regex:

grep -Eoh \
         "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,8}\b" * \
         | sort -u > mail_list

This regex satisfies all my requirements, but it cannot exclude addresses such as:

^%&blah@gmail.com

and

with.dot@sale..department.company-name.com

(with two or more consecutive dots).

These kinds of addresses should be excluded.

How can I modify this regex to exclude these types of emails?
I can use only one expression for this task.

Answer 1

The email address ^%&blah@gmail.com is actually a valid email address: RFC 5322 permits the characters ^, % and & in the local part.

You can do this in Perl using the Email::Valid module (this assumes that each entry is on its own line):

perl -MEmail::Valid -ne 'print if Email::Valid->address($_)' file1 file2

file1

not email
abc@test.com

file2

not email
def@test.com
^%&blah@gmail.com
with.dot@sale..department.company-name.com

output

abc@test.com
def@test.com
^%&blah@gmail.com

Answer 2

Try this regex:

'\b[A-Za-z0-9]+[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'

I added an alphanumeric group to the front to force emails to begin with at least one letter or digit, after which they may also contain symbols.

After the @ sign, I added a group that matches one or more letters, digits, or hyphens followed by exactly one period. This group can be repeated, so the regex still matches long.domain.name.com but cannot match two consecutive dots.

Finally, the regex ends with the final label as you had it, for example 'com'.
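As a rough check of the behaviour described above, here is a minimal sketch that runs the regex with grep -E against a throwaway file. The file name sample.txt and its contents are made up for illustration, and it assumes GNU grep (for \b support in extended regexes).

```shell
#!/usr/bin/env bash
# Sketch only: sample.txt and the addresses in it are hypothetical test data.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
user@long.domain.name.com
with.dot@sale..department.company-name.com
^%&blah@gmail.com
EOF

regex='\b[A-Za-z0-9]+[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'

# -E: extended regex, -o: print only the matched text, -h: omit file names
matches_ere=$(grep -Eoh "$regex" "$workdir"/* | sort -u)
printf '%s\n' "$matches_ere"

rm -rf "$workdir"
```

With this input the double-dot address is dropped entirely and user@long.domain.name.com is kept. The third line still yields blah@gmail.com, because the match simply starts after the leading symbols.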

Update:

Since \b matches a word boundary, and the symbols ^%& are not considered part of the word 'blah', the above will still match blah@gmail.com even though it is preceded by undesired characters. To avoid this, use a negative lookbehind. This requires using grep -P instead of -E:

grep -P '(?<![%&^])\b[A-Za-z0-9]+[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'

The (?<![%&^]) tells the regex engine to continue matching only if the current position is not immediately preceded by one of the characters %, & or ^.
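To see the lookbehind in action, here is a small sketch using the same flags as the question (-o, -h and sort -u). It assumes a GNU grep built with PCRE support for -P. The file name sample.txt and its contents are invented test data.

```shell
#!/usr/bin/env bash
# Sketch only: sample.txt and the addresses in it are hypothetical test data.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
def@test.com
^%&blah@gmail.com
with.dot@sale..department.company-name.com
EOF

regex='(?<![%&^])\b[A-Za-z0-9]+[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,8}\b'

# -P: Perl-compatible regex (needed for the lookbehind),
# -o: print only the matched text, -h: omit file names
matches_pcre=$(grep -Poh "$regex" "$workdir"/* | sort -u)
printf '%s\n' "$matches_pcre"

rm -rf "$workdir"
```

Only def@test.com is printed for this input. Two limits are worth keeping in mind: the lookbehind rejects only the three listed characters, so a different leading symbol such as # would still produce a partial match, and the pattern needs at least two characters before the @, so a one-letter local part such as a@test.com is not matched.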
