Using grep to find all emails

How do I properly construct a regular expression for the Linux "grep" program to find all email addresses in, say, the /etc directory? Currently, my script is the following:

grep -srhw "[[:alnum:]]*@[[:alnum:]]*" /etc

It works OK: I see some of the emails. But when I modify it to catch one or more characters before and after the "@" sign...

grep -srhw "[[:alnum:]]+@[[:alnum:]]+" /etc

... it stops working entirely.

Also, it doesn't catch emails of the form "Name.LastName@site.com".

Help!

Here is another example:

grep -Eiorh '([[:alnum:]_.-]+@[[:alnum:]_.-]+?\.[[:alpha:].]{2,6})' "$@" * | sort | uniq > emails.txt

This variant also works with third-level domains.

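By default, grep uses basic regular expressions (BRE), where + is an ordinary character. It only means "one or more" when escaped as \+ (a GNU extension) or when you switch to extended regular expressions with egrep (grep -E). You'll want to do one of these two: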

grep -srhw "[[:alnum:]]\+@[[:alnum:]]\+" /etc

egrep -srhw "[[:alnum:]]+@[[:alnum:]]+" /etc
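A quick way to see the difference is to run all three forms on one line of text. This is only a sketch, assuming GNU grep; the sample line and the variable names are made up for illustration.

```shell
# Sample input, made up for this demonstration.
sample='contact: root@localhost'

# BRE (grep's default): "+" is a literal plus sign, so nothing matches.
# grep exits with status 1 on no match, hence the "|| true".
bre_plain=$(printf '%s\n' "$sample" | grep -o '[[:alnum:]]+@[[:alnum:]]+' || true)

# BRE with the GNU "\+" extension: one or more.
bre_escaped=$(printf '%s\n' "$sample" | grep -o '[[:alnum:]]\+@[[:alnum:]]\+' || true)

# ERE via -E: "+" means one or more without escaping.
ere=$(printf '%s\n' "$sample" | grep -Eo '[[:alnum:]]+@[[:alnum:]]+' || true)

printf 'BRE plain:   [%s]\n' "$bre_plain"
printf 'BRE escaped: [%s]\n' "$bre_escaped"
printf 'ERE:         [%s]\n' "$ere"
```

Only the unescaped BRE form should come back empty, which is why the second command in the question appeared to stop working.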

I modified your regex to include punctuation (like .-_ etc) by changing it to

egrep -ho "[[:graph:]]+@[[:graph:]]+"

This is still pretty clean and matches... well, almost anything with an @ in it, of course. That includes third-level domains and addresses with '%' or '+' in them. See http://www.delorie.com/gnu/docs/grep/grep_8.html for good documentation on the character classes used.

In my example, the addresses were surrounded by white space, making matching quite easy. If you grep through a mail server log, for example, you can add < > to make it match only the bracketed addresses:

egrep -ho "<[[:graph:]]+@[[:graph:]]+>"
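To illustrate how much the angle brackets narrow things down, here is a small sketch that runs both patterns on one made-up log line (the line and the variable names are assumptions, not taken from a real mail log):

```shell
# Made-up log line for illustration.
line='to=<alice@example.com>, relay=mx.example.com; see bob@example.org.'

# [[:graph:]] matches any visible character, so neighbouring
# punctuation rides along with each match.
loose=$(printf '%s\n' "$line" | grep -Eo '[[:graph:]]+@[[:graph:]]+' || true)

# Requiring < and > keeps only the bracketed address.
bracketed=$(printf '%s\n' "$line" | grep -Eo '<[[:graph:]]+@[[:graph:]]+>' || true)

printf 'loose:\n%s\n' "$loose"
printf 'bracketed:\n%s\n' "$bracketed"
```

The loose pattern returns whole whitespace-delimited tokens, including the "to=<" prefix and trailing punctuation. The bracketed pattern returns only the address inside < >, and it skips addresses that are not bracketed at all.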

@thomas, @glowcoder and @oedo are all right. The RFC that defines what an email address can look like is quite a fun read. (I've been using GNU grep 2.9 above, as included in Ubuntu.)

Also check out zpea's version below, it should make for a less trigger-happy matcher.

I have used this one to extract email addresses, identified by the 'at' symbol and delimited by white space, from a text:

egrep -o "[^[:space:]]+@[^[:space:]]+" | tr -d "<>"

Of course, you can use grep -E instead of egrep (extended grep). Note that the tr command is used to remove typical email delimiters.
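As a rough sketch of what the tr step does and does not clean up, here is the same pipeline on a made-up sentence (the text and the variable name are illustrative assumptions):

```shell
# Made-up text: one bracketed address, one followed by a comma.
text='Reply to <alice@example.com> or bob@example.org, thanks'

# Grab every whitespace-delimited token containing "@",
# then delete the angle brackets from the result.
addresses=$(printf '%s\n' "$text" \
  | grep -Eo '[^[:space:]]+@[^[:space:]]+' \
  | tr -d '<>')

printf '%s\n' "$addresses"
```

The brackets disappear, but the comma after the second address survives, because tr only deletes the characters it is given. Adding more delimiters to the tr set is one option if your input needs it.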

grep -E -o -r "[A-Za-z0-9][A-Za-z0-9._%+-]+@[A-Za-z0-9][A-Za-z0-9.-]+\\.[A-Za-z]{2,6}" /etc

This is adapted from an answer that is not mine originally, but I found it super helpful. It's from here:

http://www.shellhacks.com/en/RegEx-Find-Email-Addresses-in-a-File-using-Grep

They suggest:

grep -E -o -r "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,6}\\b" /etc

But it has certain false positives, like '+person..@example.com' or 'person@..com', and the whitespace constraints miss things like "mailto:person@example.com" (not technically an email but contains one); so I tweaked it a little bit.

(Do what you want with the options to grep; I don't know them very well.)
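One way to compare the two patterns is to feed both the problem strings mentioned above. This is a sketch assuming GNU grep (\b is a GNU extension); the variable names are made up for illustration.

```shell
# The three problem strings discussed above, one per line.
samples='+person..@example.com
person@..com
mailto:person@example.com'

# Pattern suggested by the linked page.
original='\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'
# Tweaked pattern: local part and domain must each start with a letter or digit.
tweaked='[A-Za-z0-9][A-Za-z0-9._%+-]+@[A-Za-z0-9][A-Za-z0-9.-]+\.[A-Za-z]{2,6}'

original_hits=$(printf '%s\n' "$samples" | grep -Eo "$original" || true)
tweaked_hits=$(printf '%s\n' "$samples" | grep -Eo "$tweaked" || true)

printf 'original:\n%s\n\n' "$original_hits"
printf 'tweaked:\n%s\n' "$tweaked_hits"
```

On this input the tweaked pattern drops person@..com, since the domain must now begin with a letter or digit. It still accepts person..@example.com, because dots remain allowed after the first character of the local part, so a stricter local-part rule would be needed to reject consecutive dots.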

This recursive one works well for me:

grep -rIhEo "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" /etc/*

Just wanted to mention that this slight variation works great for pulling mentions out of things like Twitter tweets:

grep -Eiorh '(@[[:alnum:]_.-]+)' "$@" * | sort | uniq -c

This seems to work, but it also picks up file names containing @:

egrep -osrwh "[[:alnum:]._%+-]+@[[:alnum:]]+\.[a-zA-Z]{2,6}" ~/.thunderbird/

I bet there is no better base regex than this one:

egrep -o "[a-zA-Z0-9_.+%-]{1,}@[a-zA-Z0-9_.+%-]{1,}\.[a-zA-Z0-9_.+%-]{1,}"

It will not miss a single email in the garbage, but you will have to weed out things that look like emails and are not, such as home_mobile@1x.png. Either check those manually or make the regex more specific to what you want, for example by adding more special characters. Still, I doubt there is a base regex better than this one.
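As one possible way to automate part of that cleanup, a second grep can drop candidates that end in an image extension. This is a sketch only: the input is made up, the pattern is a cleaned-up form of the one above (no backslashes inside the brackets), and the extension list is an assumption you would adapt to your data.

```shell
# Made-up input mixing an address with asset file names.
input='mail admin@example.com
icon home_mobile@1x.png
logo banner@2x.jpg'

# Step 1: collect everything that looks like an address.
candidates=$(printf '%s\n' "$input" \
  | grep -Eo '[a-zA-Z0-9_.+%-]+@[a-zA-Z0-9_.+%-]+\.[a-zA-Z0-9_.+%-]+' || true)

# Step 2: discard candidates ending in common image extensions
# (hypothetical list; -i ignores case, -v inverts the match).
filtered=$(printf '%s\n' "$candidates" | grep -Eiv '\.(png|jpe?g|gif|svg)$' || true)

printf '%s\n' "$filtered"
```

Here only admin@example.com is left. Anything that is not covered by the extension list still needs a manual look.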
