I have the following PHP regular expression that I use to extract plain-text email addresses from HTML pages:
/(^[^<\s? input ].*)(?<=[^\w\d\+_.:-])(?:[-!#$%&*+\/=?^_`.{|}~\w\x80-\xFF]+|".*?")\@(?:[-a-z0-9\x80-\xFF]+(?:\.[-a-z0-9\x80-\xFF]+)*\.[a-z]+|\[[\d.a-fA-F:]+\])(?!(?>[^<]*(?:<(?!\/?a\b)[^<]*)*)<\/a>)/i
The problem is that it also selects emails from HTML attributes such as value="somemail@something.com" or placeholder='somemail@someserver.org', which I don't want. So I am trying to modify it to exclude the attributes.
The following sentence is ok:
<p>hello my name is etsefefsda@gmail.com and thats it.</p>
The following four should be excluded from the selection (notice the single, double and no quotes after the equal sign):
<p data-email='an_email@here.com'
<input value="someone@yahoo.co.uk"
<input placeholder="someone@preosmail.com"
<input placeholder=someone@servermail.com
Any ideas on how to do it?
Thank you in advance
Assuming a valid email never comes after an unclosed <, try a variation of the following:
<[^>]+@(*SKIP)(*FAIL)|@
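<[^>]+ finds a < followed by one or more characters that are not >.
@ finds an @, so this alternative only matches when the @ sits inside an unclosed tag.
(*SKIP)(*FAIL): YE SHALL NOT PASS, with "ye" meaning an email within an unclosed tag. The text consumed so far is discarded and the search resumes after it.
|@ finds any correct email addresses. Replace this @ sign with your regex to find emails; I have it as a placeholder here.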
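As a rough sketch of how the pieces above fit together in PHP, the snippet below swaps the placeholder @ for a deliberately simplified email sub-pattern. The variable names and the simplified sub-pattern are assumptions made for illustration, not the regex from the question; substitute your own email expression there.

```php
<?php
// Hypothetical, simplified email sub-pattern used only for illustration.
// Replace it with your own email regex.
$email = '[\w.+-]+@[\w-]+(?:\.[\w-]+)+';

// First alternative: consume "<...@" inside an unclosed tag, then
// (*SKIP)(*FAIL) rejects it and restarts the search after that @.
// Second alternative: the email pattern, tried only on the remaining text.
$pattern = '~<[^>]+@(*SKIP)(*FAIL)|' . $email . '~i';

// Sample input taken from the question: one email in text, four in attributes.
$html = <<<'HTML'
<p>hello my name is etsefefsda@gmail.com and thats it.</p>
<p data-email='an_email@here.com'
<input value="someone@yahoo.co.uk"
<input placeholder="someone@preosmail.com"
<input placeholder=someone@servermail.com
HTML;

preg_match_all($pattern, $html, $matches);
print_r($matches[0]);
```

With this sample input, only etsefefsda@gmail.com should end up in $matches[0]. Note that [^>] also matches newlines, so an unclosed tag swallows everything up to the next >, which is why the approach relies on the assumption stated at the top of the answer.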