
Enhance a PHP regular expression to exclude values in specific HTML attributes

I have the following PHP regular expression that I use to extract plain-text email addresses from HTML pages:

/(^[^<\s? input ].*)(?<=[^\w\d\+_.:-])(?:[-!#$%&*+\/=?^_`.{|}~\w\x80-\xFF]+|".*?")\@(?:[-a-z0-9\x80-\xFF]+(?:\.[-a-z0-9\x80-\xFF]+)*\.[a-z]+|\[[\d.a-fA-F:]+\])(?!(?>[^<]*(?:<(?!\/?a\b)[^<]*)*)<\/a>)/i

The problem is that it also selects emails inside HTML attributes, such as value="somemail@something.com" or placeholder='somemail@someserver.org', which I don't want. So I am trying to modify it to exclude attribute values.

The email in the following sentence should be matched:

<p>hello my name is  etsefefsda@gmail.com and thats it.</p>

The following four should be excluded from the selection (notice the single, double and no quotes after the equal sign):

<p data-email='an_email@here.com'
<input value="someone@yahoo.co.uk"
<input placeholder="someone@preosmail.com"
<input placeholder=someone@servermail.com

Any ideas on how to do it?

Thank you in advance

Assuming a valid email never appears after an unclosed <, try a variation of the following:

<[^>]+@(*SKIP)(*FAIL)|@

Explanation

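  • <[^>]+ matches a < followed by one or more characters that are not >, i.e. the inside of a tag that has not been closed yet.
  • @ matches an @ inside that unclosed tag.
  • (*SKIP)(*FAIL) YE SHALL NOT PASS, with "ye" meaning an email within an unclosed tag. Everything consumed so far is discarded and the search resumes after it.

  • |@ otherwise matches an email address anywhere outside a tag.

    • You should replace this second @ with your own regex for finding emails. It is only a placeholder here.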
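
As a rough sketch of how this could be wired into preg_match_all, the snippet below swaps the placeholder @ for a deliberately simplified email pattern. The $email pattern is an assumption for illustration only, not the full expression from the question. The sample HTML is adapted from the question, with the tags closed.

```php
<?php
// Simplified stand-in for an email pattern (an assumption for this sketch;
// substitute your own, stricter expression here).
$email = '[\w.+-]+@[\w-]+(?:\.[\w-]+)+';

// Alternative 1 consumes "<...@" inside an unclosed tag, then (*SKIP)(*FAIL)
// rejects it and makes the engine resume scanning after that point.
// Alternative 2 matches addresses everywhere else.
$pattern = '~<[^>]+@(*SKIP)(*FAIL)|' . $email . '~i';

$html = <<<'HTML'
<p>hello my name is etsefefsda@gmail.com and thats it.</p>
<p data-email='an_email@here.com'>
<input value="someone@yahoo.co.uk">
<input placeholder="someone@preosmail.com">
<input placeholder=someone@servermail.com>
HTML;

preg_match_all($pattern, $html, $matches);
print_r($matches[0]); // only etsefefsda@gmail.com is expected here
```

Because [^>]+ is greedy, the first alternative runs up to the last @ before the tag closes, so several addresses inside one tag are all skipped together. The verbs (*SKIP) and (*FAIL) are PCRE features, so this works with PHP's preg_* functions but not with every regex engine.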
