
Extracting email addresses in an HTML block in Ruby/Rails

I am creating a parser that guards against spamming and harvesting of emails in a block of text that comes from TinyMCE (so it may or may not contain HTML tags).

I've tried regexes and so far this has been successful:

/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

The problem is that I need to ignore all email addresses inside mailto hrefs. For example:

<a href="mailto:test@mail.com">test@mail.com</a>

should only return the second email address.

For some background on what I'm doing: I'm reversing the email addresses in a block, so the above example would look like this:

<a href="mailto:test@mail.com">moc.liam@tset</a>

The problem with my current regex is that it also replaces the address in the href. Is there a way to do this with a single regex, or do I have to check for one and then the other? Can I do this with gsub alone, or do I have to use Nokogiri/Hpricot to parse the mailtos? Thanks in advance!

Here are my references, by the way:

so.com/questions/504860/extract-email-addresses-from-a-block-of-text

so.com/questions/1376149/regexp-for-extracting-a-mailto-address

I'm also testing with this:

http://rubular.com/

Edit

Here's my current helper code:

def email_obfuscator(text)
  # Wrap every matched address in a span and reverse its characters.
  text.gsub(/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i) do |m|
    "<span class='anti-spam'>#{m.reverse}</span>"
  end
end

which results in this:

<a target="_self" href="mailto:<span class='anti-spam'>moc.liamg@tset</span>"><span class="anti-spam">moc.liamg@tset</span></a>

Would this work?

/\b(?<!mailto:)[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

The (?<!mailto:) is a negative lookbehind: it rejects any match that is immediately preceded by mailto:

I don't have Ruby set up at work, unfortunately, but it worked with PHP when I tested it...
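
For anyone who wants to try the lookbehind in Ruby, here is a minimal sketch. It assumes Ruby 1.9 or later (older 1.8 regexes do not support lookbehind), and the method name obfuscate_with_lookbehind is made up for illustration; the span markup is copied from the question's helper.

```ruby
# Pattern from the question, with the negative lookbehind added.
EMAIL_NOT_AFTER_MAILTO =
  /\b(?<!mailto:)[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

# Hypothetical helper name; reverses only addresses that are not
# immediately preceded by "mailto:".
def obfuscate_with_lookbehind(text)
  text.gsub(EMAIL_NOT_AFTER_MAILTO) do |m|
    "<span class='anti-spam'>#{m.reverse}</span>"
  end
end

html = '<a href="mailto:test@mail.com">test@mail.com</a>'
puts obfuscate_with_lookbehind(html)
# => <a href="mailto:test@mail.com"><span class='anti-spam'>moc.liam@tset</span></a>
```

One caveat worth checking against your own data: the lookbehind only guards the position right after "mailto:". With a local part such as first.last, the regex can still start a match at the dot, so part of the address inside the href may get rewritten.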

Another option if lookbehind doesn't work:

/\b(mailto:)?([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

This matches all emails; you can then check whether the first captured group is "mailto:" and, if so, skip that match.
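
A rough sketch of that check inside a gsub block, in case it helps. The method name obfuscate_skipping_mailto is hypothetical, and the span markup is taken from the question's helper.

```ruby
# Group 1 is the optional "mailto:" prefix, group 2 is the address.
EMAIL_WITH_OPTIONAL_MAILTO =
  /\b(mailto:)?([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

# Hypothetical helper name; leaves mailto: addresses untouched and
# reverses every other address.
def obfuscate_skipping_mailto(text)
  text.gsub(EMAIL_WITH_OPTIONAL_MAILTO) do |match|
    prefix = Regexp.last_match(1)
    email  = Regexp.last_match(2)
    if prefix
      match # part of a mailto: href, so return it unchanged
    else
      "<span class='anti-spam'>#{email.reverse}</span>"
    end
  end
end

html = '<a href="mailto:test@mail.com">test@mail.com</a>'
puts obfuscate_skipping_mailto(html)
# => <a href="mailto:test@mail.com"><span class='anti-spam'>moc.liam@tset</span></a>
```

Because the prefix and the address are consumed in a single match, a dotted local part such as first.last inside an href is skipped as a whole rather than being partially matched.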

Why not just store all the matched emails in an array and remove any duplicates? You can do this easily with the Ruby standard library, and (I imagine) it's probably quicker and more maintainable than adding more complexity to your regex.

emails = ["email_one@example.com", "email_one@example.com", "email_two@example.com"]
emails.uniq # => ["email_one@example.com", "email_two@example.com"]
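
To tie this to the question's text, the array could be built with String#scan using the original regex. This is only a sketch; unique_emails is a made-up name, and other@example.org is a sample address added for illustration.

```ruby
EMAIL = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

# Hypothetical helper name; collects every address in the text once.
def unique_emails(text)
  text.scan(EMAIL).uniq
end

html = '<a href="mailto:test@mail.com">test@mail.com</a> other@example.org'
p unique_emails(html)
# => ["test@mail.com", "other@example.org"]
```

Note that this gives you the list of distinct addresses but not their positions, so the replacement step still has to tell the href occurrence apart from the link text.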
