Extracting email addresses in an html block in ruby/rails

Question

I am creating a parser that guards against spam and email harvesting in a block of text that comes from TinyMCE (so it may or may not have HTML tags in it).

I've tried regexes and so far this has been successful:

/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

The problem is, I need to ignore all email addresses inside mailto hrefs. For example:

<a href="mailto:test@mail.com">test@mail.com</a>

This should only return the second email address.

To give some background on what I'm doing: I'm reversing the email addresses in a block, so the above example would look like this:

<a href="mailto:test@mail.com">moc.liam@tset</a>

The problem with my current regex is that it also replaces the address in the href. Is there a way to do this with a single regex, or do I have to check for one and then the other? Can I do this with just gsub, or do I have to use some Nokogiri/Hpricot magic to parse the mailtos? Thanks in advance!

Here are my references, by the way:

so.com/questions/504860/extract-email-addresses-from-a-block-of-text

so.com/questions/1376149/regexp-for-extracting-a-mailto-address

I'm also testing using this:

http://rubular.com/

Edit

Here's my current helper code:

def email_obfuscator(text)
  # Wrap every matched address in a span and reverse it.
  text.gsub(/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i) do |m|
    "<span class='anti-spam'>#{m.reverse}</span>"
  end
end

which results in this:

<a target="_self" href="mailto:<span class='anti-spam'>moc.liamg@tset</span>"><span class="anti-spam">moc.liamg@tset</span></a>

Answer 1

Would this work?

/\b(?<!mailto:)[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

The (?<!mailto:) is a negative lookbehind: it rejects any match that is immediately preceded by mailto:.

I don't have Ruby set up at work, unfortunately, but it worked with PHP when I tested it...
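For what it's worth, here is a minimal sketch of the lookbehind pattern plugged into the question's helper. It assumes Ruby 1.9 or later (the 1.8 regex engine does not support lookbehind), and the constant name EMAIL_NOT_AFTER_MAILTO is made up for illustration.

```ruby
# Hypothetical constant name; the pattern itself is the one from this answer.
EMAIL_NOT_AFTER_MAILTO = /\b(?<!mailto:)[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

def email_obfuscator(text)
  # Reverse each matched address and wrap it in a span, as in the question.
  text.gsub(EMAIL_NOT_AFTER_MAILTO) do |address|
    "<span class='anti-spam'>#{address.reverse}</span>"
  end
end

html = '<a href="mailto:test@mail.com">test@mail.com</a>'
puts email_obfuscator(html)
```

One caveat: the lookbehind only guards the position right after mailto:. With an address such as john.doe@mail.com, the engine can still start a match at the "." inside the href, so part of the mailto address would get rewritten.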

Answer 2

Another option if lookbehind doesn't work:

/\b(mailto:)?([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

This matches every email, with or without a leading "mailto:". You can then check whether the first captured group is "mailto:" and, if so, skip that match.
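A rough sketch of how that check could look inside a gsub block, assuming the pattern is written as a Ruby regex literal with single backslashes and [A-Z] in the last character class. The constant name EMAIL_WITH_OPTIONAL_MAILTO is hypothetical.

```ruby
# Hypothetical constant name. Group 1 is the optional "mailto:" prefix,
# group 2 is the address itself.
EMAIL_WITH_OPTIONAL_MAILTO = /\b(mailto:)?([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

def email_obfuscator(text)
  text.gsub(EMAIL_WITH_OPTIONAL_MAILTO) do |match|
    prefix  = Regexp.last_match(1)
    address = Regexp.last_match(2)
    if prefix
      # Preceded by "mailto:", so put the matched text back unchanged.
      match
    else
      "<span class='anti-spam'>#{address.reverse}</span>"
    end
  end
end

html = '<a href="mailto:test@mail.com">test@mail.com</a>'
puts email_obfuscator(html)
```

Because the whole "mailto:..." string is consumed as a single match, the address inside the href is left alone even when its local part contains dots.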

Answer 3

Why not just store all the matched emails in an array and remove any duplicates? You can do this easily with the Ruby standard library, and (I imagine) it's probably quicker and more maintainable than adding more complexity to your regex.

emails = ["email_one@example.com", "email_one@example.com", "email_two@example.com"]
emails.uniq # => ["email_one@example.com", "email_two@example.com"]
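As a sketch of where such an array could come from, String#scan with the question's original regex collects every match, and uniq then drops the repeats. The names EMAIL and unique_emails are made up for this example.

```ruby
# Hypothetical names; the pattern is the one from the question.
EMAIL = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

def unique_emails(text)
  # scan returns every match in order; uniq keeps the first of each.
  text.scan(EMAIL).uniq
end

html = '<a href="mailto:test@mail.com">test@mail.com</a> or other@mail.com'
p unique_emails(html)
```

Note that this yields the distinct addresses but not their positions, so a replacement step would still need some way to tell the href occurrence from the link text.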

3 answers

solution1
0 2010-05-06 15:21:45

solution2
0 ACCEPTED 2010-05-06 15:29:16

solution3
0 2010-05-06 16:51:48
