Python parse email address with regex

Question

I'm a beginner with regular expressions in Python.

The target test.php page:

<html>
  <head></head> 
  <body>
    <a href="www.google.com">josn2051@yahoo.com.tw</a>
    <div>john@yahoo.com.tw</div>
    testtest321@gmail.com
    chorm3636@test.test.test.com
  </body>
</html>

This is my code:

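import re

import requests

# Two capturing groups: the outer one wraps the whole address,
# the inner one wraps each "label." part of the domain.
email_pattern = re.compile(r'([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)')

res = requests.get("http://127.0.0.1/test.php")

a = email_pattern.findall(res.text)

print(a)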

The result:

[(u'josn2051@yahoo.com.tw', u'com.'), (u'john@yahoo.com.tw', u'com.'), (u'testtest321@gmail.com', u'gmail.'), (u'chorm3636@test.test.test.com', u'test.')]

But I want a result like:

[josn2051@yahoo.com.tw, john@yahoo.com.tw, testtest321@gmail.com, chorm3636@test.test.test.com]

What is wrong with my pattern or code?

Why is the result a list of tuples that contain the extra 'com.', 'gmail.', and 'test.' entries?

Thank you for clearing up my doubts!

Answer 1

The first rule is that you never use regexps to parse HTML; it is impossible to do it right!

Second, once you have a block of text that you want to validate as being an email address, search and you will find 2-5 very good regexps on Stack Overflow. Regexps are not Python-specific.

Third, look for a better job. Scraping email addresses from websites is not an easy task, and everyone here hates those who spam us.
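The first two points can be sketched as two separate steps: pull the text out of the HTML with a real parser, then run the email pattern over that text only. The answer does not name a parser; the standard-library html.parser module is used here as one possible choice, and TextCollector, extract_emails, and the sample string are hypothetical names made up for this illustration. The pattern is the one from the question with the inner group made non-capturing, not a complete email validator.

```python
import re
from html.parser import HTMLParser

# Pattern from the question, inner group made non-capturing.
# With no capturing groups, findall returns the full matches.
EMAIL_RE = re.compile(r'[\w\-\.]+@(?:\w[\w\-]+\.)+[\w\-]+')


class TextCollector(HTMLParser):
    """Hypothetical helper that collects the text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # HTMLParser calls this for text found between tags.
        self.chunks.append(data)


def extract_emails(html):
    """Return email-like strings found in the text nodes of html."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    found = []
    for chunk in collector.chunks:
        found.extend(EMAIL_RE.findall(chunk))
    return found


# Made-up sample: one address sits in an attribute, two in text nodes.
sample = ('<a href="mailto:x@example.com">josn2051@yahoo.com.tw</a>'
          '<div>john@yahoo.com.tw</div>')
result = extract_emails(sample)
print(result)
```

Because only text nodes are scanned, the address inside the href attribute is skipped. Whether that is desirable depends on the page being processed.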

Answer 2

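When a pattern contains more than one capturing group, findall returns a list of tuples with one item per group, and a repeated group keeps only its last match. That is where the extra 'com.', 'gmail.', and 'test.' come from. Make the inner group non-capturing: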

([\w\-\.]+@(?:\w[\w\-]+\.)+[\w\-]+)
            ^^
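A small sketch comparing the two patterns side by side. The HTML from the question is inlined as a string in place of res.text, so no HTTP request is made; the variable names are chosen only for this example.

```python
import re

# Stand-in for res.text from the question.
html = """<html>
  <head></head>
  <body>
    <a href="www.google.com">josn2051@yahoo.com.tw</a>
    <div>john@yahoo.com.tw</div>
    testtest321@gmail.com
    chorm3636@test.test.test.com
  </body>
</html>"""

# Original pattern: two capturing groups, so findall returns 2-tuples.
capturing = re.compile(r'([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)')

# Fixed pattern: one capturing group, so findall returns plain strings.
non_capturing = re.compile(r'([\w\-\.]+@(?:\w[\w\-]+\.)+[\w\-]+)')

tuples = capturing.findall(html)
emails = non_capturing.findall(html)

print(tuples[0])  # ('josn2051@yahoo.com.tw', 'com.')
print(emails)
```

With a single capturing group left, findall returns the list of addresses directly, which is the shape the question asks for.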


2 answers

solution1
2 2016-02-20 15:27:03

solution2
1 ACCEPTED 2016-02-20 15:02:09
