
Python: parse email addresses with regex

I'm a beginner with regex in Python.

The target test.php code:

<html>
  <head></head> 
  <body>
    <a href="www.google.com">josn2051@yahoo.com.tw</a>
    <div>john@yahoo.com.tw</div>
    testtest321@gmail.com
    chorm3636@test.test.test.com
  </body>
</html>

This is my code:

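import re
import requests

# Pattern as I wrote it; note the inner parenthesized group after the @.
email_pattern = re.compile(r'([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)')

res = requests.get("http://127.0.0.1/test.php")

# findall() returns one entry per match found in the page text.
a = email_pattern.findall(res.text)

print(a)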

The result:

[(u'josn2051@yahoo.com.tw', u'com.'), (u'john@yahoo.com.tw', u'com.'), (u'testtest321@gmail.com', u'gmail.'), (u'chorm3636@test.test.test.com', u'test.')]

But I want a result like:

[josn2051@yahoo.com.tw, john@yahoo.com.tw, testtest321@gmail.com, chorm3636@test.test.test.com]

What is wrong with my pattern or code?

Why is the result a list of tuples that contain the extra com., gmail., and test. items?

Thank you for helping to clear up my doubts!

The first rule is that you should never use a regexp to parse HTML; it is impossible to do it right!

Second, once you have a block of text that you want to validate as being an email address, search for it and you will find two to five very good regexps on Stack Overflow. Regexps are not Python-specific.

Third, look for a better job: scraping email addresses from websites is not an easy task, and everyone here hates those who spam us.
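The first two points can be sketched roughly as follows: pull the text nodes out of the HTML with a real parser, then run a regexp over the plain text only. This is a minimal illustration, assuming Python 3 and the standard-library html.parser module; the names TextCollector, extract_emails, EMAIL_RE, and sample_html are hypothetical, and the pattern is a simplified one for demonstration, not a complete email validator.

```python
import re
from html.parser import HTMLParser

# Simplified, illustrative pattern (an assumption, not a full validator).
# It has no capturing groups, so findall() returns plain strings.
EMAIL_RE = re.compile(r'[\w.-]+@(?:[\w-]+\.)+[\w-]+')


class TextCollector(HTMLParser):
    """Collects text nodes only; tags and attribute values are ignored."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for the text found between tags.
        self.chunks.append(data)


def extract_emails(html):
    """Return email-like strings found in the text content of html."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return EMAIL_RE.findall(text)


# The page from the question, inlined so the sketch runs without a server.
sample_html = """<html>
  <head></head>
  <body>
    <a href="www.google.com">josn2051@yahoo.com.tw</a>
    <div>john@yahoo.com.tw</div>
    testtest321@gmail.com
    chorm3636@test.test.test.com
  </body>
</html>"""

emails = extract_emails(sample_html)
print(emails)
```

Because the regexp only ever sees text nodes, anything inside tags or attributes (such as the href value) is never considered a candidate.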

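Make the inner group non-capturing. When a pattern contains more than one capturing group, re.findall returns a tuple of all the groups for each match, which is where the extra com., gmail., and test. items come from: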

([\w\-\.]+@(?:\w[\w\-]+\.)+[\w\-]+)
            ^^
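A small side-by-side sketch of the difference, assuming Python 3 and using the question's page inlined as a string instead of fetching it over HTTP (the variable names are illustrative only):

```python
import re

# Body of the question's test.php, inlined so no web server is needed.
text = """
<a href="www.google.com">josn2051@yahoo.com.tw</a>
<div>john@yahoo.com.tw</div>
testtest321@gmail.com
chorm3636@test.test.test.com
"""

# Original pattern: two capturing groups, so findall() returns tuples.
# The second item is the last repetition matched by the inner group.
capturing = re.compile(r'([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)')

# Inner group changed to (?:...): only one capturing group remains,
# so findall() returns a flat list of strings.
non_capturing = re.compile(r'([\w\-\.]+@(?:\w[\w\-]+\.)+[\w\-]+)')

tuples = capturing.findall(text)
emails = non_capturing.findall(text)

print(tuples)
print(emails)
```

Dropping the outer parentheses as well would give the same flat list, since findall() returns the whole match when the pattern has no capturing groups.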
