I'm a beginner with regular expressions in Python.
The target page is test.php, which contains:
<html>
<head></head>
<body>
<a href="www.google.com">josn2051@yahoo.com.tw</a>
<div>john@yahoo.com.tw</div>
testtest321@gmail.com
chorm3636@test.test.test.com
</body>
</html>
This is my code:
import re
import requests
email_pattern = re.compile(r'([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)')
res = requests.get("http://127.0.0.1/test.php")
a = email_pattern.findall(res.text)
print(a)
The result:
[(u'josn2051@yahoo.com.tw', u'com.'), (u'john@yahoo.com.tw', u'com.'), (u'asdfFGw@gmail.com', u'gmail.'), (u'chorm3636@test.test.test.com', u'test.')]
But I want a result like this:
[josn2051@yahoo.com.us, john@yahoo.com.us, testtest321@gmail.com, chorm3636@test.test.test.com]
What is wrong with my pattern or code?
Why is the result a list of tuples that contain an extra "com.", "gmail." or "test."?
Thank you for helping me clear this up!
First, you should never use regular expressions to parse HTML; it is impossible to get right.
Second, once you have a block of text that you want to validate as an email address, search for existing patterns: you will find two to five very good regular expressions on Stack Overflow. Regular expressions are not Python-specific.
Third, look for a better job. Scraping email addresses from websites is not an easy task, and everyone here hates people who spam us.
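As a rough illustration of the first two points, the sketch below extracts the text nodes with a real HTML parser and only then applies a regular expression to that text. It assumes Python 3 and the standard-library html.parser module; the TextCollector class and the sample page string are hypothetical names introduced for this example, and the pattern is the one from the question (with a non-capturing inner group), not a complete email validator.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects text nodes, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for every run of text between tags.
        self.chunks.append(data)


# Sample markup copied from the question's test.php (an assumption for the demo).
page = """<html>
<head></head>
<body>
<a href="www.google.com">josn2051@yahoo.com.tw</a>
<div>john@yahoo.com.tw</div>
testtest321@gmail.com
chorm3636@test.test.test.com
</body>
</html>"""

collector = TextCollector()
collector.feed(page)
collector.close()
visible_text = "\n".join(collector.chunks)

# No capturing groups here, so findall returns the full matches as strings.
email_pattern = re.compile(r'[\w\-\.]+@(?:\w[\w\-]+\.)+[\w\-]+')
emails = email_pattern.findall(visible_text)
print(emails)
```

Because the regular expression only ever sees text nodes, attribute values such as the href are never candidates for a match.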
findall returns one tuple per match because your pattern contains two capturing groups. Make the inner group non-capturing:
([\w\-\.]+@(?:\w[\w\-]+\.)+[\w\-]+)
            ^^
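To see the difference, here is a minimal sketch that runs both patterns over the sample lines from the question. It assumes Python 3 and uses an inline string instead of fetching test.php; the variable names are made up for the example. When a pattern has more than one capturing group, re.findall returns a tuple of groups per match, and a repeated group keeps only its last repetition, which is where the stray "com.", "gmail." and "test." come from.

```python
import re

# Sample text copied from the question's test page (inline instead of an HTTP request).
text = """
<a href="www.google.com">josn2051@yahoo.com.tw</a>
<div>john@yahoo.com.tw</div>
testtest321@gmail.com
chorm3636@test.test.test.com
"""

# Two capturing groups: findall yields (whole address, last repetition of inner group).
capturing = re.compile(r'([\w\-\.]+@(\w[\w\-]+\.)+[\w\-]+)')

# Inner group is non-capturing (?:...): only one group, so findall yields plain strings.
non_capturing = re.compile(r'([\w\-\.]+@(?:\w[\w\-]+\.)+[\w\-]+)')

pairs = capturing.findall(text)
emails = non_capturing.findall(text)

print(pairs)
print(emails)
```

If you need to keep the original pattern for some reason, taking the first element of each tuple gives the same list of addresses.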