I'm trying to fetch an email from a webpage using requests module. The problem is, the email address seems to be encoded or something, which is why it is unreadable, and I wish to decode it in its usual form.
import requests
from bs4 import BeautifulSoup
link = 'https://global-standard.org/find-suppliers-shops-and-inputs/certified-suppliers/database/search_result/38996'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
}
res = requests.get(link,headers=headers)
soup = BeautifulSoup(res.text,"html.parser")
email = soup.select_one("script[type='text/javascript']:-soup-contains('emailProtector')").contents[0]
print(email)
When I run the above script, the following is what I get:
emailProtector.addCloakedMailto("ep_586c4771", 1);
This is the result I'm after:
fttextilegroup2017@gmail.com
You can try:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://global-standard.org/find-suppliers-shops-and-inputs/certified-suppliers/database/search_result/38996'
def decloak(cloaked_tag, attr_name):
a, b = "" , ""
for span in cloaked_tag.select('span'):
for attr in span.attrs:
if attr == attr_name:
a += span[attr]
else:
b = span[attr] + b
return a + b
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
attr_name = re.search(r'nodeName\.toLowerCase\(\)\.indexOf\("(.*?)"', str(soup)).group(1)
mail = decloak(soup.select_one('.cloaked_email'), attr_name)
print(mail)
Prints:
fttextilegroup2017@gmail.com
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.