
Can't scrape a certain field from a webpage using requests even when that very field is available in page source

I'm trying to scrape an email address from a webpage. The email address is visible in the page source (Ctrl+U), yet I still can't fetch it using requests. All I get is an AttributeError. Any help on this would be appreciated.

webpage link

My current attempt:

import requests
from bs4 import BeautifulSoup

link = "https://www.facebook.com/pg/theultimatecollectionco/about/?ref=page_internal"

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    try:
        email = soup.select_one("a[href^='mailto:']").get("href")
    except AttributeError:
        email = ""
    print(email)

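The page is built with the help of JavaScript, so BeautifulSoup alone cannot see the link as an element (Selenium helps here). Because nothing matches the selector, select_one() returns None, and calling .get() on None raises the AttributeError.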
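
To illustrate how text can be in the page source without being a parsed element, here is a minimal sketch. It swaps in the standard library's html.parser for BeautifulSoup to stay dependency-free, and it assumes, purely for illustration, that the link lives inside a script block; SAMPLE_SOURCE and MailtoCollector are hypothetical names, not taken from the actual page.

```python
from html.parser import HTMLParser


class MailtoCollector(HTMLParser):
    """Collect href values of real <a> elements that start with 'mailto:'."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith("mailto:"):
            self.found.append(href)


# Hypothetical page source: the link exists only as text inside a script block.
SAMPLE_SOURCE = (
    "<html><body><div id='root'></div>"
    "<script>render('<a href=\"mailto:someone@example.com\">Email</a>');</script>"
    "</body></html>"
)

collector = MailtoCollector()
collector.feed(SAMPLE_SOURCE)
collector.close()

print("mailto:" in SAMPLE_SOURCE)  # True: the text is present in the raw source
print(collector.found)             # []: no <a> element was ever parsed
```

The parser treats script contents as plain data, so a selector that looks for an a element has nothing to match even though a text search on the same source succeeds.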

The easiest way is to skip HTML parsing and search the raw page source for any mailto: hrefs with a regular expression:

import re
import html
import requests


link = "https://www.facebook.com/pg/theultimatecollectionco/about/?ref=page_internal"

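# Fetch the raw page source as text; no HTML parsing is involved.
html_doc = requests.get(link).text
# Capture everything between "mailto: and the closing double quote.
for email in re.findall(r'"mailto:([^"]+)"', html_doc):
    # Decode any HTML entities (e.g. &#064; becomes @) in the match.
    print(html.unescape(email))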

Prints:

support@theultimatecollection.co
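
If the same address shows up more than once, or carries a ?subject=... suffix, the regex approach above can be wrapped in a small helper. This is only a sketch under those assumptions: extract_emails and the sample string are hypothetical, and the pattern still matches only double-quoted mailto: values.

```python
import html
import re

MAILTO_RE = re.compile(r'"mailto:([^"]+)"')


def extract_emails(page_source):
    """Return unique addresses from quoted mailto: links, in order of appearance."""
    seen = []
    for raw in MAILTO_RE.findall(page_source):
        # Decode entities, then drop any "?subject=..." query part.
        address = html.unescape(raw).split("?", 1)[0]
        if address and address not in seen:
            seen.append(address)
    return seen


# Made-up source: one normal link and one copy hidden inside an HTML comment.
sample = (
    '<a href="mailto:info&#064;example.com?subject=Hi">Mail</a>'
    '<code><!-- <a href="mailto:info&#64;example.com">Mail</a> --></code>'
)
print(extract_emails(sample))  # ['info@example.com']
```

With a live page, the call would be extract_emails(requests.get(link).text).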
