Can't scrape from a specific website using python requests

I'm trying to scrape the URL below, but the response doesn't contain the content I see when I open the page in a browser (the content of a public customer case/story). I also tried simulating a real browser with headers, but with no success so far. Any tips?

URL: https://customers.microsoft.com/en-us/story/767633-asos-retailer-azure-active-directory-m365

import requests
main_url = "https://customers.microsoft.com/en-us/story/767633-asos-retailer-azure-active-directory-m365"
result = requests.get(main_url)   
print(result.text)

The page gets its data from an external API, which is why the story text is missing from the HTML that requests.get returns. You just need to make a call to:

GET https://customers.microsoft.com/en-us/api/search?key=STORY_KEY

STORY_KEY is 767633-asos-retailer-azure-active-directory-m365, i.e. the text after the last slash in the URL. You could use a script like the following:

import requests

url = "https://customers.microsoft.com/en-us/story/767633-asos-retailer-azure-active-directory-m365"

# The story key is the text after the last slash in the URL
r = requests.get(
    "https://customers.microsoft.com/en-us/api/search",
    params={
        "key": url.rsplit('/', 1)[1]
    }
)
r.raise_for_status()  # fail early on an HTTP error instead of on a JSON decode error
document = r.json()["search_document"]

# Pick the fields of interest out of the returned document
summary = document["story_exec_summary"]
body = document["story_body_text_2"]
quote1 = document["story_quote_carousel"]
quote2 = document["story_quote_carousel_2"]

print(summary)
print(body)
print(quote1)
print(quote2)
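
The key extraction in the script above relies on the URL ending exactly at the story key. As a minimal sketch, assuming you may also receive URLs with a trailing slash or a query string, a small helper (story_key_from_url is a hypothetical name, not part of any library) can make that step more tolerant:

```python
from urllib.parse import urlsplit


def story_key_from_url(url: str) -> str:
    """Return the last non-empty path segment of a story URL (hypothetical helper)."""
    path = urlsplit(url).path  # keeps only the path, dropping ?query and #fragment
    return path.rstrip("/").rsplit("/", 1)[-1]


url = "https://customers.microsoft.com/en-us/story/767633-asos-retailer-azure-active-directory-m365"
key = story_key_from_url(url)
print(key)
```

The result can be passed as the "key" parameter in place of url.rsplit('/', 1)[1].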

Note that you will need to inspect the document object to find the fields you are looking for (videos, body3, etc.).
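
One way to do that inspection is to list the fields that actually carry text. The sketch below is only an illustration: non_empty_text_fields is a hypothetical helper, the "story_" prefix is an assumption based on the field names used above, and sample_document is made-up placeholder data standing in for r.json()["search_document"].

```python
def non_empty_text_fields(document: dict, prefix: str = "story_") -> dict:
    """Return the string fields whose name starts with prefix and that are not blank."""
    return {
        name: value
        for name, value in sorted(document.items())
        if name.startswith(prefix) and isinstance(value, str) and value.strip()
    }


# Placeholder data for illustration only; the real document comes from the API response.
sample_document = {
    "story_exec_summary": "placeholder summary",
    "story_body_text_2": "placeholder body",
    "story_quote_carousel": "",
    "unrelated_field": "ignored",
}

fields = non_empty_text_fields(sample_document)
for name, value in fields.items():
    # Print a short preview so long story bodies do not flood the terminal
    print(f"{name}: {value[:80]}")
```

Running the same helper on the real document should give a quick overview of which fields are worth extracting.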

You need to handle certificates properly. This requires additional packages:

pip install certifi
pip install urllib3

We also need to use a different Python library, i.e. urllib3:

python
Python 3.7.7 (default, Mar 10 2020, 15:43:33)
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> import certifi
>>> import urllib3
>>>
>>> http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
>>> main_url = "https://customers.microsoft.com/en-us/story/767633-asos-retailer-azure-active-directory-m365"
>>>
>>> r = http.request('GET', main_url)
>>> r.status
200
>>> r.data

>>> open("stories.html", "wb").write(r.data)

Output:

>>> r.data
b'\r\n<!doctype html>\r\n<html lang="en" xml:lang="en" dir="ltr">\r\n<head prefix="og: http://ogp.me/ns#">\r\n    <meta charset="utf-8" />\r\n    <meta name="viewport" content="width=device-width, initial-scale=1.0" />\r\n    <meta name="description" content="Microsoft customer stories. See how Microsoft tools help companies run their business.">\r\n    <meta name="keywords" content="Microsoft, customers, stories, business, software, tools, services, use case, global, collaboration, vendor, story sear .....

Let me know if this helps.
