
Difficulties with JSON using Python's requests library

I'm trying to scrape Product Hunt with the following code:

import requests

headers = {
    'authority': 'www.producthunt.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'upgrade-insecure-requests': '1',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9',
}

response = requests.get('https://www.producthunt.com/', headers=headers)

The response body isn't a valid string to convert to JSON. I tried swapping the quote characters with response.text.replace() and stripping a possible JSONP wrapper with json.loads(re.sub(r'^jsonp\d+(|)\s+$', '', response.text)), but I still get the same error.

Error:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Thoughts?

The problem has nothing to do with JSON.

You're requesting a web page, not a JSON API:

$ curl -sH 'Accept: application/json' https://www.producthunt.com/ | head -c 200
<!DOCTYPE html><html lang="en"><head><title>Product Hunt – The best new products in tech.</title><link rel="canonical" href="https://www.producthunt.com/"/><meta name="description" content="Product %
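
One way to catch this before calling json.loads is to look at the Content-Type header of the response. The sketch below is only an illustration: the helper name json_or_none is hypothetical, and the two SimpleNamespace objects stand in for real responses so the example runs without network access. It should work the same way with a requests.Response, which also exposes .headers and .text.

```python
import json
from types import SimpleNamespace


def json_or_none(response):
    """Return parsed JSON if the response looks like JSON, else None.

    `response` is anything with `.headers` and `.text`, such as a
    requests.Response. The helper name is hypothetical.
    """
    content_type = response.headers.get("Content-Type", "")
    if "json" not in content_type.lower():
        # A normal web page typically reports something like "text/html".
        return None
    try:
        return json.loads(response.text)
    except json.JSONDecodeError:
        # The header claimed JSON but the body was not valid JSON.
        return None


# Stand-ins for real responses (assumed values, not captured from the site).
html_page = SimpleNamespace(
    headers={"Content-Type": "text/html; charset=utf-8"},
    text='<!DOCTYPE html><html lang="en"><head></head></html>',
)
api_reply = SimpleNamespace(
    headers={"Content-Type": "application/json"},
    text='{"ok": true}',
)

print(json_or_none(html_page))
print(json_or_none(api_reply))
```

With the request from the question, an HTML body starts with "<", which is why json.loads fails at line 1, column 1.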

Instead, use BeautifulSoup (or Selenium WebDriver, if the content you need is rendered by JavaScript) to extract the data from the HTML, then convert it to JSON if that's what you need.
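
The extract-then-convert step can be sketched as follows. To stay runnable without extra packages, this version uses the standard library's html.parser in place of BeautifulSoup. The class name HeadExtractor and the choice of fields are assumptions for illustration, and the sample HTML is a shortened copy of the curl output above, not a live response.

```python
import json
from html.parser import HTMLParser


class HeadExtractor(HTMLParser):
    """Collect the <title> text and canonical URL from an HTML document."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.data = {"title": "", "canonical": None}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "link" and attrs.get("rel") == "canonical":
            self.data["canonical"] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.data["title"] += data


# Shortened sample based on the curl output; in practice pass response.text.
sample_html = (
    '<!DOCTYPE html><html lang="en"><head>'
    "<title>Product Hunt – The best new products in tech.</title>"
    '<link rel="canonical" href="https://www.producthunt.com/"/>'
    "</head><body></body></html>"
)

extractor = HeadExtractor()
extractor.feed(sample_html)

# Convert the extracted fields to a JSON string.
as_json = json.dumps(extractor.data, ensure_ascii=False)
print(as_json)
```

With BeautifulSoup the same idea would be a few lookups on the parsed tree. The point is that the JSON is built from data you extract, not parsed directly from the page.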


In fact, the site itself loads its data through a GraphQL endpoint at https://www.producthunt.com/frontend/graphql


 