I'm trying to scrape Product Hunt with the following code:
import requests
headers = {
'authority': 'www.producthunt.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'upgrade-insecure-requests': '1',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
}
response = requests.get('https://www.producthunt.com/', headers=headers)
The response body can't be parsed as JSON. I tried replacing the quote characters with response.text.replace(), and I tried stripping a JSONP wrapper with json.loads(re.sub(r'^jsonp\d+(|)\s+$', '', response.text)), but I still get the same error.
Error:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Thoughts?
The problem has nothing to do with JSON.
You're requesting a web page, not a JSON API, so the body is HTML even if you ask for JSON:
$ curl -sH 'Accept: application/json' https://www.producthunt.com/ | head -c 200
<!DOCTYPE html><html lang="en"><head><title>Product Hunt – The best new products in tech.</title><link rel="canonical" href="https://www.producthunt.com/"/><meta name="description" content="Product %
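A quick way to catch this before calling json.loads is to look at the Content-Type the server sent back. The sketch below is only illustrative: parse_body is a hypothetical helper name, and the two sample inputs are made up rather than taken from the real site.

```python
import json


def parse_body(content_type, body):
    """Return parsed JSON if the server says the body is JSON, else None."""
    # Content-Type may carry parameters, e.g. "text/html; charset=utf-8"
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json" or media_type.endswith("+json"):
        return json.loads(body)
    return None


# With requests this would be called as:
#   parse_body(response.headers.get("Content-Type", ""), response.text)
print(parse_body("text/html; charset=utf-8", "<!DOCTYPE html><html>"))  # None
print(parse_body("application/json", '{"ok": true}'))  # {'ok': True}
```

For an HTML page like the one above, the helper returns None instead of raising JSONDecodeError, which makes it obvious that the body needs an HTML parser instead.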
Use BeautifulSoup or Selenium WebDriver instead to extract the HTML content, then convert what you need to JSON.
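Here is a rough sketch of the "extract from HTML, then convert to JSON" step. To keep it dependency-free it uses the standard library's html.parser as a stand-in for BeautifulSoup. The PageSummary class, the html_to_json function, and the sample markup are all made up for illustration; they are not Product Hunt's real page structure.

```python
import json
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the <title> text and every <a href> value from a document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is kept
        if self._in_title:
            self.title += data


def html_to_json(html):
    """Parse an HTML string and return a JSON string with title and links."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return json.dumps({"title": parser.title.strip(), "links": parser.links})


# Made-up sample; in the question this would be response.text
sample = (
    "<html><head><title>Product Hunt</title></head>"
    '<body><a href="/posts/example">Example</a></body></html>'
)
print(html_to_json(sample))
```

Keep in mind that this only sees what is in the initial HTML. If the data you want is rendered by JavaScript, a browser-driven tool such as Selenium is the option that applies.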
In practice, the site itself loads its data through GraphQL at https://www.producthunt.com/frontend/graphql