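What is the proper URL to scrape this website with Python and JSON?
I am trying to scrape this website --> https://ucr.gov/enforcement/1000511 . It used to work with the code below, and then it stopped. I can't get JSON, or anything else, out of the response.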
import requests

query = "1000511"
url = 'https://ucr.gov/api/enforcement/{}'.format(query)
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://ucr.gov/enforcement/{}'.format(query),
    'Cache-Control': 'no-cache,no-store,must-revalidate,max-age=0,private',
    'content-type': 'application/json;charset=UTF-8',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-site',
    'UCR-UI-Version': '20.5.4',
    'Origin': 'https://ucr.gov',
    'Connection': 'keep-alive',
}
s = requests.Session()
params = (
    ('pageNumber', '0'),
    ('itemsPerPage', '15'),
)
response = s.get(url, headers=headers, params=params)
response.json()
The expected content can be seen here: https://ucr.gov/enforcement/1000511
Instead, I get this error:
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Again, this worked a few weeks ago. Please help me find the mistake.
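A side note on the error itself: "Expecting value: line 1 column 1 (char 0)" only says that the body does not start with a JSON value, which is what happens when the server answers with an empty body or an HTML page. The sketch below shows one way to surface what actually came back before parsing. `decode_json_body` is a hypothetical helper written for this illustration, not part of `requests`; with `requests` you would pass it `response.status_code`, `response.headers.get('Content-Type')` and `response.text`.

```python
import json


def decode_json_body(status_code, content_type, body):
    """Parse `body` as JSON, or raise ValueError describing what was received.

    Hypothetical helper for illustration only.
    """
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        # Include the status, content type and the first characters of the
        # body so that an HTML page or an empty reply is easy to recognise.
        preview = body[:80]
        raise ValueError(
            f"not JSON (status={status_code}, content-type={content_type!r}, "
            f"body starts with {preview!r})"
        ) from exc
```

If the message shows `text/html` or an empty body, the URL or the HTTP method is probably not the one the page itself uses.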
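Correction 1: I originally posted the URL as:
url = 'https://admin.ucr.gov/api/enforcement/{}'.format(query)
That is how it used to work. Now I see that the site uses the same URL but without "admin" (the code above has been changed accordingly). But I still don't get any of the results/content that you see when you visit: https://ucr.gov/enforcement/1000511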
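Using (for example) Chrome's DevTools, you can see that the page makes a POST request to https://admin.ucr.gov/api/enforcement.
You can then copy that request as cURL and try it on the command line to work out which headers are actually required: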
$ curl 'https://admin.ucr.gov/api/enforcement' \
> -H 'authority: admin.ucr.gov' \
> -H 'accept: application/json, text/plain, */*' \
> -H 'cache-control: no-cache,no-store,must-revalidate,max-age=0,private' \
> -H 'ucr-ui-version: 20.5.4' \
> -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36' \
> -H 'dnt: 1' \
> -H 'content-type: application/json;charset=UTF-8' \
> -H 'origin: https://ucr.gov' \
> -H 'sec-fetch-site: same-site' \
> -H 'sec-fetch-mode: cors' \
> -H 'sec-fetch-dest: empty' \
> -H 'referer: https://ucr.gov/enforcement/1000511' \
> -H 'accept-language: it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7' \
> --data-binary '{"searchTerm":"1000511","itemsPerPage":15,"pageNumber":0}' \
> --compressed
{"carrier":{"usdot":1000511,"legalName":"877599 ALBERTA LTD","dateAdded":"2002-01-24T00:00:00Z","physicalAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"},"mailingAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"}},"history":{"enforcementRegistrations":[{"year":2020,"status":"unregistered","updateTime":"2020-06-15T16:12:53.5166421+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:12"},{"year":2019,"status":"unregistered","updateTime":"2020-06-15T16:12:53.53272+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:12"},{"year":2018,"status":"unregistered","updateTime":"2020-06-15T16:12:53.5486724+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:12"},{"year":2017,"status":"unregistered","updateTime":"2020-06-15T16:12:53.5646021+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:12"}]}}
Now you can try removing the headers one at a time, and you will find that this reduced request still succeeds:
$ curl 'https://admin.ucr.gov/api/enforcement' --data-binary '{"searchTerm":"1000511"}' -H 'ucr-ui-version: 20.5.4' -H 'content-type: application/json;charset=UTF-8'
{"carrier":{"usdot":1000511,"legalName":"877599 ALBERTA LTD","dateAdded":"2002-01-24T00:00:00Z","physicalAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"},"mailingAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"}},"history":{"enforcementRegistrations":[{"year":2020,"status":"unregistered","updateTime":"2020-06-15T16:20:41.3271743+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:20"},{"year":2019,"status":"unregistered","updateTime":"2020-06-15T16:20:41.3951487+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:20"},{"year":2018,"status":"unregistered","updateTime":"2020-06-15T16:20:41.468421+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:20"},{"year":2017,"status":"unregistered","updateTime":"2020-06-15T16:20:41.5511652+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:20"}]}}
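The one-at-a-time elimination can also be automated. The following is a minimal sketch, not something from the original answer: `minimal_headers` and the `request_succeeds` callback are hypothetical names, and the callback is assumed to send the request with the given headers and return True when the server still replies with the expected JSON.

```python
def minimal_headers(headers, request_succeeds):
    """Return the subset of `headers` that the request appears to need.

    Each header is dropped in turn; the removal is kept only if
    `request_succeeds(trial_headers)` still returns True.
    """
    required = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in required.items() if k != name}
        if request_succeeds(trial):
            required = trial  # header was not needed, keep it removed
    return required
```

Each call to `request_succeeds` is a real request against the site, so in practice it is worth pausing between attempts. For the request above, the result would be expected to contain only `ucr-ui-version` and `content-type`, matching the reduced curl command.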
Now convert the curl call to Python. Note that the call is a POST, not a GET as in your code:
In [1]: import requests
In [2]: import io
...: response = requests.post('https://admin.ucr.gov/api/enforcement', data=io.StringIO('{"searchTerm":"1000511"}'), headers={'ucr-ui-version': '20.5.4', 'content-type': 'application/json;charset=UTF-8'})
In [3]: response.status_code
Out[3]: 200
In [4]: response.json()
Out[4]:
{'carrier': {'usdot': 1000511,
'legalName': '877599 ALBERTA LTD',
'dateAdded': '2002-01-24T00:00:00Z',
'physicalAddress': {'street': '430 66 STREET SW',
'city': 'EDMONTON',
'state': 'AB',
'region': 'CAAB',
'zipCode': 'T6X 1A3',
'country': 'C',
'countryCode': 'CA'},
'mailingAddress': {'street': '430 66 STREET SW',
'city': 'EDMONTON',
'state': 'AB',
'region': 'CAAB',
'zipCode': 'T6X 1A3',
'country': 'C',
'countryCode': 'CA'}},
'history': {'enforcementRegistrations': [{'year': 2020,
'status': 'unregistered',
'updateTime': '2020-06-15T16:23:10.2114276+00:00',
'isApplicable': True,
'isYearActive': True,
'updateTimeDisplay': '06/15/2020 16:23'},
{'year': 2019,
'status': 'unregistered',
'updateTime': '2020-06-15T16:23:10.278671+00:00',
'isApplicable': True,
'isYearActive': True,
'updateTimeDisplay': '06/15/2020 16:23'},
{'year': 2018,
'status': 'unregistered',
'updateTime': '2020-06-15T16:23:10.3507073+00:00',
'isApplicable': True,
'isYearActive': False,
'updateTimeDisplay': '06/15/2020 16:23'},
{'year': 2017,
'status': 'unregistered',
'updateTime': '2020-06-15T16:23:10.4026579+00:00',
'isApplicable': True,
'isYearActive': False,
'updateTimeDisplay': '06/15/2020 16:23'}]}}
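For use in a script rather than an interactive session, the same POST can be wrapped in a couple of functions. This is a sketch under assumptions: it uses only the standard library (`urllib.request`) instead of `requests`, the function names are made up for this example, and the endpoint and the `ucr-ui-version` value `20.5.4` are simply the ones shown above, which the site may change at any time.

```python
import json
import urllib.request

# Endpoint and header values are taken from the curl call above; they are
# assumptions about the site at the time of writing and may stop working.
API_URL = "https://admin.ucr.gov/api/enforcement"


def build_enforcement_request(usdot, ui_version="20.5.4"):
    """Build (but do not send) the POST request for one USDOT number."""
    payload = json.dumps({"searchTerm": str(usdot)}).encode("utf-8")
    headers = {
        "ucr-ui-version": ui_version,
        "content-type": "application/json;charset=UTF-8",
    }
    return urllib.request.Request(
        API_URL, data=payload, headers=headers, method="POST"
    )


def fetch_enforcement(usdot, timeout=10):
    """Send the request and return the decoded JSON body as a dict."""
    request = build_enforcement_request(usdot)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def registration_status_by_year(data):
    """Map each year in history.enforcementRegistrations to its status."""
    registrations = data["history"]["enforcementRegistrations"]
    return {entry["year"]: entry["status"] for entry in registrations}
```

Calling `registration_status_by_year(fetch_enforcement("1000511"))` would perform the actual request; with the response shown above it would map each of the years 2017 to 2020 to `'unregistered'`.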