What is the proper URL to scrape this website with python and json?

I'm trying to scrape this website --> https://ucr.gov/enforcement/1000511. It used to work with the code below, and then it stopped. I can't get JSON or any content in the response.

import requests

query = "1000511"

url = 'https://ucr.gov/api/enforcement/{}'.format(query)


headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://ucr.gov/enforcement/{}'.format(query),
    'Cache-Control': 'no-cache,no-store,must-revalidate,max-age=0,private',
    'content-type': 'application/json;charset=UTF-8',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-site',
    'UCR-UI-Version': '20.5.4',
    'Origin': 'https://ucr.gov',
    'Connection': 'keep-alive',
}

s = requests.Session()

params = (
    ('pageNumber', '0'),
    ('itemsPerPage', '15'),
)

response = s.get(url, headers=headers, params=params)

response.json()

The expected content can be seen here: https://ucr.gov/enforcement/1000511

Instead, I get this error:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Again, this used to work a few weeks ago. Please help me find the error.

Correction 1: I originally posted the url as:
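For what it's worth, a `JSONDecodeError` at `line 1 column 1 (char 0)` means the body handed to the JSON decoder did not start with a JSON value, typically because it was empty or was an HTML page. Below is a minimal sketch for inspecting a response before decoding it. `summarize_response` is a hypothetical helper name (not part of `requests`), and it takes plain values so it can be tried without a network call.

```python
import json


def summarize_response(status_code, content_type, body):
    """Return (parsed_json, None) on success, or (None, reason) on failure."""
    if not body.strip():
        # Nothing to decode: this is what produces "Expecting value ... (char 0)".
        return None, f"empty body (HTTP {status_code})"
    try:
        return json.loads(body), None
    except json.JSONDecodeError as exc:
        preview = body[:60].replace("\n", " ")
        return None, (
            f"not JSON (HTTP {status_code}, Content-Type {content_type!r}): "
            f"{exc.msg}; body starts with {preview!r}"
        )
```

With a `requests` response this could be called as `summarize_response(response.status_code, response.headers.get("Content-Type"), response.text)`, which shows what the server actually sent instead of raising.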

url = 'https://admin.ucr.gov/api/enforcement/{}'.format(query)

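That is how it used to work. Now I see that the site uses the same url but without "admin" (the code above has been changed accordingly). But I still don't get any of the results/content I would expect from visiting: https://ucr.gov/enforcement/1000511

Using (for example) Chrome's DevTools, you can see which call the page makes.

You can then copy it as cURL and try on the command line which headers are needed for it to work: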

$ curl 'https://admin.ucr.gov/api/enforcement' \
>   -H 'authority: admin.ucr.gov' \
>   -H 'accept: application/json, text/plain, */*' \
>   -H 'cache-control: no-cache,no-store,must-revalidate,max-age=0,private' \
>   -H 'ucr-ui-version: 20.5.4' \
>   -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36' \
>   -H 'dnt: 1' \
>   -H 'content-type: application/json;charset=UTF-8' \
>   -H 'origin: https://ucr.gov' \
>   -H 'sec-fetch-site: same-site' \
>   -H 'sec-fetch-mode: cors' \
>   -H 'sec-fetch-dest: empty' \
>   -H 'referer: https://ucr.gov/enforcement/1000511' \
>   -H 'accept-language: it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7' \
>   --data-binary '{"searchTerm":"1000511","itemsPerPage":15,"pageNumber":0}' \
>   --compressed
{"carrier":{"usdot":1000511,"legalName":"877599 ALBERTA LTD","dateAdded":"2002-01-24T00:00:00Z","physicalAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"},"mailingAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"}},"history":{"enforcementRegistrations":[{"year":2020,"status":"unregistered","updateTime":"2020-06-15T16:12:53.5166421+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:12"},{"year":2019,"status":"unregistered","updateTime":"2020-06-15T16:12:53.53272+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:12"},{"year":2018,"status":"unregistered","updateTime":"2020-06-15T16:12:53.5486724+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:12"},{"year":2017,"status":"unregistered","updateTime":"2020-06-15T16:12:53.5646021+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:12"}]}}

Now you can try removing the headers one by one, and you will find that this request succeeds:

$ curl 'https://admin.ucr.gov/api/enforcement' --data-binary '{"searchTerm":"1000511"}'    -H 'ucr-ui-version: 20.5.4'    -H 'content-type: application/json;charset=UTF-8'
{"carrier":{"usdot":1000511,"legalName":"877599 ALBERTA LTD","dateAdded":"2002-01-24T00:00:00Z","physicalAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"},"mailingAddress":{"street":"430 66 STREET SW","city":"EDMONTON","state":"AB","region":"CAAB","zipCode":"T6X 1A3","country":"C","countryCode":"CA"}},"history":{"enforcementRegistrations":[{"year":2020,"status":"unregistered","updateTime":"2020-06-15T16:20:41.3271743+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:20"},{"year":2019,"status":"unregistered","updateTime":"2020-06-15T16:20:41.3951487+00:00","isApplicable":true,"isYearActive":true,"updateTimeDisplay":"06/15/2020 16:20"},{"year":2018,"status":"unregistered","updateTime":"2020-06-15T16:20:41.468421+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:20"},{"year":2017,"status":"unregistered","updateTime":"2020-06-15T16:20:41.5511652+00:00","isApplicable":true,"isYearActive":false,"updateTimeDisplay":"06/15/2020 16:20"}]}}
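The "remove the headers one by one" step can also be scripted. The sketch below is only an illustration of that elimination loop: `minimal_headers` and the `request_ok` callback are hypothetical names, and `request_ok` is assumed to send the request with the given headers and return `True` when the response is usable (for example, HTTP 200 with a JSON body).

```python
def minimal_headers(headers, request_ok):
    """Greedily drop every header whose removal still lets the request succeed.

    headers    -- dict of candidate headers (e.g. copied from DevTools)
    request_ok -- callable taking a headers dict and returning True/False
    """
    needed = dict(headers)
    for name in list(headers):
        # Try the request without this one header.
        trial = {k: v for k, v in needed.items() if k != name}
        if request_ok(trial):
            needed = trial  # the header was not required, keep it out
    return needed
```

Each candidate costs one real request, so for a handful of headers this is cheap; the input dict is left unmodified.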

Now translate the curl call to python. Note that the call is a POST, not a GET as in your code:

In [1]: import requests

In [2]: import io
   ...: response = requests.post('https://admin.ucr.gov/api/enforcement', data=io.StringIO('{"searchTerm":"1000511"}'), headers={'ucr-ui-version': '20.5.4', 'content-type': 'application/json;charset=UTF-8'})

In [3]: response.status_code
Out[3]: 200

In [4]: response.json()
Out[4]: 
{'carrier': {'usdot': 1000511,
  'legalName': '877599 ALBERTA LTD',
  'dateAdded': '2002-01-24T00:00:00Z',
  'physicalAddress': {'street': '430 66 STREET SW',
   'city': 'EDMONTON',
   'state': 'AB',
   'region': 'CAAB',
   'zipCode': 'T6X 1A3',
   'country': 'C',
   'countryCode': 'CA'},
  'mailingAddress': {'street': '430 66 STREET SW',
   'city': 'EDMONTON',
   'state': 'AB',
   'region': 'CAAB',
   'zipCode': 'T6X 1A3',
   'country': 'C',
   'countryCode': 'CA'}},
 'history': {'enforcementRegistrations': [{'year': 2020,
    'status': 'unregistered',
    'updateTime': '2020-06-15T16:23:10.2114276+00:00',
    'isApplicable': True,
    'isYearActive': True,
    'updateTimeDisplay': '06/15/2020 16:23'},
   {'year': 2019,
    'status': 'unregistered',
    'updateTime': '2020-06-15T16:23:10.278671+00:00',
    'isApplicable': True,
    'isYearActive': True,
    'updateTimeDisplay': '06/15/2020 16:23'},
   {'year': 2018,
    'status': 'unregistered',
    'updateTime': '2020-06-15T16:23:10.3507073+00:00',
    'isApplicable': True,
    'isYearActive': False,
    'updateTimeDisplay': '06/15/2020 16:23'},
   {'year': 2017,
    'status': 'unregistered',
    'updateTime': '2020-06-15T16:23:10.4026579+00:00',
    'isApplicable': True,
    'isYearActive': False,
    'updateTimeDisplay': '06/15/2020 16:23'}]}}
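The session above could be wrapped into a small script along these lines. This is a sketch under assumptions, not a verified client: the URL, the two headers, and the `searchTerm` payload come from the calls shown above and may have changed since, while the names `build_request`, `registration_status`, and `fetch_carrier` are made up for illustration. The body is passed as a plain JSON string rather than an `io.StringIO`, which `requests.post` also accepts.

```python
import json

API_URL = "https://admin.ucr.gov/api/enforcement"
HEADERS = {
    "ucr-ui-version": "20.5.4",  # value seen in the captured call; may be outdated
    "content-type": "application/json;charset=UTF-8",
}


def build_request(usdot):
    """Return keyword arguments for requests.post for one USDOT number."""
    return {
        "url": API_URL,
        "data": json.dumps({"searchTerm": str(usdot)}),
        "headers": dict(HEADERS),
    }


def registration_status(result):
    """Map each year to its status from a parsed API response."""
    registrations = result.get("history", {}).get("enforcementRegistrations", [])
    return {entry["year"]: entry["status"] for entry in registrations}


def fetch_carrier(usdot, timeout=10):
    """POST the search and return the decoded JSON (needs network access)."""
    import requests  # imported here so the helpers above work without it

    response = requests.post(timeout=timeout, **build_request(usdot))
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of at .json()
    return response.json()
```

For example, `registration_status(fetch_carrier("1000511"))` would give a year-to-status mapping, assuming the endpoint still answers as it did in the session above.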
