Error scraping data from a website with "#" in the URL

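I am trying to scrape data with Python from a website that has a # in its URL (http://www.epa.ie/hydronet/#Water%20Levels), but when I parse the response as an HTML file I get the following error message: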

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd">

<html><head><title>Bad Request</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/></head>
<body><h2>Bad Request - Invalid URL</h2>
<hr/><p>HTTP Error 400. The request URL is invalid.</p>
</body></html>

Any help is appreciated.
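
For context on the # itself: everything after # in a URL is a fragment, which a browser handles on the client side and does not include in the HTTP request it sends. The following is a minimal sketch, assuming only the standard library's urllib.parse, of how the fragment can be split off before making a request. Whether the fragment is what triggered the 400 here depends on the client code, which the question does not show.

```python
from urllib.parse import urldefrag, unquote

page_url = "http://www.epa.ie/hydronet/#Water%20Levels"

# urldefrag separates the part a server sees from the client-side fragment.
base_url, fragment = urldefrag(page_url)

print(base_url)           # http://www.epa.ie/hydronet/
print(unquote(fragment))  # Water Levels
```

Requesting base_url returns only the page shell; the station data itself comes from separate requests made by the page's JavaScript.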

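The data on that page is loaded dynamically via Ajax. Looking at the Firefox network inspector, there are many JSON data files being loaded, for example this one (warning, huge!):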

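import requests
from pprint import pprint

# One of the JSON files the page requests in the background (large response).
url = "http://www.epa.ie/Hydronet/output/internet/layers/10/index.json"

response = requests.get(url)
response.raise_for_status()  # stop here on an HTTP error status
data = response.json()       # decode the JSON body
pprint(data)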

This prints the data for about 3,000 stations in the following format:

...
 {'L1_CATCHMENT_SIZE': '0.00 km²',
  'L1_DATA_AVAILABLE': 'Water Level Only',
  'L1_LTA_RAINFALL_1961_1990': '',
  'L1_ObjectDescription': '',
  'L1_RESPONSIBLE_BODY': 'Waterways Ireland',
  'L1_STATION_OWNER': 'Waterways Ireland',
  'L1_TYPE_OF_GAUGING': 'Recorder',
  'L1_WEB_GW_height_system_suffix': '',
  'L1_WTO_OBJECT': 'GRAND CANAL',
  'L1_Web_Desc': '',
  'L1_Web_ELT_95PERCENTILE': '',
  'L1_Web_E_50PERCENTILE': '',
  'L1_Web_Legend': 'Active Waterways Ireland',
  'L1_Web_Link': '<p><strong><b><a '
                 'href="http://netview.ott.com/waterwaysireland-le/">CLICK '
                 'HERE for Waterways Ireland Station Data</a></strong><b></p>',
  'L1_admin_name': '---',
  'L1_area_name': '',
  'L1_label': 'Stage',
  'L1_req_timestamp': None,
  'L1_river_name': 'GRAND CANAL',
  'L1_station_GWREF_DATUM': '',
  'L1_station_gauge_datum': '0.0',
  'L1_station_gauge_datum_unit': '---',
  'L1_station_status': 'Active',
  'L1_stationparameter_name': 'Stage',
  'L1_stationparameter_no': 'S',
  'L1_timestamp': None,
  'L1_ts_id': 40185010,
  'L1_ts_name': 'StaffGaugeCheck',
  'L1_ts_unitsymbol': 'm',
  'L1_ts_value': None,
  'L1_web_type_gw': '',
  'L1_web_waterbody': '',
  'metadata_CATCHMENT_SIZE': '0.00 km²',
  'metadata_RESPONSIBLE_BODY': 'Waterways Ireland',
  'metadata_STATION_OWNER': 'Waterways Ireland',
  'metadata_TYPE_OF_GAUGING': 'Recorder',
  'metadata_WTO_OBJECT': 'GRAND CANAL',
  'metadata_Web_ELT_95PERCENTILE': '',
  'metadata_Web_E_50PERCENTILE': '',
  'metadata_Web_Legend': 'Active Waterways Ireland',
  'metadata_admin_name': '---',
  'metadata_area_name': '',
  'metadata_catchment_name': 'Shannon',
  'metadata_river_name': 'GRAND CANAL',
  'metadata_station_carteasting': '224943.99999999997',
  'metadata_station_cartnorthing': '225708.0000000003',
  'metadata_station_gauge_datum_unit': '---',
  'metadata_station_id': '3049391',
  'metadata_station_latitude': '53.281124634829155',
  'metadata_station_local_x': '224943.99999999997',
  'metadata_station_local_y': '225708.0000000003',
  'metadata_station_longitude': '-7.625991577700672',
  'metadata_station_name': 'KIRKWINS BR',
  'metadata_station_no': '25069',
  'metadata_station_status': 'Active'},

  ... and so on

There are a few other data files; check the network inspector for their URLs.
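
As a rough sketch of how several of these files could be fetched once their URLs are known: the only layer numbers that appear in this answer are 10 and 20, and the helper below simply assumes that other files follow the same .../layers/<id>/index.json pattern, which should be confirmed in the network inspector. The names BASE, layer_url and fetch_layer are made up for illustration, and the standard library's urllib.request is used here instead of requests so the sketch has no third-party dependency.

```python
import json
from urllib.request import urlopen

# Assumption: every layer file lives under this path, as the two URLs in this answer do.
BASE = "http://www.epa.ie/Hydronet/output/internet/layers"


def layer_url(layer_id):
    """Build the index.json URL for one layer id (hypothetical helper)."""
    return f"{BASE}/{layer_id}/index.json"


def fetch_layer(layer_id, timeout=30):
    """Download one layer file and return the decoded JSON."""
    with urlopen(layer_url(layer_id), timeout=timeout) as response:
        return json.load(response)


# Example use (performs real network requests, so it is left commented out):
# for layer_id in (10, 20):
#     print(layer_id, len(fetch_layer(layer_id)))
```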

EDIT:

To print 'metadata_station_name' and 'L1_ts_value', you can use this code:

import requests

url = "http://epa.ie/Hydronet/output/internet/layers/20/index.json"

response = requests.get(url)
response.raise_for_status()  # stop here on an HTTP error status
data = response.json()       # a list with one dict per station

for station in data:
    print(station['metadata_station_name'], station['L1_ts_value'])
    print('-' * 80)

Prints:

BALLYMAN None
--------------------------------------------------------------------------------
CARRIGAHORIG 0.351
--------------------------------------------------------------------------------
KILCOLGAN 0.668
--------------------------------------------------------------------------------
CASTLEMARTYR None
--------------------------------------------------------------------------------
BALLEA 0.376
--------------------------------------------------------------------------------
CAHERFINESKER 0.000
--------------------------------------------------------------------------------
PORTUMNA 701.200
--------------------------------------------------------------------------------
... and so on.
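
Several stations in the output above report None instead of a reading. A minimal sketch of skipping those and converting the rest to numbers follows; it assumes 'L1_ts_value' is either None or something float() can parse (the printed "0.000" suggests a string, but that is a guess). The function name latest_readings and the sample list, hand-built from the output above, are illustrative only.

```python
def latest_readings(stations):
    """Yield (station name, value as float) for stations that report a value."""
    for station in stations:
        raw = station.get("L1_ts_value")
        if raw is None or raw == "":
            continue  # no reading available for this station
        try:
            value = float(raw)
        except (TypeError, ValueError):
            continue  # skip anything that is not numeric
        yield station["metadata_station_name"], value


# Hand-built sample mirroring three rows of the printed output.
sample = [
    {"metadata_station_name": "BALLYMAN", "L1_ts_value": None},
    {"metadata_station_name": "CARRIGAHORIG", "L1_ts_value": "0.351"},
    {"metadata_station_name": "PORTUMNA", "L1_ts_value": "701.200"},
]

for name, value in latest_readings(sample):
    print(f"{name}: {value:.3f}")
```

With the real data, passing the decoded list (data in the code above) in place of sample would work the same way, provided the field names match.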

