
Error scraping data from a website with “#” in the URL

I am trying to scrape data from a website whose URL contains a # sign (http://www.epa.ie/hydronet/#Water%20Levels) using Python, but when I parse the response as HTML I get the following error page instead of the data:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd">

<html><head><title>Bad Request</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/></head>
<body><h2>Bad Request - Invalid URL</h2>
<hr/><p>HTTP Error 400. The request URL is invalid.</p>
</body></html>

Any help appreciated.

The data on that page is loaded dynamically through Ajax, so it is not in the HTML you request. In the Firefox network inspector you can see that the page loads a lot of JSON data files, for example this one (warning, huge!):

import requests
from pprint import pprint

url = "http://www.epa.ie/Hydronet/output/internet/layers/10/index.json"

# Fetch one of the JSON files the page loads in the background and decode it.
data = requests.get(url).json()
pprint(data)

This prints data for about 3000 stations in this format:

...
 {'L1_CATCHMENT_SIZE': '0.00 km²',
  'L1_DATA_AVAILABLE': 'Water Level Only',
  'L1_LTA_RAINFALL_1961_1990': '',
  'L1_ObjectDescription': '',
  'L1_RESPONSIBLE_BODY': 'Waterways Ireland',
  'L1_STATION_OWNER': 'Waterways Ireland',
  'L1_TYPE_OF_GAUGING': 'Recorder',
  'L1_WEB_GW_height_system_suffix': '',
  'L1_WTO_OBJECT': 'GRAND CANAL',
  'L1_Web_Desc': '',
  'L1_Web_ELT_95PERCENTILE': '',
  'L1_Web_E_50PERCENTILE': '',
  'L1_Web_Legend': 'Active Waterways Ireland',
  'L1_Web_Link': '<p><strong><b><a '
                 'href="http://netview.ott.com/waterwaysireland-le/">CLICK '
                 'HERE for Waterways Ireland Station Data</a></strong><b></p>',
  'L1_admin_name': '---',
  'L1_area_name': '',
  'L1_label': 'Stage',
  'L1_req_timestamp': None,
  'L1_river_name': 'GRAND CANAL',
  'L1_station_GWREF_DATUM': '',
  'L1_station_gauge_datum': '0.0',
  'L1_station_gauge_datum_unit': '---',
  'L1_station_status': 'Active',
  'L1_stationparameter_name': 'Stage',
  'L1_stationparameter_no': 'S',
  'L1_timestamp': None,
  'L1_ts_id': 40185010,
  'L1_ts_name': 'StaffGaugeCheck',
  'L1_ts_unitsymbol': 'm',
  'L1_ts_value': None,
  'L1_web_type_gw': '',
  'L1_web_waterbody': '',
  'metadata_CATCHMENT_SIZE': '0.00 km²',
  'metadata_RESPONSIBLE_BODY': 'Waterways Ireland',
  'metadata_STATION_OWNER': 'Waterways Ireland',
  'metadata_TYPE_OF_GAUGING': 'Recorder',
  'metadata_WTO_OBJECT': 'GRAND CANAL',
  'metadata_Web_ELT_95PERCENTILE': '',
  'metadata_Web_E_50PERCENTILE': '',
  'metadata_Web_Legend': 'Active Waterways Ireland',
  'metadata_admin_name': '---',
  'metadata_area_name': '',
  'metadata_catchment_name': 'Shannon',
  'metadata_river_name': 'GRAND CANAL',
  'metadata_station_carteasting': '224943.99999999997',
  'metadata_station_cartnorthing': '225708.0000000003',
  'metadata_station_gauge_datum_unit': '---',
  'metadata_station_id': '3049391',
  'metadata_station_latitude': '53.281124634829155',
  'metadata_station_local_x': '224943.99999999997',
  'metadata_station_local_y': '225708.0000000003',
  'metadata_station_longitude': '-7.625991577700672',
  'metadata_station_name': 'KIRKWINS BR',
  'metadata_station_no': '25069',
  'metadata_station_status': 'Active'},

  ... and so on
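
Each record is a flat dictionary, so pulling out a few fields is straightforward. Here is a minimal sketch, assuming the key names shown in the sample above. The helper name station_summary and the trimmed sample record are only for illustration, and the coordinates are converted from strings to floats:

```python
def station_summary(record):
    """Return name, number, river and coordinates for one station record."""
    return {
        "name": record.get("metadata_station_name"),
        "number": record.get("metadata_station_no"),
        "river": record.get("metadata_river_name"),
        # Coordinates are strings in the sample above, so convert them.
        "latitude": float(record["metadata_station_latitude"]),
        "longitude": float(record["metadata_station_longitude"]),
    }


# Trimmed copy of the sample record printed above (illustration only).
sample = {
    "metadata_station_name": "KIRKWINS BR",
    "metadata_station_no": "25069",
    "metadata_river_name": "GRAND CANAL",
    "metadata_station_latitude": "53.281124634829155",
    "metadata_station_longitude": "-7.625991577700672",
}

print(station_summary(sample))
```

With the real data you would call station_summary on each element of data instead of on the hand-made sample.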

There are a few other data files as well; look in the network inspector to find their URLs.
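
As for the URL in the question: everything after the # is a fragment, which browsers handle on the client side and do not send to the server, so it never selects different content on the server. The sketch below uses only the standard library to split the two parts. Whether the fragment is what triggered the 400 response in the question is an assumption on my part, not something confirmed here:

```python
from urllib.parse import urldefrag

page_url = "http://www.epa.ie/hydronet/#Water%20Levels"

# urldefrag splits a URL into the part before '#' and the fragment after it.
base_url, fragment = urldefrag(page_url)

print(base_url)   # the part a browser actually requests from the server
print(fragment)   # the part kept on the client for the page's JavaScript
```

Requesting only the base URL would still return just the page shell, which is why the JSON files are the better target.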

Edit:

To print 'metadata_station_name' and 'L1_ts_value', you can use this code:

import requests

url = "http://epa.ie/Hydronet/output/internet/layers/20/index.json"

# The file decodes to a list with one dictionary per station.
data = requests.get(url).json()
for station in data:
    print(station['metadata_station_name'], station['L1_ts_value'])
    print('-' * 80)

Prints:

BALLYMAN None
--------------------------------------------------------------------------------
CARRIGAHORIG 0.351
--------------------------------------------------------------------------------
KILCOLGAN 0.668
--------------------------------------------------------------------------------
CASTLEMARTYR None
--------------------------------------------------------------------------------
BALLEA 0.376
--------------------------------------------------------------------------------
CAHERFINESKER 0.000
--------------------------------------------------------------------------------
PORTUMNA 701.200
--------------------------------------------------------------------------------
... and so on.
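
As the output shows, stations without a current reading come back as None. A hedged sketch for keeping only the stations that have a value and converting it to a number follows. It assumes 'L1_ts_value' is either None or something float() accepts (the exact JSON type is not shown above), and the function name and sample list are made up for illustration:

```python
def stations_with_readings(stations):
    """Return (name, value) pairs for stations whose L1_ts_value is set."""
    readings = []
    for station in stations:
        value = station.get("L1_ts_value")
        if value is None:
            continue  # no current reading for this station
        readings.append((station["metadata_station_name"], float(value)))
    return readings


# Hand-made sample mirroring a few lines of the output above.
sample = [
    {"metadata_station_name": "BALLYMAN", "L1_ts_value": None},
    {"metadata_station_name": "CARRIGAHORIG", "L1_ts_value": "0.351"},
    {"metadata_station_name": "KILCOLGAN", "L1_ts_value": 0.668},
]

for name, value in stations_with_readings(sample):
    print(name, value)
```

With the real data, pass the decoded list (data in the code above) instead of the sample.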
