Error Scraping data from a website with “#” in the URL

Question

I am trying to scrape data from a website whose URL contains a # sign (http://www.epa.ie/hydronet/#Water%20Levels) using Python, but when I parse the response as HTML I get the following error page:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN""http://www.w3.org/TR/html4/strict.dtd">

<html><head><title>Bad Request</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/></head>
<body><h2>Bad Request - Invalid URL</h2>
<hr/><p>HTTP Error 400. The request URL is invalid.</p>
</body></html>

Any help appreciated.

Answer 1

The data on that page are loaded dynamically through Ajax. The Firefox network inspector shows that the page loads many JSON data files, for example this one (warning, huge!):

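import requests
from pprint import pprint

url = "http://www.epa.ie/Hydronet/output/internet/layers/10/index.json"

# Fetch the layer index and decode the JSON body into Python objects
response = requests.get(url)
response.raise_for_status()  # stop here on an HTTP error status such as 400
data = response.json()
pprint(data)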

This prints data for about 3000 stations in this format:

...
 {'L1_CATCHMENT_SIZE': '0.00 km²',
  'L1_DATA_AVAILABLE': 'Water Level Only',
  'L1_LTA_RAINFALL_1961_1990': '',
  'L1_ObjectDescription': '',
  'L1_RESPONSIBLE_BODY': 'Waterways Ireland',
  'L1_STATION_OWNER': 'Waterways Ireland',
  'L1_TYPE_OF_GAUGING': 'Recorder',
  'L1_WEB_GW_height_system_suffix': '',
  'L1_WTO_OBJECT': 'GRAND CANAL',
  'L1_Web_Desc': '',
  'L1_Web_ELT_95PERCENTILE': '',
  'L1_Web_E_50PERCENTILE': '',
  'L1_Web_Legend': 'Active Waterways Ireland',
  'L1_Web_Link': '<p><strong><b><a '
                 'href="http://netview.ott.com/waterwaysireland-le/">CLICK '
                 'HERE for Waterways Ireland Station Data</a></strong><b></p>',
  'L1_admin_name': '---',
  'L1_area_name': '',
  'L1_label': 'Stage',
  'L1_req_timestamp': None,
  'L1_river_name': 'GRAND CANAL',
  'L1_station_GWREF_DATUM': '',
  'L1_station_gauge_datum': '0.0',
  'L1_station_gauge_datum_unit': '---',
  'L1_station_status': 'Active',
  'L1_stationparameter_name': 'Stage',
  'L1_stationparameter_no': 'S',
  'L1_timestamp': None,
  'L1_ts_id': 40185010,
  'L1_ts_name': 'StaffGaugeCheck',
  'L1_ts_unitsymbol': 'm',
  'L1_ts_value': None,
  'L1_web_type_gw': '',
  'L1_web_waterbody': '',
  'metadata_CATCHMENT_SIZE': '0.00 km²',
  'metadata_RESPONSIBLE_BODY': 'Waterways Ireland',
  'metadata_STATION_OWNER': 'Waterways Ireland',
  'metadata_TYPE_OF_GAUGING': 'Recorder',
  'metadata_WTO_OBJECT': 'GRAND CANAL',
  'metadata_Web_ELT_95PERCENTILE': '',
  'metadata_Web_E_50PERCENTILE': '',
  'metadata_Web_Legend': 'Active Waterways Ireland',
  'metadata_admin_name': '---',
  'metadata_area_name': '',
  'metadata_catchment_name': 'Shannon',
  'metadata_river_name': 'GRAND CANAL',
  'metadata_station_carteasting': '224943.99999999997',
  'metadata_station_cartnorthing': '225708.0000000003',
  'metadata_station_gauge_datum_unit': '---',
  'metadata_station_id': '3049391',
  'metadata_station_latitude': '53.281124634829155',
  'metadata_station_local_x': '224943.99999999997',
  'metadata_station_local_y': '225708.0000000003',
  'metadata_station_longitude': '-7.625991577700672',
  'metadata_station_name': 'KIRKWINS BR',
  'metadata_station_no': '25069',
  'metadata_station_status': 'Active'},

  ... and so on

There are a few other data files; look at the network inspector to find their URLs.
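
On the "#" itself: the part of a URL after "#" is a fragment, which a browser handles on the client side rather than sending to the server. One plausible cause of the 400 response is that the fragment ended up in the request. The following is a minimal sketch, using only the standard library, of separating the fragment from the rest of the URL before making a request. Whether this alone removes the error on this site is an assumption, and the base page would still not contain the Ajax-loaded data.

```python
from urllib.parse import urldefrag

page_url = "http://www.epa.ie/hydronet/#Water%20Levels"

# urldefrag splits a URL into the part before "#" and the fragment after it
base_url, fragment = urldefrag(page_url)

print(base_url)   # http://www.epa.ie/hydronet/
print(fragment)   # Water%20Levels
```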

Edit:

To print 'metadata_station_name' and 'L1_ts_value', you can use this code:

import requests

url = "http://epa.ie/Hydronet/output/internet/layers/20/index.json"

# The decoded JSON is a list with one dict per station
data = requests.get(url).json()
for station in data:
    print(station['metadata_station_name'], station['L1_ts_value'])
    print('-' * 80)

Prints:

BALLYMAN None
--------------------------------------------------------------------------------
CARRIGAHORIG 0.351
--------------------------------------------------------------------------------
KILCOLGAN 0.668
--------------------------------------------------------------------------------
CASTLEMARTYR None
--------------------------------------------------------------------------------
BALLEA 0.376
--------------------------------------------------------------------------------
CAHERFINESKER 0.000
--------------------------------------------------------------------------------
PORTUMNA 701.200
--------------------------------------------------------------------------------
... and so on.
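
Several stations print None because they have no current reading. As a rough sketch, the records could be filtered and the readings converted to numbers as shown here. The sample list and the helper name stations_with_values are hypothetical, and the assumption that readings arrive as strings such as '0.351' is inferred from the printed output, not confirmed.

```python
# Hypothetical sample records using the field names from the output above;
# the real list would be the JSON decoded from the layer URL.
sample = [
    {"metadata_station_name": "BALLYMAN", "L1_ts_value": None},
    {"metadata_station_name": "CARRIGAHORIG", "L1_ts_value": "0.351"},
    {"metadata_station_name": "PORTUMNA", "L1_ts_value": "701.200"},
]


def stations_with_values(records):
    """Return (name, reading) pairs for stations that report a reading."""
    result = []
    for station in records:
        raw = station.get("L1_ts_value")
        if raw is None or raw == "":
            continue  # skip stations without a current reading
        # float() accepts both numeric strings and numbers (assumed formats)
        result.append((station["metadata_station_name"], float(raw)))
    return result


for name, value in stations_with_values(sample):
    print(name, value)
```

Passing the decoded list from the request above in place of sample should give one pair per station that has a reading.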
