I'm unable to parse the HTML on this website correctly: https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=332857117043301
I want to extract the line "Latitude 34°02'48.57", Longitude 117°02'09.16". While this shows up in the page source (web developer tools) on line 862, it doesn't show up when I parse the page with BeautifulSoup. Using the lxml parser does not produce the desired result either.
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=340248117020902')
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
My print statement of the page content does not show the latitude/longitude line. How do I adjust my code to scrape this information?
import requests
from bs4 import BeautifulSoup
html = requests.get('https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=340248117020902')
soup = BeautifulSoup(html.text, 'lxml')
data = soup.find_all('div', attrs={'align': 'left'})
latitude = ''.join(x.contents[0].split(',')[0] for x in data if 'Latitude' in x.contents[0])
longitude = ''.join(x.contents[0].split(',')[1].strip().replace('\n', '') for x in data if 'Longitude' in x.contents[0])
print(latitude)
print(longitude)
Output:
Latitude 34°02'48.57"
Longitude 117°02'09.16" NAD83
How are you searching for that specific content? You can find the data using .findAll('div') and then searching for "Latitude" in the tags' text:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=340248117020902')
soup = BeautifulSoup(page.content, 'html.parser')
divs = soup.findAll('div')
texts = [div.text for div in divs]
for text in texts:
    if "Latitude" in text:
        data = text
This results in a string that needs only a little parsing to get the numbers and assign them to variables:
>>> print(data)
Latitude 34°02'48.57", Longitude 117°02'09.16"
NAD83
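That remaining parsing step could look roughly like the sketch below. It assumes the string has the shape printed above; the text is hard-coded as `sample` so the snippet runs without a network call, and the helper name `parse_dms` is made up for illustration.

```python
import re

# Text as printed above; with the live page this would be `data`.
sample = 'Latitude 34°02\'48.57", Longitude 117°02\'09.16"\nNAD83'

# label, optional whitespace (\s also matches a non-breaking space),
# then degrees, minutes and seconds
DMS_PATTERN = r'{}\s*(\d+)°(\d+)\'(\d+(?:\.\d+)?)"'

def parse_dms(label, text):
    """Return (degrees, minutes, seconds) for the given label, or None."""
    match = re.search(DMS_PATTERN.format(label), text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))

latitude = parse_dms('Latitude', sample)
longitude = parse_dms('Longitude', sample)
print(latitude)   # (34, 2, 48.57)
print(longitude)  # (117, 2, 9.16)
```

Returning `None` when the label is missing lets the caller decide what to do if the page layout changes.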
That page is a pure mess... just use a regexp (working example, Python 2):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import re
def find(prefix, string):
    return re.search("{} (?:\s+|)(\d+)\&\#176\;(\d+)\'(\d+)\.(\d+)\"".format(prefix), string)
def format_result(result):
    return "{}°{}'{}.{}\"".format(
        result.group(1),
        result.group(2),
        result.group(3),
        result.group(4)
    )
page = requests.get('https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=340248117020902')
found_lat = find('Latitude', page.content)
found_lon = find('Longitude', page.content)
if found_lat and found_lon:
    latitude = format_result(found_lat)
    longitude = format_result(found_lon)
    print('Cords: {} {}'.format(latitude, longitude))
Result:
Cords: 34°02'48.57" 117°02'09.16"
As you can see, you can get each number just like this from found_lat or found_lon:
print(found_lat.group(1)) # 34
print(found_lat.group(2)) # 02
print(found_lat.group(3)) # 48
print(found_lat.group(4)) # 57
Or the full latitude or longitude like this:
print(latitude) # 34°02'48.57"
print(longitude) # 117°02'09.16"
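If decimal degrees are more useful than the formatted string, the four captured groups can be combined as in this rough sketch. The function name `groups_to_decimal` is hypothetical, and the group values are hard-coded from the output above rather than fetched. The page text shown here carries no hemisphere sign, so the result is left unsigned.

```python
def groups_to_decimal(deg, minutes, sec_whole, sec_frac):
    """Convert the four captured strings into unsigned decimal degrees."""
    # Groups 3 and 4 are the whole and fractional parts of the seconds
    seconds = float('{}.{}'.format(sec_whole, sec_frac))
    return int(deg) + int(minutes) / 60.0 + seconds / 3600.0

# Values as captured above, i.e. found_lat.group(1..4) and found_lon.group(1..4)
lat = groups_to_decimal('34', '02', '48', '57')
lon = groups_to_decimal('117', '02', '09', '16')
print('{:.6f} {:.6f}'.format(lat, lon))
```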
It is there. If you run the following code, you will get the Latitude, and you can replicate it for Longitude.
divs = soup.find_all('div')
lat_index = str(divs).find("Latitude")
lat = str(divs)[lat_index:lat_index + 22]  # 'Latitude\xa0 34°02\'48.57"'
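Replicating this for longitude with a fixed width is fragile, because the degrees can have two or three digits. One possible variation is to slice up to the closing seconds mark instead. In this sketch, `slice_field` is an illustrative name, and `sample` is an assumed stand-in for `str(divs)` modelled on the latitude text shown above.

```python
def slice_field(text, label):
    """Return the text from `label` through the next double quote, or None."""
    start = text.find(label)
    if start == -1:
        return None
    # Assumes the first '"' after the label is the seconds mark
    end = text.find('"', start)
    if end == -1:
        return None
    return text[start:end + 1]

# Assumed stand-in for str(divs); on the live page pass str(divs) instead.
sample = 'Latitude\xa0 34°02\'48.57", Longitude\xa0 117°02\'09.16"\n NAD83'

lat = slice_field(sample, 'Latitude')
lon = slice_field(sample, 'Longitude')
print(lat)
print(lon)
```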