
Parsing HTML on Website for Scraping

I'm unable to parse the HTML on this website correctly: https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=332857117043301

I want to extract the line "Latitude 34°02'48.57", Longitude 117°02'09.16"". It appears in the page source (browser developer tools) on line 862, but it does not appear when I parse the page with BeautifulSoup. Switching to the lxml parser does not help either.

import requests
import re
from bs4 import BeautifulSoup

page = requests.get('https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=340248117020902')
soup = BeautifulSoup(page.content, 'html.parser')

print(soup.prettify())

My print statement of the page content does not show the latitude/longitude line. How do I adjust my code to scrape this information?

import requests
from bs4 import BeautifulSoup

html = requests.get('https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=340248117020902')
soup = BeautifulSoup(html.text, 'lxml')

data = soup.find_all('div', attrs={'align': 'left'})

latitude = ''.join(x.contents[0].split(',')[0] for x in data if 'Latitude' in x.contents[0])
longitude = ''.join(x.contents[0].split(',')[1].strip().replace('\n', '') for x in data if 'Longitude' in x.contents[0])

print(latitude)
print(longitude)

Output:

Latitude  34°02'48.57" 
Longitude 117°02'09.16" NAD83

How are you searching for that specific content? You can find the data using .findAll('div') and then searching for "Latitude" in the tags' text:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=340248117020902')
soup = BeautifulSoup(page.content, 'html.parser')

divs = soup.findAll('div')
texts = [div.text for div in divs]

for text in texts:
    if "Latitude" in text:
        data = text        

This results in a string that needs only a little more parsing to extract the numbers and assign them to variables:

>>> print(data)
Latitude  34°02'48.57", Longitude 117°02'09.16"
 NAD83
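
One way to do that remaining parsing, as a minimal sketch. The regular expression and the helper name parse_dms are my own illustration, not part of the answer above, and the sample string is copied from the printed output rather than fetched live. The real page text may contain a non-breaking space after "Latitude", which \s also matches for str patterns in Python 3.

```python
import re

# Sample string copied from the printed output above (assumption: the live
# page text has the same shape).
data = 'Latitude  34°02\'48.57", Longitude 117°02\'09.16"\n NAD83'

# Label, optional whitespace, then degrees, minutes and seconds.
DMS = re.compile(r'(Latitude|Longitude)\s*(\d+)°(\d+)\'(\d+(?:\.\d+)?)"')


def parse_dms(text):
    """Return {label: (degrees, minutes, seconds)} for each coordinate found."""
    coords = {}
    for label, deg, minutes, sec in DMS.findall(text):
        coords[label] = (int(deg), int(minutes), float(sec))
    return coords


coords = parse_dms(data)
print(coords['Latitude'])   # (34, 2, 48.57)
print(coords['Longitude'])  # (117, 2, 9.16)
```

Keeping the three parts as numbers makes it easy to convert to decimal degrees later if that is what you need.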

That page's markup is a mess, so just use a regular expression (working example, Python 2):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
import re


def find(prefix, string):
    return re.search(r"{} (?:\s+|)(\d+)\&\#176\;(\d+)\'(\d+)\.(\d+)\"".format(prefix), string)


def format_result(result):
    return "{}°{}'{}.{}\"".format(
        result.group(1),
        result.group(2),
        result.group(3),
        result.group(4)
    )

page = requests.get('https://nwis.waterdata.usgs.gov/usa/nwis/gwlevels/?site_no=340248117020902')
found_lat = find('Latitude', page.content)
found_lon = find('Longitude', page.content)
if found_lat and found_lon:
    latitude = format_result(found_lat)
    longitude = format_result(found_lon)
    print('Coords: {} {}'.format(latitude, longitude))

Result:

Coords: 34°02'48.57" 117°02'09.16"

As you can see, you can get each number just like this from found_lat or found_lon:

print(found_lat.group(1)) # 34
print(found_lat.group(2)) # 02
print(found_lat.group(3)) # 48
print(found_lat.group(4)) # 57

Or the full latitude or longitude string like this:

print(latitude) # 34°02'48.57"
print(longitude) # 117°02'09.16"
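
The example above is marked as Python 2. Under Python 3, page.content is bytes, and re.search raises a TypeError when a str pattern is used on bytes, so you would work on page.text instead. Below is a rough Python 3 sketch of the same idea that first decodes the HTML entities with html.unescape and then matches the degree sign directly. The raw fragment is a hypothetical stand-in for the page source (I have not copied it from the live page), and find_coordinate is an illustrative name.

```python
import html
import re

# Hypothetical fragment standing in for the raw page source (page.text).
raw = ('<div align="left">Latitude &nbsp;34&#176;02\'48.57", '
       'Longitude 117&#176;02\'09.16" &nbsp; NAD83</div>')


def find_coordinate(prefix, source):
    """Return the DD°MM'SS.ss" string that follows prefix, or None."""
    text = html.unescape(source)  # &#176; becomes °, &nbsp; becomes \xa0
    match = re.search(re.escape(prefix) + r'\s*(\d+°\d+\'\d+\.\d+")', text)
    return match.group(1) if match else None


print(find_coordinate('Latitude', raw))   # 34°02'48.57"
print(find_coordinate('Longitude', raw))  # 117°02'09.16"
```

Unescaping first means the pattern does not care whether the page writes the degree sign as an entity or as a literal character.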

It is there. If you run the following code, you will get the Latitude, and you can replicate it for Longitude.

divs = soup.find_all('div')
lat_index = str(divs).find("Latitude")
lat = str(divs)[lat_index:lat_index+22]  # 'Latitude\xa0 34°02\'48.57"'
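
The fixed width of 22 characters only fits this particular latitude string. Here is a sketch that avoids the hard-coded length by slicing up to the closing double quote, so the same helper works for the longitude. slice_coordinate is a hypothetical helper, and the sample text is a stand-in for str(divs), assumed to look like the output shown above.

```python
# Stand-in for str(divs); in the real script this comes from the parsed page.
text = ('<div align="left">Latitude\xa0 34°02\'48.57", '
        'Longitude 117°02\'09.16"\n NAD83</div>')


def slice_coordinate(text, label):
    """Return the substring from label through the closing double quote."""
    start = text.find(label)
    if start == -1:
        return None
    end = text.find('"', start)  # the seconds mark ends the coordinate
    if end == -1:
        return None
    return text[start:end + 1]


lat = slice_coordinate(text, 'Latitude')
lon = slice_coordinate(text, 'Longitude')
print(lon)  # Longitude 117°02'09.16"
```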


 