Error scraping a JS-generated table with Beautiful Soup

I am trying to scrape a table in Python 2.7 using Beautiful Soup and/or Selenium (no pandas, no lxml). Specific columns from the table need to be written to a CSV file. I have looked at most of the similar questions (12548793, 30734963, 33448974, 32434378 and more), but nothing has worked for me so far. This is my first attempt at scraping anything, so I won't pretend that I understand half of what I am doing.
The code below works somewhat:

import urllib2
import bs4
from bs4 import BeautifulSoup
import csv

url = "http://data.dnr.nebraska.gov/RealTime/Gage/Index?StationSource=1&StationType=3&RiverBasin=" 

page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")

#get table headers for the columns of interest
#Data of interest:['Station_Name', 'Station_number', 'Date_time', 'Stage', 'Discharge'])

table1 = soup.find("table", id="StationNames")
ths = table1.findAll('th')
headers = (ths[0].text, ths[1].text, ths[2].text, ths[3].text, ths[4].text)

#print headers
#get measurements
table = soup.find_all('table', {"class":"btn-NDNR BlueUnderline"})
for tr in soup.find_all('tr')[2:]:
    tds = tr.find_all('td')
    ncontent =(tds[0].text, tds[1].text, tds[2].text, tds[3].text, tds[4].text)
    #print ncontent
#write the csv file

with open('E:/test/nebraska.csv', 'a') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(headers)
        writer.writerow(ncontent)
        #writer.writerow([value.get_text(strip=True).encode("utf-8") for value in ncontent])  

Except that the CSV file is empty, and when I print the values, this is what I get:

 (u'\r\n                                Station Name\r\n                            ', u'\r\n                                Station Number\r\n                            ', u'\r\n                                Date Time (UTC)\r\n                            ', u'\r\n                                Stage\r\n                            ', u'\r\n                                Discharge\r\n                            ')
    (u'\nBig Blue River at Beatrice - NDNR ', u'\r\n                                            6881500\r\n                                        ', u'\r\n                                            01/05/2016 14:45 \r\n                                        ', u'\r\n                                            4.27\r\n                                        ', u'\r\n                                            524.62\r\n                                        ')  

Also, is there a more efficient and faster way of doing this?
Thank you in advance - any help will be greatly appreciated.

Several errors:

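  1. You need to strip the surrounding whitespace from every cell's text, for example tds[0].text.strip().
  2. You only write the last row of the table. The ncontent variable is overwritten on every pass through the loop, and writer.writerow(ncontent) runs once, after the loop has finished. Either collect each row in a list inside the loop or write each row inside the loop.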

Fix the errors and you will be good to go.
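
The two fixes can be sketched as follows. This is only an illustration under some assumptions: it runs on Python 3 using the standard library alone, it starts from the raw cell text printed in the question instead of fetching the page, and the names clean_row, raw_headers and raw_rows are made up for the example. With Beautiful Soup, each list of cells would come from something like [td.text for td in tr.find_all('td')].

```python
import csv
import io


def clean_row(cells, ncols=5):
    """Return the first ncols cell texts with surrounding whitespace removed."""
    return [text.strip() for text in cells[:ncols]]


# Raw cell text as printed in the question (inner padding shortened).
raw_headers = [
    u'\r\n    Station Name\r\n    ',
    u'\r\n    Station Number\r\n    ',
    u'\r\n    Date Time (UTC)\r\n    ',
    u'\r\n    Stage\r\n    ',
    u'\r\n    Discharge\r\n    ',
]
raw_rows = [
    [
        u'\nBig Blue River at Beatrice - NDNR ',
        u'\r\n    6881500\r\n    ',
        u'\r\n    01/05/2016 14:45 \r\n    ',
        u'\r\n    4.27\r\n    ',
        u'\r\n    524.62\r\n    ',
    ],
]

# Fix 2: append every row to a list instead of overwriting one variable.
rows = []
for cells in raw_rows:
    rows.append(clean_row(cells))  # Fix 1: strip each cell

# Write the header once, then all collected rows.
# An in-memory buffer stands in for the output file here.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(clean_row(raw_headers))
writer.writerows(rows)
csv_text = out.getvalue()
```

To write to disk on Python 3, open the file with open(path, 'w', newline='') and pass it to csv.writer in place of the buffer. On Python 2.7, as used in the question, the csv module expects the file to be opened in binary mode ('wb').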
