I used the beautifulsoup library to get data from a webpage, http://open.dataforcities.org/details?4[]=2016:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://open.dataforcities.org/details?4[]=2016').read())
Now soup looks like the following (I show just a part of it):
soup('table'):
[<table>\n<tr class="theme-cells" id="profile_indicators" ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )'>\n<td class="theme-text">\n<h1>4 Profile Indicators</h1>\n</td>\n<td class="metrics">\n<div class="metric-p metric-title"></div>\n</td>\n</tr>\n<tr class="indicator-cells" ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )' onmouseout="$(this).removeClass('indicator-cells-hover')" onmouseover="$(this).addClass('indicator-cells-hover')">\n<td class="indicator-text">\n<h2>4.1 Total city population (Profile)</h2>\n</td>\n<td class="metrics">\n<div class="metric-p metric-title"></div>\n<div class="metric-p-also bigger">669 469 (2015)</div>\n<div class="full-bar" style="width:100%">\n<div class="metric-bar" style="width:3.6411942141077174%; background-color:#ffffff"></div>\n</div>\n</td>\n</tr>\n<tr class="indicator-cells" ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )' onmouseout="$(this).removeClass('indicator-cells-hover')" onmouseover="$(this).addClass('indicator-cells-hover')">\n<td class="indicator-text">\n<h2>4.2 City land area (Profile)</h2>\n</td>\n<td class="metrics">\n<div class="metric-p metric-title"></div>\n<div class="metric-p-also bigger">125 km\xb2 (2010)</div>\n<div class="full-bar" style="width:100%">\n<div class="metric-bar" style="width:1.9604120789229098%; background-color:#ffffff"></div>\n</div>\n</td>\n</tr>\n<tr class="indicator-cells" ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )' onmouseout="$(this).removeClass('indicator-cells-hover')" onmouseover="$(this).addClass('indicator-cells-hover')">\n<td class="indicator-text">\n<h2>4.3 Population density (Profile)</h2>\n</td>\n<td class="metrics">\n<div class="metric-p metric-title"></div>\n<div class="metric-p-also bigger">5 354 /km\xb2 (2015)</div>\n<div class="full-bar" style="width:100%">\n<div class="metric-bar" style="width:27.890485963282238%; background-color:#ffffff"></div>\n</div>\n</td>\n</tr>\n<tr class="indicator-cells" 
ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )'
How can I extract data from soup? If I follow the example in Web scraping with Python, I get the following error:
soup = BeautifulSoup(urllib2.urlopen('http://open.dataforcities.org/details?4[]=2016').read())
for row in soup('table', {'class': 'metrics'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
IndexError Traceback (most recent call last)
<ipython-input-71-d688ff354182> in <module>()
----> 1 for row in soup('table', {'class': 'metrics'})[0].tbody('tr'):
2 tds = row('td')
3 print tds[0].string, tds[1].string
IndexError: list index out of range
The table in your HTML has no 'metrics' class, so your expression ('table.metrics') returns an empty list, which gives you an IndexError when you try to select the first item.
Since there is only one table on the page, and it has no attributes, you can get all the rows with this expression: 'table tr'
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://open.dataforcities.org/details?4[]=2016').read()
soup = BeautifulSoup(html, 'html.parser')
for row in soup.select('table tr'):
    tds = row('td')
    print tds[0].text.strip(), tds[1].text.strip()
Also make sure to use bs4 instead of bs3, and if possible update to Python 3.
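For reference, a Python 3 sketch of the same idea that pairs each indicator name with its value. It runs on an inline fragment copied from the question's output instead of the live page. The function name `extract_indicators` is hypothetical, and the choice to read the `h2` and the `metric-p-also` div is an assumption based on the markup shown in the question.

```python
from bs4 import BeautifulSoup

# Fragment modelled on the question's output: one header row, two indicator rows.
html = """
<table>
<tr class="theme-cells" id="profile_indicators">
<td class="theme-text"><h1>4 Profile Indicators</h1></td>
<td class="metrics"><div class="metric-p metric-title"></div></td>
</tr>
<tr class="indicator-cells">
<td class="indicator-text"><h2>4.1 Total city population (Profile)</h2></td>
<td class="metrics">
<div class="metric-p metric-title"></div>
<div class="metric-p-also bigger">669 469 (2015)</div>
</td>
</tr>
<tr class="indicator-cells">
<td class="indicator-text"><h2>4.2 City land area (Profile)</h2></td>
<td class="metrics">
<div class="metric-p metric-title"></div>
<div class="metric-p-also bigger">125 km\u00b2 (2010)</div>
</td>
</tr>
</table>
"""


def extract_indicators(markup):
    """Return a list of (name, value) tuples, one per indicator row."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for row in soup.select("table tr.indicator-cells"):
        name = row.find("h2")
        value = row.find("div", class_="metric-p-also")
        # Skip rows that lack either piece instead of failing on None.
        if name is None or value is None:
            continue
        pairs.append((name.get_text(strip=True), value.get_text(strip=True)))
    return pairs


indicators = extract_indicators(html)
for name, value in indicators:
    print(f"{name}: {value}")
```

Selecting `tr.indicator-cells` leaves out the section header row, which has an `h1` and no value.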
This code extracts your data and saves it into a CSV file for you to access (by the way, I feel like your data is incomplete). I would recommend opening that link and saving the page as an HTML file, because I got a UnicodeEncodeError when I tried to fetch it with urlopen.
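from bs4 import BeautifulSoup
import csv
# Parse the page that was saved locally as an HTML file
with open("Yourfile.html") as page:
    soup = BeautifulSoup(page, "html.parser")
# Write the text of every <h2> (the indicator names) to a CSV file
with open("file.csv", "w") as out:
    f = csv.writer(out)
    f.writerow(["Information"])
    for h2 in soup.find_all("h2"):
        name = h2.contents[0]
        f.writerow([name])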
By the way, in case you still want to use urlopen: urllib2 no longer exists in Python 3, so it is actually
from urllib.request import urlopen
html = urlopen('http://open.dataforcities.org/details?4[]=2016').read()
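The UnicodeEncodeError mentioned above probably comes from non-ASCII characters such as the ² in "km²" being written with a default encoding that cannot represent them. That is a guess, since the traceback is not shown. A sketch, assuming Python 3, that decodes the response explicitly and opens the CSV with an explicit UTF-8 encoding follows. The function names and the output path are hypothetical, UTF-8 is an assumption about how the site serves the page, and the download is wrapped in a function so nothing is fetched unless it is called.

```python
import csv
import os
import tempfile
from urllib.request import urlopen

URL = "http://open.dataforcities.org/details?4[]=2016"


def fetch_html(url=URL):
    """Download the page and decode it (UTF-8 is an assumption about the site)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def write_rows(path, rows):
    """Write (indicator, value) rows to a CSV file encoded as UTF-8."""
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Indicator", "Value"])
        writer.writerows(rows)


# Example with a value copied from the question; no network access is needed.
output_path = os.path.join(tempfile.gettempdir(), "indicators.csv")
write_rows(output_path, [("4.2 City land area (Profile)", "125 km\u00b2 (2010)")])
```

The decoded string returned by `fetch_html()` can be passed to `BeautifulSoup(..., "html.parser")` in place of the file handle used above.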