Python: how to extract data from a text?

I used the BeautifulSoup library to get data from a webpage:

http://open.dataforcities.org/details?4[]=2016

import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://open.dataforcities.org/details?4[]=2016').read())

Now soup looks like the following (I show just a part of it):

soup('table'):
[<table>\n<tr class="theme-cells" id="profile_indicators" ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )'>\n<td class="theme-text">\n<h1>4 Profile Indicators</h1>\n</td>\n<td class="metrics">\n<div class="metric-p metric-title"></div>\n</td>\n</tr>\n<tr class="indicator-cells" ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )' onmouseout="$(this).removeClass('indicator-cells-hover')" onmouseover="$(this).addClass('indicator-cells-hover')">\n<td class="indicator-text">\n<h2>4.1 Total city population (Profile)</h2>\n</td>\n<td class="metrics">\n<div class="metric-p metric-title"></div>\n<div class="metric-p-also bigger">669 469   (2015)</div>\n<div class="full-bar" style="width:100%">\n<div class="metric-bar" style="width:3.6411942141077174%; background-color:#ffffff"></div>\n</div>\n</td>\n</tr>\n<tr class="indicator-cells" ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )' onmouseout="$(this).removeClass('indicator-cells-hover')" onmouseover="$(this).addClass('indicator-cells-hover')">\n<td class="indicator-text">\n<h2>4.2 City land area (Profile)</h2>\n</td>\n<td class="metrics">\n<div class="metric-p metric-title"></div>\n<div class="metric-p-also bigger">125 km\xb2 (2010)</div>\n<div class="full-bar" style="width:100%">\n<div class="metric-bar" style="width:1.9604120789229098%; background-color:#ffffff"></div>\n</div>\n</td>\n</tr>\n<tr class="indicator-cells" ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )' onmouseout="$(this).removeClass('indicator-cells-hover')" onmouseover="$(this).addClass('indicator-cells-hover')">\n<td class="indicator-text">\n<h2>4.3 Population density (Profile)</h2>\n</td>\n<td class="metrics">\n<div class="metric-p metric-title"></div>\n<div class="metric-p-also bigger">5 354 /km\xb2 (2015)</div>\n<div class="full-bar" style="width:100%">\n<div class="metric-bar" style="width:27.890485963282238%; background-color:#ffffff"></div>\n</div>\n</td>\n</tr>\n<tr class="indicator-cells" 
ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )'

How can I extract data from soup? If I follow the example in Web scraping with Python, I get the following error:

soup = BeautifulSoup(urllib2.urlopen('http://open.dataforcities.org/details?4[]=2016').read())

for row in soup('table', {'class': 'metrics'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string

IndexError                                Traceback (most recent call last)
<ipython-input-71-d688ff354182> in <module>()
----> 1 for row in soup('table', {'class': 'metrics'})[0].tbody('tr'):
      2     tds = row('td')
      3     print tds[0].string, tds[1].string

IndexError: list index out of range

The table in your HTML has no 'metrics' class (that class is on the td cells), so your expression (equivalent to the selector 'table.metrics') returns an empty list, and selecting the first item with [0] raises the IndexError.
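To see this for yourself, here is a minimal sketch in Python 3 syntax. The `html` string is an abbreviated stand-in for the markup quoted in the question, not the full page, and the variable names are only illustrative. It also shows one way to guard the `[0]` lookup:

```python
from bs4 import BeautifulSoup

# Abbreviated stand-in for the page markup quoted in the question.
html = """
<table>
<tr class="indicator-cells">
<td class="indicator-text"><h2>4.1 Total city population (Profile)</h2></td>
<td class="metrics"><div class="metric-p-also bigger">669 469   (2015)</div></td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Calling soup(...) is shorthand for soup.find_all(...)
tables_with_class = soup("table", {"class": "metrics"})
cells_with_class = soup("td", {"class": "metrics"})

print(len(tables_with_class))  # no <table> carries the class
print(len(cells_with_class))   # the class is on the <td> cells

# Guard before indexing so an empty result does not raise IndexError
first_table = tables_with_class[0] if tables_with_class else None
```

Checking the length of the result before indexing usually makes this kind of mismatch obvious right away.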

Since there is only one table on the page, and it has no attributes, you can get all the rows with this expression: 'table tr'

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://open.dataforcities.org/details?4[]=2016').read()
soup = BeautifulSoup(html, 'html.parser')

for row in soup.select('table tr'):
    tds = row('td')
    print tds[0].text.strip(), tds[1].text.strip()

Also make sure to use bs4 instead of bs3, and if possible update to Python 3.
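If you want each indicator name paired with its value, something along these lines might work in Python 3. It is only a sketch: the `html` string is a trimmed copy of the markup from the question, and I am assuming every data row has the `indicator-cells` class and keeps its value in the `metric-p-also` div, as in the excerpt. The `records` list is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted in the question (not the full page).
html = """
<table>
<tr class="theme-cells" id="profile_indicators">
<td class="theme-text"><h1>4 Profile Indicators</h1></td>
<td class="metrics"><div class="metric-p metric-title"></div></td>
</tr>
<tr class="indicator-cells">
<td class="indicator-text"><h2>4.1 Total city population (Profile)</h2></td>
<td class="metrics">
<div class="metric-p metric-title"></div>
<div class="metric-p-also bigger">669 469   (2015)</div>
</td>
</tr>
<tr class="indicator-cells">
<td class="indicator-text"><h2>4.2 City land area (Profile)</h2></td>
<td class="metrics">
<div class="metric-p metric-title"></div>
<div class="metric-p-also bigger">125 km² (2010)</div>
</td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
# Select only the data rows; the "theme-cells" header row is skipped.
for row in soup.select("table tr.indicator-cells"):
    name = row.find("h2").get_text(strip=True)
    value_div = row.find("div", class_="metric-p-also")
    # Collapse runs of whitespace inside values such as "669 469   (2015)"
    value = " ".join(value_div.get_text().split()) if value_div else ""
    records.append((name, value))

for name, value in records:
    print(name, "|", value)
```

Selecting by row class rather than by position means the section header row does not need special handling.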

This code extracts the indicator names and saves them to a CSV file for you to access (by the way, I feel like your data is incomplete). I would recommend opening the link in a browser and saving the page as an HTML file, because you get a UnicodeEncodeError if you try to use urlopen to fetch it.

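from bs4 import BeautifulSoup
import csv

# Parse the locally saved copy of the page
with open("Yourfile.html", encoding="utf-8") as page:
    soup = BeautifulSoup(page, "html.parser")

# Write each <h2> indicator name as one CSV row
with open("file.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["Information"])
    for h2 in soup.find_all("h2"):
        writer.writerow([h2.get_text(strip=True)])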
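If you also want the values next to the names, a two-column CSV could be written roughly like this. This is a sketch under assumptions: the `rows` list is copied by hand from the markup in the question (in practice it would come from parsing the page), and `csv_path` is a hypothetical output location. Opening the file with an explicit UTF-8 encoding keeps characters such as "²" intact.

```python
import csv
import os
import tempfile

# Name/value pairs copied from the markup quoted in the question;
# in practice they would come from parsing the page.
rows = [
    ("4.1 Total city population (Profile)", "669 469 (2015)"),
    ("4.2 City land area (Profile)", "125 km² (2010)"),
    ("4.3 Population density (Profile)", "5 354 /km² (2015)"),
]

# Hypothetical output location; any writable path works.
csv_path = os.path.join(tempfile.gettempdir(), "indicators.csv")

# newline="" lets the csv module control line endings;
# encoding="utf-8" keeps non-ASCII characters such as "²" intact.
with open(csv_path, "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["Indicator", "Value"])
    writer.writerows(rows)

# Read the file back to confirm the round trip.
with open(csv_path, newline="", encoding="utf-8") as src:
    read_back = list(csv.reader(src))
```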

By the way, in case you still want to use urlopen: urllib2 no longer exists in Python 3, so the import is actually

from urllib.request import urlopen
html = urlopen('http://open.dataforcities.org/details?4[]=2016').read()
