
Python: how to extract data from a text?

I am using the BeautifulSoup library to get data from this web page:

http://open.dataforcities.org/details?4[]=2016

import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://open.dataforcities.org/details?4[]=2016').read())

Now soup looks like the following (I am showing only part of it):

soup('table'):
[<table>\n<tr class="theme-cells" id="profile_indicators" ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )'>\n<td class="theme-text">\n<h1>4 Profile Indicators</h1>\n</td>\n<td class="metrics">\n<div class="metric-p metric-title"></div>\n</td>\n</tr>\n<tr class="indicator-cells" ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )' onmouseout="$(this).removeClass('indicator-cells-hover')" onmouseover="$(this).addClass('indicator-cells-hover')">\n<td class="indicator-text">\n<h2>4.1 Total city population (Profile)</h2>\n</td>\n<td class="metrics">\n<div class="metric-p metric-title"></div>\n<div class="metric-p-also bigger">669 469   (2015)</div>\n<div class="full-bar" style="width:100%">\n<div class="metric-bar" style="width:3.6411942141077174%; background-color:#ffffff"></div>\n</div>\n</td>\n</tr>\n<tr class="indicator-cells" ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )' onmouseout="$(this).removeClass('indicator-cells-hover')" onmouseover="$(this).addClass('indicator-cells-hover')">\n<td class="indicator-text">\n<h2>4.2 City land area (Profile)</h2>\n</td>\n<td class="metrics">\n<div class="metric-p metric-title"></div>\n<div class="metric-p-also bigger">125 km\xb2 (2010)</div>\n<div class="full-bar" style="width:100%">\n<div class="metric-bar" style="width:1.9604120789229098%; background-color:#ffffff"></div>\n</div>\n</td>\n</tr>\n<tr class="indicator-cells" ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )' onmouseout="$(this).removeClass('indicator-cells-hover')" onmouseover="$(this).addClass('indicator-cells-hover')">\n<td class="indicator-text">\n<h2>4.3 Population density (Profile)</h2>\n</td>\n<td class="metrics">\n<div class="metric-p metric-title"></div>\n<div class="metric-p-also bigger">5 354 /km\xb2 (2015)</div>\n<div class="full-bar" style="width:100%">\n<div class="metric-bar" style="width:27.890485963282238%; background-color:#ffffff"></div>\n</div>\n</td>\n</tr>\n<tr class="indicator-cells" 
ng-mouseover='updateIndicatorsScroll( "Profile Indicators" )'

How can I extract the data from soup? If I follow the example in "Web scraping with Python", I get the following error:

soup = BeautifulSoup(urllib2.urlopen('http://open.dataforcities.org/details?4[]=2016').read())

for row in soup('table', {'class': 'metrics'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string

IndexError                                Traceback (most recent call last)
<ipython-input-71-d688ff354182> in <module>()
----> 1 for row in soup('table', {'class': 'metrics'})[0].tbody('tr'):
      2     tds = row('td')
      3     print tds[0].string, tds[1].string

IndexError: list index out of range

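The table in the HTML does not have the class 'metrics', so your expression (equivalent to the selector 'table.metrics') returns an empty list, and trying to select its first item raises an IndexError.

Since the page contains only one table, and that table has no attributes, you can get all of its rows with the selector 'table tr':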
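The failure can be reproduced on a small fragment. The sketch below uses a shortened, hand-written copy of the markup from the question (the `SAMPLE_HTML` string is an assumption, not the live page) and shows that the class 'metrics' sits on the `<td>` cells rather than on the `<table>`, so the table lookup comes back empty.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page markup (an assumption, not the live page).
SAMPLE_HTML = """
<table>
<tr class="indicator-cells">
<td class="indicator-text"><h2>4.1 Total city population (Profile)</h2></td>
<td class="metrics"><div class="metric-p-also bigger">669 469   (2015)</div></td>
</tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Calling soup(...) is shorthand for soup.find_all(...).
tables_with_class = soup("table", {"class": "metrics"})
cells_with_class = soup("td", {"class": "metrics"})

# No <table> carries the class, so indexing the first result with [0]
# would raise IndexError; the <td> lookup does find a match.
print(len(tables_with_class))
print(len(cells_with_class))
```

Checking the length of the result (or using `soup.find(...)` and testing for `None`) before indexing avoids the exception.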

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://open.dataforcities.org/details?4[]=2016').read()
soup = BeautifulSoup(html, 'html.parser')

for row in soup.select('table tr'):
    tds = row('td')
    print tds[0].text.strip(), tds[1].text.strip()

Also make sure you are using bs4 rather than bs3, and upgrade to Python 3 if you can.
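For readers already on Python 3, the same row loop might look like the sketch below. It runs against a trimmed copy of the markup from the question instead of the live page; `SAMPLE_HTML` and the helper name `extract_rows` are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup shown in the question (an assumption, not the live page).
SAMPLE_HTML = """
<table>
<tr class="theme-cells" id="profile_indicators">
<td class="theme-text">
<h1>4 Profile Indicators</h1>
</td>
<td class="metrics">
<div class="metric-p metric-title"></div>
</td>
</tr>
<tr class="indicator-cells">
<td class="indicator-text">
<h2>4.1 Total city population (Profile)</h2>
</td>
<td class="metrics">
<div class="metric-p metric-title"></div>
<div class="metric-p-also bigger">669 469   (2015)</div>
</td>
</tr>
<tr class="indicator-cells">
<td class="indicator-text">
<h2>4.2 City land area (Profile)</h2>
</td>
<td class="metrics">
<div class="metric-p metric-title"></div>
<div class="metric-p-also bigger">125 km\u00b2 (2010)</div>
</td>
</tr>
</table>
"""


def extract_rows(html):
    """Return (name, value) pairs, one per table row (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.select("table tr"):
        tds = row("td")
        if len(tds) < 2:
            continue  # skip rows that lack a name cell and a value cell
        name = tds[0].get_text(strip=True)
        # Collapse newlines and runs of spaces inside the value cell.
        value = " ".join(tds[1].get_text().split())
        rows.append((name, value))
    return rows


rows = extract_rows(SAMPLE_HTML)
for name, value in rows:
    print(name, "|", value)
```

The heading row ("4 Profile Indicators") has no metric, so its value comes back as an empty string; filter on the `indicator-cells` class if only indicator rows are wanted.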

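Basically, this code extracts your data and saves it to a CSV file that you can then work with (by the way, I think your data is incomplete). I suggest opening the link in a browser and saving the page as an HTML file, because fetching it with urlopen raises a UnicodeEncodeError.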

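from bs4 import BeautifulSoup
import csv

# Parse the locally saved copy of the page (UTF-8 encoding is assumed).
with open("Yourfile.html", encoding="utf-8") as html_file:
    soup = BeautifulSoup(html_file, "html.parser")

# newline="" lets the csv module control line endings (Python 3).
with open("file.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Information"])
    # Each indicator name is the text of an <h2> element.
    for h2 in soup.find_all("h2"):
        writer.writerow([h2.get_text(strip=True)])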
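The code above saves only the indicator names. If the values are wanted as well, a two-column CSV could be written along these lines. This is a sketch under assumptions: the `pairs` list is hand-copied from the output in the question, and `write_indicators` is a hypothetical helper, not part of the original answer.

```python
import csv
import io


def write_indicators(pairs, fileobj):
    """Write (name, value) pairs as a two-column CSV (hypothetical helper)."""
    writer = csv.writer(fileobj)
    writer.writerow(["Indicator", "Value"])
    for name, value in pairs:
        writer.writerow([name, value])


# Sample pairs copied by hand from the question's output (an assumption).
pairs = [
    ("4.1 Total city population (Profile)", "669 469 (2015)"),
    ("4.2 City land area (Profile)", "125 km\u00b2 (2010)"),
]

# An in-memory buffer stands in for a real file so nothing is written to disk.
buffer = io.StringIO()
write_indicators(pairs, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk in Python 3, pass a file opened with `open("file.csv", "w", newline="", encoding="utf-8")` instead of the buffer; the explicit UTF-8 encoding keeps characters such as ² intact.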

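By the way, if you still want to use urlopen: urllib2 no longer exists in Python 3, where urlopen lives in urllib.request, so the code is actually

from urllib.request import urlopen
html = urlopen('http://open.dataforcities.org/details?4[]=2016').read()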
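In Python 3, `read()` returns bytes, so the body usually needs decoding before it is treated as text. A minimal sketch follows; the helper name `fetch_html`, the 30-second timeout, and the UTF-8 fallback are assumptions, not something the original answer specifies.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=30):
    """Download a page and return it as str rather than bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Prefer the charset declared in the response headers;
        # fall back to UTF-8 if none is given (an assumption).
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (needs network access, so it is left commented out):
# html = fetch_html('http://open.dataforcities.org/details?4[]=2016')
```

The returned string can be passed straight to `BeautifulSoup(html, 'html.parser')`.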
