Scraping a table with row labels in Python using Beautiful Soup
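I'm trying to scrape a table that has row labels from a website. I can get the actual data out of the table, but I can't figure out how to get the row labels as well.
Here is my code so far: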
import numpy as np
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
url = "http://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/Rp-eng.cfm?TABID=2&LANG=E&A=R&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=01&GL=-1&GID=1341679&GK=1&GRP=1&O=D&PID=110719&PRID=10&PTYPE=109445&S=0&SHOWALL=0&SUB=0&Temporal=2017&THEME=125&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"
res = urllib.request.urlopen(url)
html = res.read()
## parse with BeautifulSoup
bs = BeautifulSoup(html, "html.parser")
tables = bs.find_all("table")
table = tables[0]
df = pd.DataFrame()
rows = table.find_all("tr")
# extract the first column name (Employment income groups (18))
column_names = []
header_cells = rows[0].find_all("th")
for cell in header_cells:
    header = cell.text
    header = header.strip()
    header = header.replace("\n", " ")
    column_names.append(header)
# extract the rest of the column names
header_cells = rows[1].find_all("th")
for cell in header_cells:
    header = cell.text
    header = header.strip()
    header = header.replace("\n", " ")
    column_names.append(header)
# this is an extra label
column_names.remove('Main mode of commuting (10)')
# get the data from the table
data = []
for row in rows[2:]:
    ## create an empty tuple
    dt = ()
    cells = row.find_all("td")
    for cell in cells:
        ## dp stands for "data point"
        font = cell.find("font")
        if font is not None:
            dp = font.text
        else:
            dp = cell.text
        dp = dp.strip()
        dp = dp.replace("\n", " ")
        ## add to tuple
        dt = dt + (dp,)
    data.append(dt)
df = pd.DataFrame(data, columns = column_names)
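Creating the DataFrame raises an error, because the code above only extracts the cells that hold data points and not the first cell of each row, which holds the row label.
That is, there are 11 column names but each tuple has only 10 values, because the row label (e.g., Total - Employment income) sits in a "th" cell rather than a "td" cell and is therefore not extracted.
How can I get the row label and put it into the tuple while processing the rest of the data in the table?
Thank you for your help.
(In case the code is unclear, the table I'm trying to scrape is on this site)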
Use table.find_all('th', {'headers': 'col-0'}) to find the row labels:
lab = []
labels = table.find_all('th', {'headers': 'col-0'})
for label in labels:
    # keep only the label text, dropping the trailing "($)Footnote" marker
    text = label.text.strip()
    text = text.split("($)Footnote", 1)[0]
    lab.append(text)
    #print(text)
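To put each label into the same tuple as its data, one option is to read the "th" and "td" cells row by row. The sketch below runs against a small made-up table rather than the Statistics Canada page: `SAMPLE_HTML`, its values, and the `clean` helper are illustrative assumptions, and it assumes every body row holds exactly one "th" label followed by its "td" cells.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page: one header row, then body rows whose
# first cell is a <th> row label followed by <td> data cells.
SAMPLE_HTML = """
<table>
  <tr><th>Employment income groups</th><th>Total</th><th>Car</th></tr>
  <tr><th headers="col-0">Total - Employment income</th><td>30</td><td>22</td></tr>
  <tr><th headers="col-0">With employment income</th><td>20</td><td>15</td></tr>
</table>
"""


def clean(text):
    """Collapse newlines and runs of whitespace into single spaces."""
    return " ".join(text.split())


table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")
rows = table.find_all("tr")

# Header row: every cell is a <th>.
column_names = [clean(th.get_text()) for th in rows[0].find_all("th")]

data = []
for row in rows[1:]:
    label = row.find("th")
    if label is None:
        continue  # skip rows that carry no row label
    values = [clean(td.get_text()) for td in row.find_all("td")]
    # The row label goes first, so each tuple lines up with column_names.
    data.append((clean(label.get_text()), *values))

print(column_names)
print(data)
```

Because each tuple now has as many entries as there are column names, passing `data` and `column_names` to `pd.DataFrame` should no longer fail on a length mismatch.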
Edit: using pandas.read_html
import numpy as np
import pandas as pd
import urllib.request
from io import StringIO
from bs4 import BeautifulSoup
url = "http://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/Rp-eng.cfm?TABID=2&LANG=E&A=R&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=01&GL=-1&GID=1341679&GK=1&GRP=1&O=D&PID=110719&PRID=10&PTYPE=109445&S=0&SHOWALL=0&SUB=0&Temporal=2017&THEME=125&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"
res = urllib.request.urlopen(url)
html = res.read()
## parse with BeautifulSoup
bs = BeautifulSoup(html, "html.parser")
tables = bs.find_all("table")
# wrap the markup in StringIO: recent pandas versions deprecate passing a literal HTML string
df = (pd.read_html(StringIO(str(tables))))[0]
#print(df)
# replace the parsed headers with flat column names
columns = ['Employment income groups (18)', 'Total - Main mode of commuting', 'Car, truck or van', 'Driver, alone',
           '2 or more persons shared the ride to work', 'Driver, with 1 or more passengers',
           'Passenger, 2 or more persons in the vehicle', 'Sustainable transportation',
           'Public transit', 'Active transport', 'Other method']
df.columns = columns
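pandas.read_html treats "th" cells in body rows like ordinary cells, so the row labels come through as the first column with no extra handling. Here is a minimal sketch on a made-up table (`SAMPLE_HTML` and its values are assumptions, not data from the real page); it also assumes a parser that pandas supports, such as lxml, is installed.

```python
from io import StringIO

import pandas as pd

# Made-up table: the first row is all <th> (column headers); each later row
# starts with a <th> row label followed by <td> data cells.
SAMPLE_HTML = """
<table>
  <tr><th>Employment income groups</th><th>Total</th><th>Car</th></tr>
  <tr><th>Total - Employment income</th><td>30</td><td>22</td></tr>
  <tr><th>With employment income</th><td>20</td><td>15</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
df = pd.read_html(StringIO(SAMPLE_HTML))[0]

# Assign flat column names, as in the code above.
df.columns = ["Employment income groups", "Total", "Car"]

print(df)
```

The row labels end up in the "Employment income groups" column, so a row can then be selected by its label, for example with `df[df["Employment income groups"] == "With employment income"]`.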
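Edit 2: The elements could not be accessed by index because the string was not the right string (the "Employment income groups (18)" column label). I have edited the code again.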