Scraping a table with row labels in Python using Beautiful Soup

Question

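I am trying to scrape a table with row labels from a website. I can get the actual data out of the table, but I cannot figure out how to get the row labels as well.

Here is my code so far: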

import numpy as np
import pandas as pd  
import urllib.request
from bs4 import BeautifulSoup

url = "http://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/Rp-eng.cfm?TABID=2&LANG=E&A=R&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=01&GL=-1&GID=1341679&GK=1&GRP=1&O=D&PID=110719&PRID=10&PTYPE=109445&S=0&SHOWALL=0&SUB=0&Temporal=2017&THEME=125&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"
res = urllib.request.urlopen(url)

html = res.read()

## parse with BeautifulSoup
bs = BeautifulSoup(html, "html.parser")

tables = bs.find_all("table")
table = tables[0]

df = pd.DataFrame()

rows = table.find_all("tr")

#extract the first column name (Employment income groups (18))
column_names = []
header_cells = rows[0].find_all("th") 

for cell in header_cells:
    header = cell.text
    header = header.strip()
    header = header.replace("\n", " ")
    column_names.append(header)

#extract the rest of the column names
header_cells = rows[1].find_all("th") 

for cell in header_cells:
    header = cell.text
    header = header.strip()
    header = header.replace("\n", " ")
    column_names.append(header)

#this is an extra label
column_names.remove('Main mode of commuting (10)')

#get the data from the table
data = []
for row in rows[2:]:

    ## create an empty tuple
    dt = ()

    cells = row.find_all("td")

    for cell in cells:
        ## dp stands for "data point"
        font = cell.find("font")

        if font is not None:
            dp = font.text
        else:
            dp = cell.text

        dp = dp.strip()
        dp = dp.replace("\n", " ")

        ## add to tuple
        dt = dt + (dp,)
    data.append(dt)

df = pd.DataFrame(data, columns = column_names)

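Creating the DataFrame raises an error, because the code above only extracts the cells that hold data points and skips the first cell of each row, which holds the row label.

That is, there are 11 column names, but each tuple has only 10 values: the row label (e.g. "Total - Employment income") sits in a th cell rather than a td cell, so it is never extracted.

How can I get the row label and put it into the tuple while processing the rest of the data in the table?

Thanks for your help.

(In case the code is unclear, the table I am trying to scrape is the one at the URL above.)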

Answer 1

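Use table.find_all('th', {'headers': 'col-0'}) to find the row labels:

# `table` is the first <table> element found by the code in the question
labels = table.find_all('th', {'headers': 'col-0'})

lab = []
for label in labels:
    # strip surrounding whitespace, then cut off the trailing "($)Footnote..." text
    data = label.text.strip()
    data = data.split("($)Footnote", 1)[0]

    lab.append(data)
    #print(data)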
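
To get each row label into the same tuple as that row's data cells, one option is to collect the th and td cells of a row together. This is only a minimal sketch on a small made-up table: the HTML, the labels "Group A"/"Group B" and the numbers are illustrative assumptions, not the Statistics Canada data.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: one header row, then rows whose
# first cell is a <th> row label followed by <td> data cells.
html = """
<table>
  <tr><th>Income group</th><th>Total</th><th>Car</th></tr>
  <tr><th headers="col-0">Group A</th><td>100</td><td>60</td></tr>
  <tr><th headers="col-0">Group B</th><td>200</td><td>150</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")
rows = table.find_all("tr")

# Column names come from the header row.
column_names = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]

data = []
for row in rows[1:]:
    # A list of tag names matches both <th> and <td>, in document order,
    # so the row label ends up as the first value of the tuple.
    cells = row.find_all(["th", "td"])
    data.append(tuple(cell.get_text(strip=True) for cell in cells))

print(column_names)
print(data)
```

With this approach every tuple has as many values as there are column names, so pd.DataFrame(data, columns=column_names) should line up. On the real page the header spans two rows, so the header handling from the question would still be needed.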

Edit: using pandas.read_html

import io
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

url = "http://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/Rp-eng.cfm?TABID=2&LANG=E&A=R&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=01&GL=-1&GID=1341679&GK=1&GRP=1&O=D&PID=110719&PRID=10&PTYPE=109445&S=0&SHOWALL=0&SUB=0&Temporal=2017&THEME=125&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"
res = urllib.request.urlopen(url)

html = res.read()

## parse with BeautifulSoup
bs = BeautifulSoup(html, "html.parser")

tables = bs.find_all("table")

## read_html returns a list of DataFrames; the markup is wrapped in StringIO
## because newer pandas versions deprecate passing a literal HTML string
df = pd.read_html(io.StringIO(str(tables[0])))[0]
#print(df)

## replace the scraped headers with clean column names
columns = ['Employment income groups (18)','Total - Main mode of commuting','Car, truck or van','Driver, alone',
          '2 or more persons shared the ride to work','Driver, with 1 or more passengers',
         'Passenger, 2 or more persons in the vehicle','Sustainable transportation',
         'Public transit','Active transport','Other method']
df.columns = columns

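Edit 2: The elements could not be accessed by index because the string was not the right string (the "Employment income groups (18)" column label). I have edited the code again.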
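
The edit does not say what exactly was wrong with the string. One common cause with scraped headers is stray whitespace such as line breaks or non-breaking spaces, and the sketch below assumes that is the problem. The scraped header and the data rows are made up for illustration.

```python
import pandas as pd

# Hypothetical scraped headers: the first one carries a line break and a
# non-breaking space, so it differs from the string you would type by hand.
scraped = ["Employment income\n groups\xa0(18)", "Total", "Car"]
df = pd.DataFrame([["Group A", 100, 60], ["Group B", 200, 150]], columns=scraped)

# Collapse every run of whitespace (including \n and \xa0) into one space.
df.columns = [" ".join(name.split()) for name in df.columns]

# The label column can now be addressed by its plain name and used as the index.
df = df.set_index("Employment income groups (18)")
print(df.loc["Group A", "Total"])
```

Assigning an explicit list to df.columns, as in the code above, sidesteps the issue entirely; normalizing the scraped names is an alternative when the list of columns is not known in advance.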
