Scraping a table with row labels in Python using Beautiful Soup

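I'm trying to scrape a table that has row labels from a website. I can get the actual data out of the table, but I can't figure out how to get the row labels as well.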

Here is my code so far:

import numpy as np
import pandas as pd  
import urllib.request
from bs4 import BeautifulSoup

url = "http://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/Rp-eng.cfm?TABID=2&LANG=E&A=R&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=01&GL=-1&GID=1341679&GK=1&GRP=1&O=D&PID=110719&PRID=10&PTYPE=109445&S=0&SHOWALL=0&SUB=0&Temporal=2017&THEME=125&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"
res = urllib.request.urlopen(url)

html = res.read()

## parse with BeautifulSoup
bs = BeautifulSoup(html, "html.parser")

tables = bs.find_all("table")
table = tables[0]

df = pd.DataFrame()

rows = table.find_all("tr")

#extract the first column name (Employment income groups (18))
column_names = []
header_cells = rows[0].find_all("th") 

for cell in header_cells:
    header = cell.text
    header = header.strip()
    header = header.replace("\n", " ")
    column_names.append(header)

#extract the rest of the column names
header_cells = rows[1].find_all("th") 

for cell in header_cells:
    header = cell.text
    header = header.strip()
    header = header.replace("\n", " ")
    column_names.append(header)

#this is an extra label
column_names.remove('Main mode of commuting (10)')

#get the data from the table
data = []
for row in rows[2:]:

    ## create an empty tuple
    dt = ()

    cells = row.find_all("td")

    for cell in cells:
        ## dp stands for "data point"
        font = cell.find("font")

        if font is not None:
            dp = font.text
        else:
            dp = cell.text

        dp = dp.strip()
        dp = dp.replace("\n", " ")

        ## add to tuple
        dt = dt + (dp,)
    data.append(dt)

df = pd.DataFrame(data, columns = column_names)

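Creating the DataFrame raises an error, because the code above only extracts the cells that hold data points. It never extracts the first cell of each row, which contains the row label.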

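That is, there are 11 column names but each tuple has only 10 values: the row label (e.g. Total - Employment income) is not in a td cell, so row.find_all("td") does not pick it up.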
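The mismatch can be reproduced on a much smaller table. The sketch below uses a made-up two-row HTML snippet with hypothetical values, not the Statistics Canada page, and assumes the row label sits in a th cell. It only illustrates that find_all("td") skips such a label.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table: the row label is a <th>, the values are <td>.
html = """
<table>
  <tr><th>Income group</th><th>Total</th><th>Car</th></tr>
  <tr><th>Total - Employment income</th><td>100</td><td>80</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

column_names = [th.get_text(strip=True) for th in rows[0].find_all("th")]
values = tuple(td.get_text(strip=True) for td in rows[1].find_all("td"))

# Three column names but only two values: the <th> label was never collected.
print(len(column_names), len(values))
```

Comparing the two lengths before building the DataFrame is a quick way to confirm where the missing column comes from.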

How can I get the row label and put it into the tuple while processing the rest of the data in the table?

Thanks for your help.

(In case the code is unclear, the table I'm trying to scrape is on this site.)

Use table.find_all('th', {'headers': 'col-0'}) to find the row labels:

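# Row labels are <th> cells whose "headers" attribute is "col-0"
lab = []
labels = table.find_all('th', {'headers': 'col-0'})
for label in labels:
    # Strip surrounding whitespace, then drop the "($)Footnote..." suffix
    text = label.text.strip()
    text = text.split("($)Footnote", 1)[0]
    lab.append(text)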
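To put each label into the same tuple as its row's values, as the question asks, one option is to collect th and td cells together. The following is a minimal sketch on a hypothetical inline table (the HTML and numbers are invented for illustration); it relies on find_all accepting a list of tag names and returning matches in document order.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: labels in <th>, values in <td>.
html = """
<table>
  <tr><th id="col-0">Income group</th><th>Total</th><th>Car</th></tr>
  <tr><th headers="col-0">Total - Employment income</th><td>100</td><td>80</td></tr>
  <tr><th headers="col-0">Without employment income</th><td>10</td><td>7</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table").find_all("tr")

data = []
for row in rows[1:]:
    # A list of tag names matches either tag; results keep document order,
    # so the <th> label comes first, followed by the <td> values.
    cells = row.find_all(["th", "td"])
    # Collapse internal whitespace and newlines in each cell's text.
    data.append(tuple(" ".join(cell.get_text().split()) for cell in cells))

print(data)
```

Each tuple then has one entry per column name, so passing the list to pd.DataFrame together with the column names should no longer fail on a length mismatch.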

Edit: using pandas.read_html

import numpy as np
import pandas as pd  
import urllib.request
from bs4 import BeautifulSoup

url = "http://www12.statcan.gc.ca/census-recensement/2016/dp-pd/dt-td/Rp-eng.cfm?TABID=2&LANG=E&A=R&APATH=3&DETAIL=0&DIM=0&FL=A&FREE=0&GC=01&GL=-1&GID=1341679&GK=1&GRP=1&O=D&PID=110719&PRID=10&PTYPE=109445&S=0&SHOWALL=0&SUB=0&Temporal=2017&THEME=125&VID=0&VNAMEE=&VNAMEF=&D1=0&D2=0&D3=0&D4=0&D5=0&D6=0"
res = urllib.request.urlopen(url)

html = res.read()

## parse with BeautifulSoup
bs = BeautifulSoup(html, "html.parser")

tables = bs.find_all("table")

df = (pd.read_html(str(tables)))[0]
#print(df)
columns = ['Employment income groups (18)','Total - Main mode of commuting','Car, truck or van','Driver, alone',
          '2 or more persons shared the ride to work','Driver, with 1 or more passengers',
         'Passenger, 2 or more persons in the vehicle','Sustainable transportation',
         'Public transit','Active transport','Other method']
df.columns = columns

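Edit 2: Elements could not be accessed by column label because the label string was not the exact string expected (the Employment income groups (18) column label). I have edited the code again.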
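The answer does not say what was wrong with the label string. One common cause, assumed here, is stray whitespace such as newlines or non-breaking spaces inside the scraped header text. The helper below, normalize_label, is a hypothetical name introduced for illustration; it collapses every run of whitespace to a single space so that labels compare equal to the strings you type.

```python
def normalize_label(text: str) -> str:
    """Collapse whitespace runs (spaces, newlines, non-breaking spaces) to one space."""
    # str.split() with no argument splits on any Unicode whitespace
    # and discards leading and trailing whitespace.
    return " ".join(text.split())


# Hypothetical scraped header containing a newline and a non-breaking space.
raw = "Employment income\n groups\xa0(18) "
clean = normalize_label(raw)
print(repr(clean))
```

Applying such a function to every column name before indexing, for example with df.columns = [normalize_label(c) for c in df.columns], avoids having to overwrite the labels by hand.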

