Return table information based on a condition in Beautiful Soup/Python

Question

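I am trying to scrape this page: https://www.nysenate.gov/legislation/bills/2019/s8450

I only want to extract information from the table (the one that appears when you click "View Actions") if it contains the following string: "Delivered To Governor".

I can iterate over the table, but I am having trouble stripping out all of the extra markup.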

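import requests
from bs4 import BeautifulSoup

url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")

# Grab the first <tbody>; the result still includes all of its markup
bill_life_cycle_table = soup.find("tbody")
bill_life_cycle_table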

Answer 1

Use the bs4.element.Tag.text property:

from bs4 import BeautifulSoup
import requests
url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
bill_life_cycle_table = soup.find("tbody")
print(bill_life_cycle_table.text)

Output:


Dec 11, 2020
delivered to governor

Jul 23, 2020
returned to assemblypassed senate3rd reading cal.908substituted for s8450c

Jul 23, 2020
substituted by a10500c

Jul 22, 2020
ordered to third reading cal.908

Jul 20, 2020
reported and committed to rules

Jul 18, 2020
print number 8450c

Jul 18, 2020
amend and recommit to health

Jul 09, 2020
print number 8450b

Jul 09, 2020
amend and recommit to health

Jun 05, 2020
print number 8450a

Jun 05, 2020
amend and recommit to health

Jun 03, 2020
referred to health

Update:

To print only the date when the condition matches:

from bs4 import BeautifulSoup
import requests
url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
bill_life_cycle_table = soup.find("tbody").text.splitlines()
for a, b in zip(bill_life_cycle_table, bill_life_cycle_table[1:]):
    if b.title() == "Delivered To Governor":
        print(a)

Output:

Dec 11, 2020
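
The zip-based loop above depends on the date and the action landing on adjacent lines of the flattened text. As a hedged alternative, the sketch below pairs the cells row by row instead. SAMPLE_HTML is a hypothetical, simplified stand-in for the real page (its exact markup is an assumption), and table_rows is an illustrative helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the bill actions table (assumed markup).
SAMPLE_HTML = """
<table class="table c-bill--actions-table">
  <tbody>
    <tr><td>Dec 11, 2020</td><td>delivered to governor</td></tr>
    <tr><td>Jul 23, 2020</td><td>returned to assembly<br>passed senate</td></tr>
    <tr><td>Jun 03, 2020</td><td>referred to health</td></tr>
  </tbody>
</table>
"""


def table_rows(html):
    """Return a list of (date, action) tuples, one per table row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("tbody tr"):
        # A separator keeps the pieces of a multi-line cell from running together.
        cells = [td.get_text(" ", strip=True) for td in tr.find_all("td")]
        if len(cells) >= 2:
            rows.append((cells[0], cells[1]))
    return rows


rows = table_rows(SAMPLE_HTML)
# Compare in lowercase so the capitalisation of the search phrase does not matter.
matches = [date for date, action in rows if action.lower() == "delivered to governor"]
print(matches)
```

Passing a separator to get_text also avoids run-together text such as "returned to assemblypassed senate" in the first output listing.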

Answer 2

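You can use an if condition to check whether the string is present in a cell and, if so, look up the value of the preceding cell. This uses the CSS selector methods select_one() and select():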

from bs4 import BeautifulSoup
import requests

url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
tablebody = soup.select_one(".table.c-bill--actions-table > tbody")
for item in tablebody.select("td"):
    if "delivered to governor" in item.text:
        print(item.find_previous("td").text)

Console output:

Dec 11, 2020
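
If the table may be missing, or the capitalisation of the phrase is not known in advance, the same idea can be wrapped in a small function. This is only a sketch: SAMPLE_HTML is a hypothetical, simplified page that reuses the class names from the selector above, and date_for_action is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the class names mirror the selector used above.
SAMPLE_HTML = """
<table class="table c-bill--actions-table">
  <tbody>
    <tr><td>Dec 11, 2020</td><td>delivered to governor</td></tr>
    <tr><td>Jun 03, 2020</td><td>referred to health</td></tr>
  </tbody>
</table>
"""


def date_for_action(html, phrase):
    """Return the date cell before the first cell containing phrase, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.select_one(".table.c-bill--actions-table > tbody")
    if tbody is None:
        return None  # no matching table, for example if the page layout changed
    for cell in tbody.select("td"):
        # Case-insensitive substring test on the cell text.
        if phrase.lower() in cell.get_text().lower():
            previous = cell.find_previous_sibling("td")
            if previous is not None:
                return previous.get_text(strip=True)
    return None


print(date_for_action(SAMPLE_HTML, "Delivered To Governor"))
```

find_previous_sibling("td") stays within the same row, whereas find_previous("td") walks backwards through the whole document.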

Answer 3

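You can use pandas to read the <table> tags (it can use BeautifulSoup under the hood, depending on which parser is installed). Then filter on the action column and return the date.

Code: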

import pandas as pd

url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
df = pd.read_html(url)[0]

date = df[df.iloc[:,-1] == 'delivered to governor'].iloc[0,0]

Output:

print(date)
Dec 11, 2020
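
The filter step can be tried in isolation. In the sketch below, a hand-built DataFrame stands in for the result of pd.read_html(url)[0]; its column names and rows are assumptions for illustration only. The match is case-insensitive and yields an empty list, rather than an IndexError, when no row matches.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_html(url)[0]; column names are assumptions.
df = pd.DataFrame(
    {
        "Date": ["Dec 11, 2020", "Jul 23, 2020", "Jun 03, 2020"],
        "Action": [
            "delivered to governor",
            "substituted by a10500c",
            "referred to health",
        ],
    }
)

# Case-insensitive match on the last column; na=False treats missing cells as non-matches.
mask = df.iloc[:, -1].str.contains("delivered to governor", case=False, na=False)

# Collect the first-column values (the dates) of every matching row.
dates = df.loc[mask].iloc[:, 0].tolist()
print(dates)
```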

3 answers

Answer 1
1 2020-12-16 13:39:27

Answer 2
1 Accepted 2020-12-16 13:56:42

Answer 3
1 2020-12-16 14:16:44
