
Returning table information based on a condition in Beautiful Soup/Python

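I am trying to scrape this page: https://www.nysenate.gov/legislation/bills/2019/s8450

I only want to extract information from the table (the one that appears when you click "View Actions") if it contains the following string: "Delivered To Governor"

I can iterate over the table, but I am having trouble stripping out all the extra markup text.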

import requests
from bs4 import BeautifulSoup

url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")

bill_life_cycle_table = soup.find("tbody")
bill_life_cycle_table

Use the .text property of bs4.element.Tag:

from bs4 import BeautifulSoup
import requests
url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
bill_life_cycle_table = soup.find("tbody")
print(bill_life_cycle_table.text)

Output:


Dec 11, 2020
delivered to governor

Jul 23, 2020
returned to assemblypassed senate3rd reading cal.908substituted for s8450c

Jul 23, 2020
substituted by a10500c

Jul 22, 2020
ordered to third reading cal.908

Jul 20, 2020
reported and committed to rules

Jul 18, 2020
print number 8450c

Jul 18, 2020
amend and recommit to health

Jul 09, 2020
print number 8450b

Jul 09, 2020
amend and recommit to health

Jun 05, 2020
print number 8450a

Jun 05, 2020
amend and recommit to health

Jun 03, 2020
referred to health 
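
The fused line in the output above ("returned to assemblypassed senate...") appears because .text joins the strings of nested elements with no separator. Here is a minimal sketch of how get_text() with a separator keeps the pieces apart. It assumes a hypothetical HTML fragment that only imitates the shape of the actions table; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that only imitates the actions table;
# the markup on the real page may differ.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr>
      <td>Jul 23, 2020</td>
      <td><span>returned to assembly</span><br><span>passed senate</span></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
cell = soup.find_all("td")[1]

# .text concatenates nested strings with no separator between them.
fused = cell.text

# get_text() can insert a separator and strip whitespace from each piece.
separated = cell.get_text(separator=" | ", strip=True)

print(fused)
print(separated)
```

The separator string is arbitrary; a single space or a newline works just as well if the pieces are split again later.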

Update:

To print the date when the condition is met:

from bs4 import BeautifulSoup
import requests
url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
bill_life_cycle_table = soup.find("tbody").text.splitlines()
for a, b in zip(bill_life_cycle_table, bill_life_cycle_table[1:]):
    if b.title() == "Delivered To Governor":
        print(a)

Output:

Dec 11, 2020
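
The zip over splitlines() pairs each line with the one after it, so it relies on the date and the action landing on adjacent lines of text. A possible alternative, sketched here on a hypothetical HTML fragment (not the real page), walks the rows and reads the two cells of each row directly. The function name dates_for_action is illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating two rows of the actions table.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>Dec 11, 2020</td><td>delivered to governor</td></tr>
    <tr><td>Jul 22, 2020</td><td>ordered to third reading cal.908</td></tr>
  </tbody>
</table>
"""


def dates_for_action(html, action):
    """Return the dates of all rows whose action cell equals `action`, ignoring case."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for row in soup.find_all("tr"):
        cells = [td.get_text(" ", strip=True) for td in row.find_all("td")]
        # Skip header rows or rows that lack a date cell and an action cell.
        if len(cells) < 2:
            continue
        date, row_action = cells[0], cells[1]
        if row_action.casefold() == action.casefold():
            matches.append(date)
    return matches


print(dates_for_action(SAMPLE_HTML, "Delivered To Governor"))
```

Comparing with casefold() on both sides avoids depending on how the page capitalizes the action text.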

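You can use an if condition to check whether the string is present in a cell and then look up the previous cell's value, using the CSS selector method select().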

from bs4 import BeautifulSoup
import requests

url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
tablebody=soup.select_one(".table.c-bill--actions-table > tbody")
for item in tablebody.select("td"):
    if "delivered to governor" in item.text:
        print(item.find_previous("td").text)

Console output:

Dec 11, 2020
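
If the selector matches nothing, select_one() returns None and the loop above raises an AttributeError. Below is a minimal sketch of a more defensive variant, run against a hypothetical fragment that reuses the class names from the selector above. The helper name find_action_date is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the class names are taken from the selector above.
SAMPLE_HTML = """
<table class="table c-bill--actions-table">
  <tbody>
    <tr><td>Dec 11, 2020</td><td>Delivered to Governor</td></tr>
    <tr><td>Jun 03, 2020</td><td>referred to health</td></tr>
  </tbody>
</table>
"""


def find_action_date(html, needle):
    """Return the text of the cell preceding the first cell that contains `needle`."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.select_one("table.c-bill--actions-table > tbody")
    if tbody is None:
        # The table is missing or the page layout has changed.
        return None
    for cell in tbody.select("td"):
        # Case-insensitive substring match on the cell text.
        if needle.lower() in cell.get_text(strip=True).lower():
            previous = cell.find_previous_sibling("td")
            if previous is not None:
                return previous.get_text(strip=True)
    return None


print(find_action_date(SAMPLE_HTML, "delivered to governor"))
```

Using find_previous_sibling() instead of find_previous() keeps the lookup inside the same row, so a match in the first cell of a row cannot pick up a cell from the row before it.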

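You can use pandas' read_html() to read the <table> tags (it parses the HTML with lxml or BeautifulSoup under the hood). Then filter by column and return the date.

Code: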

import pandas as pd

url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
df = pd.read_html(url)[0]

date = df[df.iloc[:,-1] == 'delivered to governor'].iloc[0,0]

Output:

print (date)
Dec 11, 2020
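
The equality test above requires an exact, lowercase match and raises an IndexError when no row matches. A hedged sketch of a more tolerant filter follows. It assumes a hypothetical HTML string with made-up column headers (Date, Action) rather than the live page, and it needs an HTML parser that pandas can use (lxml, or bs4 with html5lib) to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table; the column headers are assumptions for illustration.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Date</th><th>Action</th></tr></thead>
  <tbody>
    <tr><td>Dec 11, 2020</td><td>delivered to governor</td></tr>
    <tr><td>Jun 03, 2020</td><td>referred to health</td></tr>
  </tbody>
</table>
"""

# read_html() returns a list of DataFrames, one per <table> found.
df = pd.read_html(StringIO(SAMPLE_HTML))[0]

# Case-insensitive, literal substring match on the last column;
# na=False treats empty cells as non-matching.
mask = df.iloc[:, -1].str.contains(
    "delivered to governor", case=False, regex=False, na=False
)

# Collect every matching date; the list is empty if nothing matches.
dates = df.loc[mask].iloc[:, 0].tolist()
print(dates)
```

Returning a list instead of indexing with iloc[0, 0] leaves it to the caller to decide what to do when the bill has not yet reached the governor.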

