
Returning table information based on a condition in Beautiful Soup/Python

I'm trying to scrape this page: https://www.nysenate.gov/legislation/bills/2019/s8450

I only want to pull information from the table (the one that appears when you click "view actions") if it contains the string "Delivered To Governor".

I can iterate through the table, but I have trouble stripping away the surrounding tag markup to get just the text.

import requests
from bs4 import BeautifulSoup

url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")

bill_life_cycle_table = soup.find("tbody")
bill_life_cycle_table

Use the text property of bs4.element.Tag, which returns the tag's contents with the markup stripped:

from bs4 import BeautifulSoup
import requests
url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
bill_life_cycle_table = soup.find("tbody")
print(bill_life_cycle_table.text)

Output:


Dec 11, 2020
delivered to governor

Jul 23, 2020
returned to assemblypassed senate3rd reading cal.908substituted for s8450c

Jul 23, 2020
substituted by a10500c

Jul 22, 2020
ordered to third reading cal.908

Jul 20, 2020
reported and committed to rules

Jul 18, 2020
print number 8450c

Jul 18, 2020
amend and recommit to health

Jul 09, 2020
print number 8450b

Jul 09, 2020
amend and recommit to health

Jun 05, 2020
print number 8450a

Jun 05, 2020
amend and recommit to health

Jun 03, 2020
referred to health 

UPDATE:

To print only the date of the "Delivered To Governor" row:

from bs4 import BeautifulSoup
import requests
url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
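# Split the table text into lines, then walk consecutive (date, action) pairs.
bill_life_cycle_table = soup.find("tbody").text.splitlines()
for a, b in zip(bill_life_cycle_table, bill_life_cycle_table[1:]):
    # Each action line follows its date, so print the preceding line on a match.
    if b.title() == "Delivered To Governor":
        print(a)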

Output:

Dec 11, 2020
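
The pairing step above can be tried without a network request. The following is only a sketch: dates_for_action is a hypothetical helper name, and the sample text is copied from part of the output shown earlier. It drops blank lines before pairing and compares case-insensitively, which is an assumption about how loose the match should be.

```python
def dates_for_action(text, action):
    """Return the dates that directly precede `action` in date/action line pairs."""
    # Drop blank lines and surrounding whitespace so the pairs stay aligned.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    wanted = action.casefold()
    # zip(lines, lines[1:]) yields every pair of consecutive lines.
    return [
        date
        for date, act in zip(lines, lines[1:])
        if act.casefold() == wanted
    ]


# Sample text taken from the output shown above (shortened).
sample = """
Dec 11, 2020
delivered to governor

Jul 23, 2020
substituted by a10500c

Jul 22, 2020
ordered to third reading cal.908
"""

found = dates_for_action(sample, "Delivered To Governor")
print(found)  # ['Dec 11, 2020']
```

If the string is not in the table, the function returns an empty list, so the caller can skip the bill.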

You can use an if condition to check whether the string is present in a cell, then use find_previous() to get the value of the cell before it. Select the table body with a CSS selector via select_one():

from bs4 import BeautifulSoup
import requests

url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
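# Narrow the search to the body of the bill actions table.
tablebody = soup.select_one(".table.c-bill--actions-table > tbody")
for item in tablebody.select("td"):
    # The date sits in the cell just before the matching action cell.
    if "delivered to governor" in item.text:
        print(item.find_previous("td").text)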

Console output:

Dec 11, 2020
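
A row-by-row variant may be a bit more robust, since it keeps each date tied to the action in the same row and returns nothing when the table or the string is missing. This is a sketch only: SAMPLE_HTML is hypothetical markup written to match the selector above (the real page may be structured differently), and action_dates is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the selector used above; the real page may differ.
SAMPLE_HTML = """
<table class="table c-bill--actions-table">
  <tbody>
    <tr><td>Dec 11, 2020</td><td>delivered to governor</td></tr>
    <tr><td>Jul 23, 2020</td><td>substituted by a10500c</td></tr>
  </tbody>
</table>
"""


def action_dates(html, action):
    """Return the first-cell text of every row whose second cell contains `action`."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.select_one(".table.c-bill--actions-table > tbody")
    if tbody is None:
        # No actions table on the page: nothing to report.
        return []
    wanted = action.casefold()
    dates = []
    for row in tbody.select("tr"):
        cells = [td.get_text(" ", strip=True) for td in row.select("td")]
        # Case-insensitive substring match on the action cell.
        if len(cells) >= 2 and wanted in cells[1].casefold():
            dates.append(cells[0])
    return dates


dates = action_dates(SAMPLE_HTML, "Delivered To Governor")
no_table = action_dates("<p>no table here</p>", "Delivered To Governor")
print(dates)
```

To run it against the live page, pass requests.get(url).content in place of SAMPLE_HTML.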

You can read the <table> tag with pandas' read_html() (it parses the HTML with lxml or BeautifulSoup under the hood), then filter on the last column and return the date from the first column.

Code:

import pandas as pd

url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
df = pd.read_html(url)[0]

date = df[df.iloc[:,-1] == 'delivered to governor'].iloc[0,0]

Output:

print(date)
Dec 11, 2020
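
Note that .iloc[0,0] raises an IndexError when no row matches, which matters if the goal is to return the date only when the string is present. Below is a sketch of a guarded version. The DataFrame is built by hand as a stand-in for pd.read_html(url)[0], with rows copied from the output shown earlier; first_date_for is a hypothetical helper name, and "signed by governor" is a made-up action used only to show the no-match case.

```python
import pandas as pd

# Stand-in for pd.read_html(url)[0]; rows copied from the output shown earlier.
df = pd.DataFrame(
    [
        ["Dec 11, 2020", "delivered to governor"],
        ["Jul 23, 2020", "substituted by a10500c"],
        ["Jun 03, 2020", "referred to health"],
    ]
)


def first_date_for(frame, action):
    """Return the first-column value of the first row whose last column equals `action`, else None."""
    # Normalize whitespace and case before comparing.
    actions = frame.iloc[:, -1].astype(str).str.strip().str.lower()
    matches = frame[actions == action.strip().lower()]
    return None if matches.empty else matches.iloc[0, 0]


date = first_date_for(df, "Delivered To Governor")
missing = first_date_for(df, "signed by governor")  # hypothetical action, not in the table
print(date)
```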
