I'm trying to scrape this page: https://www.nysenate.gov/legislation/bills/2019/s8450
I only want to pull information from the table (the one that appears when you click "view actions") if it contains the following string: "Delivered To Governor".
I can iterate through the table, but then I have trouble trying to strip away all the extra tag-text.
url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
bill_life_cycle_table = soup.find("tbody")
bill_life_cycle_table
Use the text property of bs4.element.Tag:
from bs4 import BeautifulSoup
import requests
url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
bill_life_cycle_table = soup.find("tbody")
print(bill_life_cycle_table.text)
Output:
Dec 11, 2020
delivered to governor
Jul 23, 2020
returned to assemblypassed senate3rd reading cal.908substituted for s8450c
Jul 23, 2020
substituted by a10500c
Jul 22, 2020
ordered to third reading cal.908
Jul 20, 2020
reported and committed to rules
Jul 18, 2020
print number 8450c
Jul 18, 2020
amend and recommit to health
Jul 09, 2020
print number 8450b
Jul 09, 2020
amend and recommit to health
Jun 05, 2020
print number 8450a
Jun 05, 2020
amend and recommit to health
Jun 03, 2020
referred to health
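Note the run-together line "returned to assemblypassed senate3rd reading cal.908..." in the output above: when one cell holds several pieces of text, .text joins them with no separator. A minimal sketch of one way around that, assuming each action is a <tr> holding a date <td> followed by an action <td>. The inline SAMPLE_HTML is a made-up stand-in for the real page (whose markup may differ), and extract_actions is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's actions table; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>Dec 11, 2020</td><td>delivered to governor</td></tr>
    <tr><td>Jul 23, 2020</td><td>returned to assembly<br>passed senate<br>3rd reading cal.908</td></tr>
  </tbody>
</table>
"""


def extract_actions(html):
    """Return a list of (date, action) tuples, one per table row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("tbody tr"):
        # get_text(separator, strip=True) joins the text fragments of a cell
        # with the separator instead of running them together.
        cells = [td.get_text("; ", strip=True) for td in tr.find_all("td")]
        if len(cells) >= 2:
            rows.append((cells[0], cells[1]))
    return rows


for date, action in extract_actions(SAMPLE_HTML):
    print(date, "|", action)
```

Working row by row also keeps each date paired with its own action, which may be easier to filter than one long block of text.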
UPDATE:
To print only the date of the row whose action is "Delivered To Governor":
from bs4 import BeautifulSoup
import requests
url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
bill_life_cycle_table = soup.find("tbody").text.splitlines()
for a, b in zip(bill_life_cycle_table, bill_life_cycle_table[1:]):
    if b.title() == "Delivered To Governor":
        print(a)
Output:
Dec 11, 2020
You can use an if condition to check whether the string is present in a cell, and then take the value of the previous cell. Use the CSS selector methods select_one() and select():
from bs4 import BeautifulSoup
import requests
url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
tablebody = soup.select_one(".table.c-bill--actions-table > tbody")
for item in tablebody.select("td"):
    if "delivered to governor" in item.text:
        print(item.find_previous("td").text)
Console output:
Dec 11, 2020
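The lookups above assume the table is always present and that the action text is lowercase. If the page layout changes, find() and select_one() return None and the next call fails with an AttributeError. A hedged sketch of a more defensive variant, using a made-up SAMPLE_HTML instead of the live page and a hypothetical helper name find_action_date:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's actions table; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>Dec 11, 2020</td><td>delivered to governor</td></tr>
    <tr><td>Jul 22, 2020</td><td>ordered to third reading cal.908</td></tr>
  </tbody>
</table>
"""


def find_action_date(html, action):
    """Return the date of the first row whose action contains `action`, else None."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")
    if tbody is None:
        # No table body in the document: nothing to search.
        return None
    wanted = action.casefold()  # compare case-insensitively
    for tr in tbody.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 2:
            continue  # skip rows that lack a date/action pair
        if wanted in cells[1].get_text(" ", strip=True).casefold():
            return cells[0].get_text(strip=True)
    return None


print(find_action_date(SAMPLE_HTML, "Delivered To Governor"))
print(find_action_date("<p>no table here</p>", "Delivered To Governor"))
```

Returning None when nothing matches lets the caller decide what to do for a bill that has not been delivered yet.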
You can read the <table> tag with pandas (it relies on an HTML parser such as lxml or BeautifulSoup under the hood), then filter by the last column and return the date from the first column.
Code:
import pandas as pd
url = "https://www.nysenate.gov/legislation/bills/2019/s8450"
df = pd.read_html(url)[0]
date = df[df.iloc[:, -1] == "delivered to governor"].iloc[0, 0]
print(date)
Output:
Dec 11, 2020