I have a web scraper that uses bs4 to scrape pages containing many sections of information. Because many of the sections reuse the same div classes, they are hard to tell apart. I am trying to find a way to start searching the parsed HTML after a specific phrase. Is there a way to do this?
Below is a small sample of what I am working with. I would like something like table_soup to start after a specific phrase.
```python
from bs4 import BeautifulSoup
import requests
import csv
import re

# Making get request
r = requests.get('https://m.the-numbers.com/movie/Black-Panther')
# Creating BeautifulSoup object
soup = BeautifulSoup(r.text, 'lxml')
# Localizing table from the BS object
table_soup = soup.find('div', class_='row').find('div', class_='table-responsive').find('table', id='movie_finances')
website = 'https://m.the-numbers.com/'
details = []
# Iterating through all trs in the table except the first (header) and the last two (summary) rows
for tr in table_soup.find_all('tr')[1:6]:
    tds = tr.find_all('td')
    title = tds[0].text.strip()
    # Make sure that Home Market Performance doesn't check the second one
    if title != 'Home Market Performance':
        details.append({
            'title': title,
            'amount': tds[1].text.strip(),
        })

summary_soup = soup.find('div', id='summary').find('div', class_='table-responsive').find('table', class_='table table-sm')
summaryList = []
for tr in summary_soup.find_all('tr')[1:4]:
    tdmd = tr.find_all('td')
    summaryList.append({
        'unit': tdmd[1].text.strip(),
    })
```
```python
from bs4 import BeautifulSoup
import requests
import csv
import re

r = requests.get('https://m.the-numbers.com/movie/Black-Panther')
soup = BeautifulSoup(r.text, 'lxml')

# If the table has a unique id, you can access it directly through that id
table_soup = soup.find('table', id='movie_finances')
rows = table_soup.find_all('tr')
# The call above returns the 4 tr tags in the body. No need to use slicing.

summary_soup = soup.find('div', id="summary").find('div').find_all('div')[3].find('table')
# Locate the label cell by its text (\xa0 is a non-breaking space), then read both cells of its row
prod_budget_tag, value_tag = summary_soup.find('td', text=u"Production\xa0Budget:").find_parent('tr').find_all('td')
print(prod_budget_tag.text, value_tag.text)
```
Similarly, you can get the other fields and values from the table.
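If the table you want has no unique id, one possible way to "start after a specific phrase" is to find the text node containing the phrase and call `find_next` from it. The sketch below is only an illustration: `SAMPLE_HTML` is made-up markup (not the real page), `table_after_phrase` and `rows_as_dict` are hypothetical helper names, and it uses the built-in `html.parser` so it runs without network access or lxml.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup standing in for a page whose sections reuse the same classes.
SAMPLE_HTML = """
<div class="row"><h2>Cast</h2>
  <div class="table-responsive"><table>
    <tr><td>Lead</td><td>Someone</td></tr>
  </table></div>
</div>
<div class="row"><h2>Movie Details</h2>
  <div class="table-responsive"><table>
    <tr><td>Production&nbsp;Budget:</td><td>$100</td></tr>
    <tr><td>Running Time:</td><td>120 minutes</td></tr>
  </table></div>
</div>
"""


def table_after_phrase(soup, phrase):
    """Return the first <table> that follows the text node containing `phrase`."""
    marker = soup.find(string=re.compile(re.escape(phrase)))
    if marker is None:
        return None
    # find_next searches forward in document order, starting after the marker.
    return marker.find_next('table')


def rows_as_dict(table):
    """Map each row's first cell (the label) to its second cell (the value)."""
    result = {}
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        if len(tds) >= 2:
            # Replace non-breaking spaces and drop the trailing colon from labels.
            label = tds[0].get_text(strip=True).replace('\xa0', ' ').rstrip(':')
            result[label] = tds[1].get_text(strip=True)
    return result


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
details = rows_as_dict(table_after_phrase(soup, 'Movie Details'))
print(details)
```

On the real page you would pass the `soup` built from `r.text` and a phrase that appears just before the section you need. Normalising `\xa0` in the labels means you can look up a field such as the production budget with an ordinary space in the key.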