
How do you extract a body paragraph of text through BeautifulSoup?

I am trying to extract text from websites using BeautifulSoup, but I am willing to explore other options. Currently I am using something like this:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

boston_url = 'https://www.mass.gov/service-details/request-for-proposal-rfp-notices'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(boston_url,headers=hdr)
webpage = urlopen(req)
htmlText = webpage.read().decode('utf-8')
pageText = BeautifulSoup(htmlText, "html.parser")
body = pageText.find_all(text=True)

The goal is to figure out how to extract the text in the red box. You can see the output I get in the CMD screenshot. It is very messy, and I'm not sure how to find body paragraphs of text in it. I could loop over the output and look for certain words, but I need to do this for multiple sites and I won't know what's in the body paragraph.


It's probably simpler than you are making it. A CSS selector can target the paragraph directly:

import requests
from bs4 import BeautifulSoup as bs

boston_url = 'https://www.mass.gov/service-details/request-for-proposal-rfp-notices'
hdr = {'User-Agent': 'Mozilla/5.0'}
resp = requests.get(boston_url, headers=hdr)

# The 'lxml' parser needs the lxml package installed
soup = bs(resp.text, 'lxml')
# First <p> that is a direct child of the rich-text div inside the nested <main>
soup.select('main main div.ma__rich-text>p')[0].text

Output:

'PERAC has not reviewed the RFP notices or other related materials posted on this page for compliance with M.G.L. Chapter 32, section 23B. The publication of these notices should not be interpreted as an indication that PERAC has made a determination as to that compliance.'
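To see how that selector picks out the body paragraphs, here is a minimal offline sketch. The HTML below is a made-up stand-in that only imitates the nesting the selector relies on (a main inside a main, and a div with class ma__rich-text); it is not the real page markup. It uses html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: imitates the nesting only, not the real mass.gov markup.
sample_html = """
<main>
  <nav><p>Menu text that should be ignored</p></nav>
  <main>
    <div class="ma__rich-text">
      <p>First body paragraph.</p>
      <p>Second body paragraph.</p>
    </div>
  </main>
</main>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "main main"           -> a <main> nested somewhere inside another <main>
# "div.ma__rich-text>p" -> <p> tags that are direct children of that div
paragraphs = soup.select("main main div.ma__rich-text>p")

# get_text(strip=True) trims surrounding whitespace from each paragraph
body_texts = [p.get_text(strip=True) for p in paragraphs]
print(body_texts)
```

The paragraph inside nav is skipped because it is not a direct child of a div.ma__rich-text. On another site the class names will differ, so the selector would have to be adapted after inspecting that page's HTML.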

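You can also call find('p', text=re.compile('PERAC')) on the parsed document to extract that paragraph by matching its content (since Beautiful Soup 4.4.0 the same argument is also available under the name string):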

from bs4 import BeautifulSoup
import requests
import re

# Browser-like User-Agent header sent with the request
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/83.0.4103.61 Safari/537.36'
}

boston_url = (
    'https://www.mass.gov/service-details/request-for-proposal-rfp-notices'
)

resp = requests.get(boston_url, headers=headers)
# Name the parser explicitly; omitting it makes BeautifulSoup emit a warning
bs = BeautifulSoup(resp.text, 'html.parser')
# First <p> whose text matches the regular expression
bs.find('p', text=re.compile('PERAC'))
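One caveat with matching on text: the filter is compared against each tag's .string, which is None when a paragraph contains other tags (such as a link) mixed in with its text. The sketch below shows that behaviour, plus a possible fallback that tests each paragraph's full text. The HTML is invented for illustration, and it uses the string= spelling of the argument, which assumes Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration only.
sample_html = """
<div>
  <p>Unrelated introduction.</p>
  <p>PERAC has not reviewed the notices.</p>
  <p>Contact <a href="#">PERAC</a> for details.</p>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
pattern = re.compile("PERAC")

# string= only matches a <p> whose content is a single text node,
# so the third paragraph (text plus an <a> tag) is not considered.
direct = soup.find_all("p", string=pattern)
direct_texts = [p.get_text() for p in direct]

# Fallback: search the combined text of every <p>, nested tags included.
all_texts = [
    p.get_text() for p in soup.find_all("p") if pattern.search(p.get_text())
]

print(direct_texts)
print(all_texts)
```

Either way, this approach needs a word known to appear in the target paragraph, so it fits a single known page better than a set of unknown sites.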
