
Article scraping with BeautifulSoup: scraping <p> tags inside <div> tags with ids

I wrote a Python script to pull out particular paragraphs, but I end up getting all the text on the page. I want to scrape only the paragraphs inside a div whose id varies from page to page, e.g.:

<div id="content-body-123123">

The id is different on every page. How can I identify this particular tag and pull out only the paragraphs inside it?

import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.thehindu.com/opinion/op-ed/Does-Beijing-really-want-to-ldquobreak-uprdquo-India/article16875298.ece'
page = requests.get(url)
html = page.content
soup = bs(html, 'html.parser')
# find_all('p') matches every <p> on the page, not just the article body
for tag in soup.find_all('p'):
    print(tag.text + '\n')

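Try this. The selector [id^='content-body-'] p matches every <p> inside any element whose id starts with content-body-, so the changing id number does not affect your result: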

from bs4 import BeautifulSoup
import requests

url = 'http://www.thehindu.com/opinion/op-ed/Does-Beijing-really-want-to-ldquobreak-uprdquo-India/article16875298.ece'
page = requests.get(url)
# The 'lxml' parser needs the lxml package installed; 'html.parser' also works here
soup = BeautifulSoup(page.content, 'lxml')
for content in soup.select("[id^='content-body-'] p"):
    print(content.text)
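
To check the selector without hitting the network, here is a minimal sketch that runs it against a small inline HTML string. The markup in SAMPLE_HTML is hypothetical and only stands in for the real article page; the names via_select and via_find are likewise made up for illustration. It also shows a find() call with a compiled regular expression on the id, which should select the same div.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real article page (an assumption).
SAMPLE_HTML = """
<html><body>
  <p>Site navigation text</p>
  <div id="content-body-123123">
    <p>First article paragraph.</p>
    <p>Second article paragraph.</p>
  </div>
  <div id="footer"><p>Footer text</p></div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS attribute selector: any element whose id starts with "content-body-",
# followed by its descendant <p> tags.
via_select = [p.get_text(strip=True) for p in soup.select('[id^="content-body-"] p')]

# Alternative: find the first div whose id matches a regex, then collect its <p> tags.
body = soup.find("div", id=re.compile(r"^content-body-"))
via_find = [p.get_text(strip=True) for p in body.find_all("p")] if body else []

print(via_select)
print(via_find)
```

Both lists contain only the two paragraphs inside the content-body div; the navigation and footer paragraphs are skipped. Note that find() returns only the first matching div, while select() covers every element with a matching id.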
