
Using Beautiful Soup to extract a string in a <div> tag?

I am fairly new to bs4, but I'm trying to scrape a small chunk of information from a site. It keeps printing "None", as if the title, or any other tag I substitute for it, doesn't exist.

The project consists of two parts:

  • the looping part, which seems to be pretty straightforward.
  • the parser part, where I have some issues (see below).

I'm trying to loop through an array of URLs and scrape the data below from a list of WordPress plugins. See my loop below:

from bs4 import BeautifulSoup
import requests
# array of URLs to loop through; will be larger once I get the loop working correctly
plugins = ['https://wordpress.org/plugins/wp-job-manager', 'https://wordpress.org/plugins/ninja-forms']

The project: collect status data for a list of WordPress plugins. Approximately 50 plugins are of interest:

https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database
...and so on.
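
The looping part could be sketched roughly as follows. This is only a minimal sketch, not the asker's code: `fetch_html` and `scrape_all` are hypothetical helper names, the 10-second timeout is an arbitrary assumption, and `parse` stands for whatever function extracts the data from one page's HTML. It uses `urllib.request.urlopen` from the standard library; `requests.get(url).content` would work in the same place.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download one page and return its raw HTML bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def scrape_all(urls, parse, fetch=fetch_html):
    """Call parse(html) for every URL; store None when a download fails."""
    results = {}
    for url in urls:
        try:
            results[url] = parse(fetch(url))
        except OSError as exc:  # URLError and timeouts are OSError subclasses
            print(f"skipping {url}: {exc}")
            results[url] = None
    return results
```

Passing `fetch` as a parameter keeps the loop testable without network access, and one unreachable plugin page does not abort the run over the other URLs.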

The parser part: this is my approach with Beautiful Soup to extract the string in a tag:

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = "https://wordpress.org/plugins/participants-database/"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")

ttt = page_soup.find("div", {"class":"#post-15991 > div.entry-meta > div.widget.plugin-meta"})
item = ttt.a.text
print(item)
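
A note on why the attempt above finds nothing: `find` compares the given string against the element's class names, so a CSS selector passed as a class value never matches. `select_one` is the method that interprets CSS selectors. The sketch below illustrates the difference on a small inline document; `SAMPLE_HTML` is a hypothetical, simplified stand-in for the plugin page, and the real markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the plugin page markup.
SAMPLE_HTML = """
<article id="post-15991">
  <div class="entry-meta">
    <div class="widget plugin-meta">
      <ul>
        <li>Version: <strong>1.29.3</strong></li>
        <li>Active installations: <strong>100,000+</strong></li>
        <li>Tested up to: <strong>4.9.4</strong></li>
      </ul>
    </div>
  </div>
</article>
"""

SELECTOR = "#post-15991 > div.entry-meta > div.widget.plugin-meta"
page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() treats the string as a class value, so the selector matches nothing.
by_find = page_soup.find("div", {"class": SELECTOR})

# select_one() parses the same string as a CSS selector.
by_select = page_soup.select_one(SELECTOR)

# A single class name works with find() as well.
by_class = page_soup.find("div", class_="plugin-meta")

items = [li.text.strip() for li in by_select.find_all("li")]
print(by_find, items)
```

Since the `post-15991` id is specific to one plugin, matching on the `plugin-meta` class is probably the better choice when looping over many plugin pages.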

Background: I want to fetch the following data from this page:

https://wordpress.org/plugins/participants-database/

I need the data in the following three lines of the above-mentioned example:

Version: <strong>1.29.3</strong>
Active installations: <strong>100,000+</strong>
Tested up to: <strong>4.9.4</strong>

See the XPaths that I have found here:

//*[@id="post-15991"]/div[4]/div[1]

//*[@id="post-15991"]/div[4]/div[1]/ul/li[1]
//*[@id="post-15991"]/div[4]/div[1]/ul/li[2]
//*[@id="post-15991"]/div[4]/div[1]/ul/li[3]
//*[@id="post-15991"]/div[4]/div[1]/ul/li[4]
//*[@id="post-15991"]/div[4]/div[1]/ul/li[5]
//*[@id="post-15991"]/div[4]/div[1]/ul/li[6]

You can get the required values simply as:

ttt = page_soup.find("div", {"class":"plugin-meta"})
text_nodes = [node.text.strip() for node in ttt.ul.findChildren('li')[:-1:2]]

Output of text_nodes:

['Version: 1.7.7.7', 'Active installations: 10,000+', 'Tested up to: 4.9.4']
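
If the values are needed separately from their labels, the strings in `text_nodes` can be split on the first colon. This is a small sketch that goes beyond the answer above; `to_fields` is a hypothetical helper name, and it assumes every entry has the form "Label: value".

```python
def to_fields(text_nodes):
    """Split 'Label: value' strings into a {label: value} dict."""
    fields = {}
    for node in text_nodes:
        label, sep, value = node.partition(":")
        if sep:  # skip entries that contain no colon
            fields[label.strip()] = value.strip()
    return fields


text_nodes = ['Version: 1.7.7.7', 'Active installations: 10,000+', 'Tested up to: 4.9.4']
fields = to_fields(text_nodes)
print(fields["Version"])
```

Looking values up by label, for example `fields["Tested up to"]`, also avoids depending on the position of each `<li>` in the list.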
