
Find elements by text with Beautifulsoup

I'm just learning Python and I've spent hours trying to figure this one thing out. Basically, I have an HTML doc with a repetitive structure and I am trying to pull out certain elements from each repetition. I figured out how to pull out the first element, but I cannot for the life of me figure out how to pull out any of the others. The first one was easy because it has a distinct class, but the rest don't. Please help before I go insane.

The following is the repetitive section of HTML. I want to pull out the first header, which I was able to do. I also want to get the text that follows the "Synopsis" and "Risk Factor" headers.

 <h2 xmlns="" class="classsection4" id="idp201558400">50044 (1) - Ubuntu 6.06 LTS / 8.04 LTS / 9.04 / 9.10 / 10.04 LTS / 10.10 : linux, linux-ec2, linux-source-2.6.15 vulnerabilities (USN-1000-1)</h2> <h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;"> <![endif]]]-->Synopsis</h2> <span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">The remote Ubuntu host is missing one or more security-related patches.</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;"> <![endif]]]-->Description</h2> <span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">This is some description text. (CVE-2010-NNN2).</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;"> <![endif]]]-->Solution</h2> <span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">Update the affected packages.</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;"> <![endif]]]-->Risk Factor</h2> <span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">Critical</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;"> <![endif]]]-->CVSS Base Score</h2> <span xmlns="" class="classtext" style="color: #263645; font-weight: normal;">10.0 (CVSS2#AV:N/AC:L/Au:N/C:C/I:C/A:C)</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;"> <![endif]]]-->CVSS Temporal Score</h2> <span 
xmlns="" class="classtext" style="color: #263645; font-weight: normal;">8.7 (CVSS2#E:ND/RL:OF/RC:ND)</span><h2 xmlns="" class="classh1 " style="vertical-align: middle;"><!--[if mso]><img src="cid:#" width="1" height="25" border="0" style="display: block; float: left;"> 

Here is my current code:

import requests
from bs4 import BeautifulSoup
import urllib
import re

page = open("C:/Users/AlphaWP/Downloads/631_SupportingFiles4_Labs6-7/Nessus Vulnerability Scan.htm").read()

soup = BeautifulSoup(page, "html.parser")

for section in soup.findAll("h2",{"class":"classsection4"}):
    # nextNode = section
    # print(nextNode.name)
    # print(section)
    print(section.contents)
    print("##############################")
    # print(section.contents)
    for section1 in soup.findAll('h2', text=re.compile(r'Risk')):
        print(section1)
        riskFactor = section1.find("span")
        riskLevel = riskFactor.contents
        print(riskLevel)
    print("##############################")

To get all the span elements use:

spans = soup.find_all('span', {'class': 'classtext'})

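spans is now a list of all span elements with class classtext, in document order. In the snippet from the question the order is Synopsis, Description, Solution, Risk Factor, so the Synopsis span is at index 0 and the Risk Factor span is at index 3: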

>>> spans[0]
<span class="classtext" style="color: #263645; font-weight: normal;" xmlns="">The remote Ubuntu host is missing one or more security-related patches.</span>
>>> spans[3]
<span class="classtext" style="color: #263645; font-weight: normal;" xmlns="">Critical</span>
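
Fixed indices only hold for the first repetition, so for a whole report it may be more robust to look each value up by its header text. Note that a text filter such as text=re.compile(r'Risk') matches against a tag's .string, which is None for these h2 elements because each one contains a conditional comment in addition to its label. The span is also a sibling of the h2, not a child, so section1.find("span") returns None. The following is a minimal sketch under those observations, not a definitive solution: the function name extract_findings and its wanted parameter are hypothetical, and the sample markup is trimmed from the question's HTML on the assumption that the real file repeats the same pattern.

```python
from bs4 import BeautifulSoup

# Sample trimmed from the HTML in the question (style attributes dropped).
# Assumption: the real report repeats this pattern once per finding.
html = """
<h2 xmlns="" class="classsection4" id="idp201558400">50044 (1) - Ubuntu 6.06 LTS / 8.04 LTS / 9.04 / 9.10 / 10.04 LTS / 10.10 : linux, linux-ec2, linux-source-2.6.15 vulnerabilities (USN-1000-1)</h2>
<h2 xmlns="" class="classh1 "><!--[if mso]><img src="cid:#" width="1" height="25" border="0"> <![endif]]]-->Synopsis</h2>
<span xmlns="" class="classtext">The remote Ubuntu host is missing one or more security-related patches.</span>
<h2 xmlns="" class="classh1 "><!--[if mso]><img src="cid:#" width="1" height="25" border="0"> <![endif]]]-->Risk Factor</h2>
<span xmlns="" class="classtext">Critical</span>
"""


def extract_findings(markup, wanted=("Synopsis", "Risk Factor")):
    """Return one dict per 'classsection4' header (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    findings = []
    for title in soup.find_all("h2", class_="classsection4"):
        record = {"title": title.get_text(strip=True)}
        # Walk the h2 siblings that follow this title.
        for header in title.find_next_siblings("h2"):
            if "classsection4" in (header.get("class") or []):
                break  # reached the next finding
            # get_text() skips the conditional comment, leaving only the label.
            label = header.get_text(strip=True)
            if label not in wanted:
                continue
            value = header.find_next_sibling("span", class_="classtext")
            # Accept the span only if this header is the closest h2 before it.
            if value is not None and value.find_previous_sibling("h2") is header:
                record[label] = value.get_text(strip=True)
        findings.append(record)
    return findings


for finding in extract_findings(html):
    print(finding["title"])
    print("Synopsis:", finding.get("Synopsis"))
    print("Risk Factor:", finding.get("Risk Factor"))
    print("##############################")
```

To run it against the actual file, pass the string read from the .htm file in place of html. Using finding.get(...) means a finding without one of the headers prints None instead of raising a KeyError.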
