
Extract list of values using BeautifulSoup

I'm trying to automate the parsing of Ubuntu's Security Notices using their RSS feed. I'm using feedparser, and that part works fine: I can get the title (feed.title) of the advisory, the relevant link (feed.link) to it, and so on.

What I'm now trying to do is parse each entry further to grab the affected versions and store them for later reference.

The following code grabs the feed and gets it ready for parsing. It also uses BeautifulSoup to parse feed.summary, which seems to be the 'placeholder' that contains the info I'm after.

import feedparser
from bs4 import BeautifulSoup

ubuntu_url = 'https://usn.ubuntu.com/rss.xml'

feed = feedparser.parse(ubuntu_url)

for post in feed.entries:
    soup = BeautifulSoup(post.summary, 'html.parser')

If I add print(soup.prettify()), I can see the information I'm after in this section (which is part of a much larger output with several other list elements):

<p>A security issue affects these releases of Ubuntu and its derivatives:</p>

<ul>
<li>Ubuntu 18.04 LTS</li>
<li>Ubuntu 17.10</li>
<li>Ubuntu 16.04 LTS</li>
<li>Ubuntu 14.04 LTS</li>
</ul>

The list varies in length, from a single version upwards, as this other example shows:

<p>A security issue affects these releases of Ubuntu and its derivatives:</p>

<ul>
<li>Ubuntu 18.04 LTS</li>
</ul>

I've been trying to figure out how to use BeautifulSoup to parse this and grab only the entries within the <ul> </ul> section that follows the 'A security issue affects these releases of Ubuntu and its derivatives:' heading.

I've been looking through the documentation for the correct way of using the 'find_all' functionality but haven't managed to get the puzzle together at this stage.

Any ideas out there?

Thanks in advance.

Find the paragraph by its text, then read the <ul> that follows it.

Demo:

from bs4 import BeautifulSoup
s = """<p>A security issue affects these releases of Ubuntu and its derivatives:</p>
<ul>
<li>Ubuntu 18.04 LTS</li>
<li>Ubuntu 17.10</li>
<li>Ubuntu 16.04 LTS</li>
<li>Ubuntu 14.04 LTS</li>
</ul>"""

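soup = BeautifulSoup(s, "html.parser")
# Locate the heading paragraph by its exact text
# (string= is the current name of the older text= argument).
p_tag = soup.find("p", string="A security issue affects these releases of Ubuntu and its derivatives:")
# Take the first <ul> that follows the paragraph and print each item's text.
for li in p_tag.find_next_sibling("ul").find_all("li"):
    print(li.text)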

Output:

Ubuntu 18.04 LTS
Ubuntu 17.10
Ubuntu 16.04 LTS
Ubuntu 14.04 LTS
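
The same lookup can be wrapped in a small helper for use inside the feedparser loop. This is only a sketch: the name affected_releases and the HEADING constant are made up for illustration, the extra "unrelated" list in the sample is invented, and matching on a prefix of the paragraph text (rather than the exact sentence) is an assumption about how stable the wording is. It returns an empty list when the heading or the list is missing, instead of raising on None.

```python
from bs4 import BeautifulSoup

# Assumed prefix of the heading paragraph; adjust if the feed wording differs.
HEADING = "A security issue affects these releases of Ubuntu"


def affected_releases(summary_html):
    """Return the <li> texts of the first <ul> after the heading paragraph."""
    soup = BeautifulSoup(summary_html, "html.parser")
    # Find the first <p> whose text starts with the heading prefix.
    p_tag = soup.find(
        lambda tag: tag.name == "p"
        and tag.get_text(strip=True).startswith(HEADING)
    )
    if p_tag is None:
        return []
    # The release list is the next <ul> sibling of that paragraph.
    ul_tag = p_tag.find_next_sibling("ul")
    if ul_tag is None:
        return []
    return [li.get_text(strip=True) for li in ul_tag.find_all("li")]


# Sample input with an invented, unrelated list ahead of the release list.
sample = """<p>Some unrelated list:</p>
<ul>
<li>not a release</li>
</ul>
<p>A security issue affects these releases of Ubuntu and its derivatives:</p>
<ul>
<li>Ubuntu 18.04 LTS</li>
<li>Ubuntu 16.04 LTS</li>
</ul>"""

print(affected_releases(sample))
```

Inside the loop from the question this would be called as affected_releases(post.summary), and the returned list can be stored alongside the title and link of the entry.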

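If you just want every list item in the summary, use find_all. Note that this returns the <li> items of every list in the summary, not only the release list.

a = soup.find_all('li')
for b in a:
    # b is a Tag; print its text rather than the raw <li> markup
    print(b.text)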
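
A small sketch of what a whole-document search returns when the summary holds more than one list, as the question says it does. The "Update instructions" list and its item are invented for illustration, and filtering on the "Ubuntu " prefix is only a heuristic that assumes no other list item starts with that word.

```python
from bs4 import BeautifulSoup

# Hypothetical summary with an extra list before the release list.
html = """<p>Update instructions:</p>
<ul>
<li>some-package 1.0</li>
</ul>
<p>A security issue affects these releases of Ubuntu and its derivatives:</p>
<ul>
<li>Ubuntu 18.04 LTS</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

# find_all('li') walks the whole document, so both lists contribute items.
all_items = [li.get_text() for li in soup.find_all("li")]
print(all_items)

# Heuristic filter: keep only items that look like release names.
releases = [text for text in all_items if text.startswith("Ubuntu ")]
print(releases)
```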
