
Extracting href using bs4/python3?

I'm new to Python and bs4, so please go easy on me.

#!/usr/bin/python3
import bs4 as bs
import urllib.request
import time, datetime, os, requests, lxml.html
import re
from fake_useragent import UserAgent

url = "https://www.cvedetails.com/vulnerability-list.php"
ua = UserAgent()
header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
snkr = requests.get(url,headers=header)
soup = bs.BeautifulSoup(snkr.content,'lxml')

for item in soup.find_all('tr', class_="srrowns"):
    print(item.td.next_sibling.next_sibling.a)

prints:

<a href="/cve/CVE-2017-6712/" title="CVE-2017-6712 security vulnerability details">CVE-2017-6712</a>
<a href="/cve/CVE-2017-6708/" title="CVE-2017-6708 security vulnerability details">CVE-2017-6708</a>
<a href="/cve/CVE-2017-6707/" title="CVE-2017-6707 security vulnerability details">CVE-2017-6707</a>
<a href="/cve/CVE-2017-1269/" title="CVE-2017-1269 security vulnerability details">CVE-2017-1269</a>
<a href="/cve/CVE-2017-0711/" title="CVE-2017-0711 security vulnerability details">CVE-2017-0711</a>
<a href="/cve/CVE-2017-0706/" title="CVE-2017-0706 security vulnerability details">CVE-2017-0706</a>

Can't figure out how to extract the /cve/CVE-2017-XXXX/ parts. Perhaps I've gone about it the wrong way. I don't need the titles or HTML, just the URIs.

BeautifulSoup has accumulated many historical variants for filtering and for fetching things, some more awkward than others. I ignore most of them because it's confusing otherwise.

For attributes I prefer get(), so here item.td.next_sibling.next_sibling.a.get('href'), because it returns None if the attribute is missing, instead of raising an exception.
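The difference is easy to see on a small, hypothetical snippet mirroring one row of the CVE table (the markup here is illustrative, not copied from the site):

```python
import bs4 as bs

# Hypothetical row resembling the page's table markup.
html = '<tr class="srrowns"><td>1</td><td><a href="/cve/CVE-2017-6712/">CVE-2017-6712</a></td></tr>'
soup = bs.BeautifulSoup(html, 'html.parser')

a = soup.find('a')
print(a.get('href'))   # -> /cve/CVE-2017-6712/
print(a.get('title'))  # -> None: the attribute is absent, no exception
# a['title'] would raise KeyError instead of returning None
```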

I also faced the same issue in the early stages of web scraping; I'm sharing what I learned here.

Good attempt. For your specific requirement, please try the code below:

#!/usr/bin/python3
import bs4 as bs
import urllib.request
import time, datetime, os, requests, lxml.html
import re
from fake_useragent import UserAgent

url = "https://www.cvedetails.com/vulnerability-list.php"
ua = UserAgent()
header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
snkr = requests.get(url,headers=header)
soup = bs.BeautifulSoup(snkr.content,'lxml')

for item in soup.find_all('tr', class_="srrowns"):
    # prints just the /cve/CVE-2017-XXXX/ part
    print(item.td.next_sibling.next_sibling.a.get('href'))
    # prints the full working link
    print('https://www.cvedetails.com' + item.td.next_sibling.next_sibling.a.get('href'))

Or, if you need to collect every href link in one run, try this:

links = []
for item in soup.find_all('a', href=True):
    # prefix the site root so the relative paths become working links
    links.append('https://www.cvedetails.com' + item['href'])
print(links)
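Rather than hard-coding the domain prefix, the standard library's urllib.parse.urljoin resolves relative paths against the page URL, which also handles hrefs that are already absolute. A small sketch using hypothetical paths like the ones the scrape returns:

```python
from urllib.parse import urljoin

base = 'https://www.cvedetails.com/vulnerability-list.php'
# Hypothetical relative hrefs of the kind the table yields.
paths = ['/cve/CVE-2017-6712/', '/cve/CVE-2017-6708/']
links = [urljoin(base, p) for p in paths]
print(links[0])  # -> https://www.cvedetails.com/cve/CVE-2017-6712/
```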

Also, don't use fake_useragent; use pyuser_agent instead. fake_useragent currently doesn't work on Linux desktops or servers. It isn't supported there at the moment, though a future update may add support.
