
Extracting href using bs4/python3?

I'm new to Python and bs4, so please go easy on me.

#!/usr/bin/python3
import bs4 as bs
import urllib.request
import time, datetime, os, requests, lxml.html
import re
from fake_useragent import UserAgent

url = "https://www.cvedetails.com/vulnerability-list.php"
ua = UserAgent()
header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
snkr = requests.get(url,headers=header)
soup = bs.BeautifulSoup(snkr.content,'lxml')

for item in soup.find_all('tr', class_="srrowns"):
    print(item.td.next_sibling.next_sibling.a)

prints:

<a href="/cve/CVE-2017-6712/" title="CVE-2017-6712 security vulnerability details">CVE-2017-6712</a>
<a href="/cve/CVE-2017-6708/" title="CVE-2017-6708 security vulnerability details">CVE-2017-6708</a>
<a href="/cve/CVE-2017-6707/" title="CVE-2017-6707 security vulnerability details">CVE-2017-6707</a>
<a href="/cve/CVE-2017-1269/" title="CVE-2017-1269 security vulnerability details">CVE-2017-1269</a>
<a href="/cve/CVE-2017-0711/" title="CVE-2017-0711 security vulnerability details">CVE-2017-0711</a>
<a href="/cve/CVE-2017-0706/" title="CVE-2017-0706 security vulnerability details">CVE-2017-0706</a>

Can't figure out how to extract the /cve/CVE-2017-XXXX/ parts. Perhaps I've gone about it the wrong way. I don't need the titles or HTML, just the URIs.

BeautifulSoup has accumulated many historical variants for filtering and for fetching things, some more awkward than others. I ignore most of them because it's confusing otherwise.

For attributes I prefer get(), so here item.td.next_sibling.next_sibling.a.get('href'), because it returns None if the attribute is missing, instead of raising an exception.
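The difference is easy to see on a small, hypothetical snippet mirroring one row of the CVE table (the markup here is illustrative, not copied from the site):

```python
import bs4 as bs

# Hypothetical row resembling the page's table markup.
html = '<tr class="srrowns"><td>1</td><td><a href="/cve/CVE-2017-6712/">CVE-2017-6712</a></td></tr>'
soup = bs.BeautifulSoup(html, 'html.parser')

a = soup.find('a')
print(a.get('href'))   # -> /cve/CVE-2017-6712/
print(a.get('title'))  # -> None: the attribute is absent, no exception
# a['title'] would raise KeyError instead of returning None
```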

I also faced the same issue in the early stages of web scraping; I'm sharing what I learned here.

Good attempt. For your specific requirement, please try the code below:

#!/usr/bin/python3
import bs4 as bs
import urllib.request
import time, datetime, os, requests, lxml.html
import re
from fake_useragent import UserAgent

url = "https://www.cvedetails.com/vulnerability-list.php"
ua = UserAgent()
header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
snkr = requests.get(url,headers=header)
soup = bs.BeautifulSoup(snkr.content,'lxml')

for item in soup.find_all('tr', class_="srrowns"):
    # prints just the /cve/CVE-2017-XXXX/ part
    print(item.td.next_sibling.next_sibling.a.get('href'))
    # prints the full working link
    print('https://www.cvedetails.com' + item.td.next_sibling.next_sibling.a.get('href'))

Or, if you need to collect every href link in one run, try this:

links = []
for item in soup.find_all('a', href=True):
    # prefix the site root so the relative paths become working links
    links.append('https://www.cvedetails.com' + item['href'])
print(links)
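Rather than hard-coding the domain prefix, the standard library's urllib.parse.urljoin resolves relative paths against the page URL, which also handles hrefs that are already absolute. A small sketch using hypothetical paths like the ones the scrape returns:

```python
from urllib.parse import urljoin

base = 'https://www.cvedetails.com/vulnerability-list.php'
# Hypothetical relative hrefs of the kind the table yields.
paths = ['/cve/CVE-2017-6712/', '/cve/CVE-2017-6708/']
links = [urljoin(base, p) for p in paths]
print(links[0])  # -> https://www.cvedetails.com/cve/CVE-2017-6712/
```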

Also, don't use fake_useragent; use pyuser_agent instead. fake_useragent currently doesn't work on Linux desktops or servers. It isn't supported there at the moment, though a future update may add support.
