
Error in Web Scraping Code Using BeautifulSoup

I want to get the data from https://www.cvedetails.com/vulnerability-list/vendor_id-26/product_id-32238/Microsoft-Windows-10.html, from page 1 through the last page, sorted ascending by "CVE Number". The data I want to retrieve, in CSV format, is everything in the table headers and the table cells.

I have been trying some code, but it doesn't seem to work, and I'm getting a bit desperate at this point.

This is where I tried to learn from: https://youtu.be/XQgXKtPSzUI

Any help would be appreciated.

I have asked this question before, and the replies I got were great, but they didn't seem to get me what I need. I'm still confused about how this works, all the more so because of how strange this site's source code is.

#!/usr/bin/env python3
import bs4 # Good HTML parser
from urllib.request import urlopen as uReq # Helps with opening URL
from bs4 import BeautifulSoup as soup

# The target URL
my_url = 'https://www.cvedetails.com/vulnerability-list.php?vendor_id=26&product_id=32238&version_id=&page=1&hasexp=0&opdos=0&opec=0&opov=0&opcsrf=0&opgpriv=0&opsqli=0&opxss=0&opdirt=0&opmemc=0&ophttprs=0&opbyp=0&opfileinc=0&opginf=0&cvssscoremin=0&cvssscoremax=0&year=0&month=0&cweid=0&order=2&trc=851&sha=41e451b72c2e412c0a1cb8cb1dcfee3d16d51c44'

# Check process
# print(my_url)

# Open a connection and download the webpage
uClient = uReq(my_url)

# Save the webpage into a variable
page_html = uClient.read()

# Close the connection opened by uReq
uClient.close()

# Parse the HTML with html.parser and save the result
page_soup = soup(page_html,"html.parser")

print(page_soup.h1)

Here is the error:

Traceback (most recent call last):
  File "./Testing3.py", line 21, in <module>
    uClient = uReq(my_url)
  File "/usr/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/usr/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

To avoid this error, you need to supply a User-Agent through the headers of your request.

Try modifying your script to:

#!/usr/bin/env python3
import bs4
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup

#bs4 is a good html parser
#urllib.request helps with opening the url


#setting the target url
my_url = 'https://www.cvedetails.com/vulnerability-list.php?vendor_id=26&product_id=32238&version_id=&page=1&hasexp=0&opdos=0&opec=0&opov=0&opcsrf=0&opgpriv=0&opsqli=0&opxss=0&opdirt=0&opmemc=0&ophttprs=0&opbyp=0&opfileinc=0&opginf=0&cvssscoremin=0&cvssscoremax=0&year=0&month=0&cweid=0&order=2&trc=851&sha=41e451b72c2e412c0a1cb8cb1dcfee3d16d51c44'

hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(my_url,headers=hdr)
page = uReq(req)
page_soup = soup(page, "html.parser")

print(page_soup.h1)
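The 403 in the traceback comes from the site rejecting urllib's default User-Agent (Python-urllib/3.x); any browser-like string, as above, is normally enough to get past that check. Passing "html.parser" to BeautifulSoup, as in the original script, also avoids the "no parser was explicitly specified" warning and keeps the parse consistent across machines.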

Instead of urllib, why not just use the requests module? Try this code:

import requests
from bs4 import BeautifulSoup as soup
my_url = 'https://www.cvedetails.com/vulnerability-list.php?vendor_id=26&product_id=32238&version_id=&page=1&hasexp=0&opdos=0&opec=0&opov=0&opcsrf=0&opgpriv=0&opsqli=0&opxss=0&opdirt=0&opmemc=0&ophttprs=0&opbyp=0&opfileinc=0&opginf=0&cvssscoremin=0&cvssscoremax=0&year=0&month=0&cweid=0&order=2&trc=851&sha=41e451b72c2e412c0a1cb8cb1dcfee3d16d51c44'

page_html = requests.get(my_url).text

page_soup = soup(page_html,"html.parser")

print(page_soup.h1)

Output:

<h1>
<a href="//www.cvedetails.com/vendor/26/Microsoft.html" title="Details for Microsoft">Microsoft</a> » <a href="//www.cvedetails.com/product/32238/Microsoft-Windows-10.html?vendor_id=26" title="Product Details Microsoft Windows 10">Windows 10</a> : Security Vulnerabilities
</h1>
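
Neither answer gets as far as the original goal of walking every results page and saving the table as CSV. A minimal sketch of that, building on the requests-based answer above, might look like the following; the choice of the first <table> on the page, the page count of 18, and the output filename windows10_cves.csv are assumptions you should verify against the live site's markup.

import csv
import requests
from bs4 import BeautifulSoup

# page= is templated so the same query can be repeated for every results page
URL_TEMPLATE = ('https://www.cvedetails.com/vulnerability-list.php?vendor_id=26'
                '&product_id=32238&version_id=&page={page}&hasexp=0&opdos=0&opec=0'
                '&opov=0&opcsrf=0&opgpriv=0&opsqli=0&opxss=0&opdirt=0&opmemc=0'
                '&ophttprs=0&opbyp=0&opfileinc=0&opginf=0&cvssscoremin=0'
                '&cvssscoremax=0&year=0&month=0&cweid=0&order=2&trc=851'
                '&sha=41e451b72c2e412c0a1cb8cb1dcfee3d16d51c44')

headers = {'User-Agent': 'Mozilla/5.0'}

with open('windows10_cves.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    wrote_header = False

    # Assumption: roughly 18 result pages; adjust the range to the real page count
    for page in range(1, 19):
        html = requests.get(URL_TEMPLATE.format(page=page), headers=headers).text
        page_soup = BeautifulSoup(html, 'html.parser')

        # Assumption: the vulnerability listing is the first <table> on the page;
        # inspect the real markup and use a more specific selector if it is not
        table = page_soup.find('table')
        if table is None:
            break

        rows = table.find_all('tr')

        # Write the table header once, taken from the first row
        if not wrote_header:
            writer.writerow([cell.get_text(strip=True)
                             for cell in rows[0].find_all(['th', 'td'])])
            wrote_header = True

        # Write every data row that actually contains cells
        for row in rows[1:]:
            cells = [td.get_text(strip=True) for td in row.find_all('td')]
            if cells:
                writer.writerow(cells)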
