
web scraper is being denied by the website even after implementing a user-agent

I'm currently creating a web crawler to gather data from a website for a school project. The issue is that I'm getting the following error code (only from this one webpage):

<h1>You are viewing this page in an unauthorized frame window.</h1>
0
[Finished in 5.4s]

Here is the full code:

#Creating my own webcrawler

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import urllib.request


myurl = 'https://nvd.nist.gov/vuln/data-feeds'
myReq = (myurl)

req = urllib.request.Request(
    myurl, 
    data=None, 
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
) 

#opening my connection, grabbing the page
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, 'html.parser')

print(page_soup.h1)

containers = page_soup.findAll('td rowspan="1"',{'class':'x-hidden-focus'})
print(len(containers))

As you can see, I even added a user-agent, but I'm still getting this error message. Any help is appreciated!

I believe the first parameter to the 'findAll' method won't help you, so the issue might have nothing to do with the HTTP request-response cycle.
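That said, one thing worth noting about the question's code: the `Request` object carrying the `User-Agent` header is built but never passed to `uReq`; the bare URL string is opened instead, so the custom header is never actually sent. A minimal sketch of wiring it through (the network call itself is left commented out):

```python
import urllib.request

myurl = 'https://nvd.nist.gov/vuln/data-feeds'

req = urllib.request.Request(
    myurl,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/35.0.1916.47 Safari/537.36'
    }
)

# Pass the Request object, not the bare URL string,
# so the User-Agent header is attached to the request:
# uClient = urllib.request.urlopen(req)

# Confirm the header is set on the request object
print(req.get_header('User-agent'))
```

Whether the site accepts the request is a separate question, but without this change the header added in the question has no effect at all.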

I queried the URL you're using, and the complete set of attributes across all 'td' elements in the document is:

{'class': ['xml-file-size', 'file-20']}
{'class': ['xml-file-type', 'file-20']}
{'colspan': '2', 'class': ['xml-file-type', 'file-20']}
{'rowspan': '3'}
{'colspan': '2'}
{}

This is why querying for a 'rowspan' of 1 and a 'class' of 'x-hidden-focus' returns an empty list.

Try, on the second-to-last line:

containers = page_soup.findAll('td', {'class': 'file-20'})

or:

containers = page_soup.findAll('td', {'rowspan': '3'})

or even just:

containers = page_soup.findAll('td')

It's up to you which specific 'td' elements you're looking for.

Check out the documentation to learn more ways to use BeautifulSoup, including passing functions as arguments, etc.
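As a quick illustration of the attribute-dict filtering discussed above, here is a self-contained sketch against a small hypothetical HTML snippet (not the real NVD page) that mirrors the attribute sets listed earlier:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the 'td' attribute sets found on the page
html = """
<table>
  <tr>
    <td class="xml-file-type file-20">CVE feed</td>
    <td class="xml-file-size file-20">1.2 MB</td>
    <td rowspan="3">Downloads</td>
  </tr>
</table>
"""
page_soup = BeautifulSoup(html, 'html.parser')

# No 'td' carries class 'x-hidden-focus', so this is an empty list
print(len(page_soup.findAll('td', {'class': 'x-hidden-focus'})))  # 0

# Matches the two cells whose class list contains 'file-20'
print(len(page_soup.findAll('td', {'class': 'file-20'})))  # 2

# Matches the single cell with rowspan="3"
print(len(page_soup.findAll('td', {'rowspan': '3'})))  # 1
```

Note that for multi-valued attributes like `class`, BeautifulSoup matches if the value you pass is among the element's classes, which is why `'file-20'` matches cells whose full class list is `['xml-file-type', 'file-20']`.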
