Web scraper is being denied by the website even after implementing a user-agent

I'm currently creating a web crawler to gather data from a website for a school project. The issue is that I'm getting the following error message (only from this one webpage):

<h1>You are viewing this page in an unauthorized frame window.</h1>
0
[Finished in 5.4s]

Here is the full code:

#Creating my own webcrawler

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import urllib.request


myurl = 'https://nvd.nist.gov/vuln/data-feeds'
myReq = (myurl)

req = urllib.request.Request(
    myurl, 
    data=None, 
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
) 

#opening my connection, grabbing the page
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, 'html.parser')

print(page_soup.h1)

containers = page_soup.findAll('td rowspan="1"',{'class':'x-hidden-focus'})
print(len(containers))

As you can see, I even added a user-agent but I'm still getting this error message. Any help is appreciated!

I believe the first argument you pass to 'findAll' won't help you: it is treated as a tag name, so the string 'td rowspan="1"' cannot match any element. The issue might therefore have nothing to do with the HTTP request-response cycle.
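One thing in the posted code may be worth checking separately: it builds `req` with the User-Agent header but then calls `uReq(myurl)`, so the header is never sent with the request as written. Below is a minimal sketch that passes the `Request` object to `urlopen` instead. The helper names `build_request` and `fetch_html` are made up for illustration, and whether sending the header changes what this particular site returns is not something verified here.

```python
import urllib.request

# Same User-Agent string as in the question.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36"
)


def build_request(url):
    """Return a Request that carries the custom User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, data=None, headers={"User-Agent": USER_AGENT})


def fetch_html(url):
    """Open the Request object (not the bare URL) so the header is actually sent."""
    req = build_request(url)
    with urllib.request.urlopen(req) as response:
        return response.read()
```

Calling `fetch_html('https://nvd.nist.gov/vuln/data-feeds')` would then return the raw bytes to hand to BeautifulSoup.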

I queried the URL you're using, and the distinct attribute sets found across all 'td' elements in the document are:

{'class': ['xml-file-size', 'file-20']}
{'class': ['xml-file-type', 'file-20']}
{'colspan': '2', 'class': ['xml-file-type', 'file-20']}
{'rowspan': '3'}
{'colspan': '2'}
{}

None of them has a 'rowspan' of 1 or the class 'x-hidden-focus', so querying for those returns an empty list.
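A listing like the one above can be produced by collecting each cell's `attrs` dictionary. The sketch below runs against a small made-up table rather than the live page; `SAMPLE_HTML` and `distinct_td_attrs` are hypothetical names, and the cell contents are placeholders, not data from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the attribute combinations listed above.
SAMPLE_HTML = """
<table>
  <tr>
    <td rowspan="3">feed</td>
    <td class="xml-file-type file-20">type</td>
    <td class="xml-file-size file-20">size</td>
  </tr>
  <tr><td colspan="2" class="xml-file-type file-20">meta</td></tr>
  <tr><td colspan="2">note</td><td>plain</td></tr>
</table>
"""


def distinct_td_attrs(html):
    """Return each distinct attribute dict found on <td> elements, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = []
    for td in soup.find_all("td"):
        attrs = dict(td.attrs)  # copy so later changes to the tree do not affect the result
        if attrs not in seen:
            seen.append(attrs)
    return seen


for attrs in distinct_td_attrs(SAMPLE_HTML):
    print(attrs)
```

Running the same function on the downloaded page shows which attribute values are actually available to filter on.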

Try this on the second-to-last line:

containers = page_soup.findAll('td', {'colspan': '2', 'class': 'file-20'})

or:

containers = page_soup.findAll('td', {'rowspan': '3'})

or even just:

containers = page_soup.findAll('td')

It is up to you which specific 'td' elements you're looking for.
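To see how these queries behave, here is a small sketch run against made-up markup instead of the live page. `SAMPLE_HTML` is hypothetical and only mimics the attribute combinations discussed in this answer; `find_all` is the current spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; cell contents are placeholders.
SAMPLE_HTML = """
<table>
  <tr>
    <td rowspan="3">feed</td>
    <td class="xml-file-type file-20">type</td>
    <td class="xml-file-size file-20">size</td>
  </tr>
  <tr><td colspan="2" class="xml-file-type file-20">meta</td></tr>
  <tr><td colspan="2">note</td><td>plain</td></tr>
</table>
"""

page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The first argument is a tag name, so attribute text inside it matches nothing.
no_match = page_soup.find_all('td rowspan="1"', {"class": "x-hidden-focus"})

# Tag name first, attribute filters in the dictionary.
by_colspan_and_class = page_soup.find_all("td", {"colspan": "2", "class": "file-20"})
by_rowspan = page_soup.find_all("td", {"rowspan": "3"})
all_cells = page_soup.find_all("td")

print(len(no_match), len(by_colspan_and_class), len(by_rowspan), len(all_cells))
```

A single class name such as `'file-20'` matches cells that carry it alongside other classes, which is why the filter does not need the full class string.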

Also check out the documentation to learn about more ways to use BeautifulSoup, including passing functions as arguments.
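As an illustration of passing a function, `find_all` accepts a callable that receives each tag and returns True for the ones to keep. This is a minimal sketch on made-up markup; `SAMPLE_HTML` and `is_td_with_rowspan` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; cell contents are placeholders.
SAMPLE_HTML = """
<table>
  <tr><td rowspan="3">feed</td><td class="file-20">type</td></tr>
  <tr><td colspan="2">note</td></tr>
</table>
"""


def is_td_with_rowspan(tag):
    """Keep only <td> tags that define a rowspan attribute, whatever its value."""
    return tag.name == "td" and tag.has_attr("rowspan")


page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
cells = page_soup.find_all(is_td_with_rowspan)
print([cell.get_text() for cell in cells])
```

A function filter is handy when the condition cannot be expressed as a fixed attribute value, such as "has this attribute at all".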
