web scraper is being denied by the website even after implementing a user-agent

Question

I'm currently creating a web crawler to gather data from a website for a school project. The issue is that I get the following error output (only from this one webpage):

<h1>You are viewing this page in an unauthorized frame window.</h1>
0
[Finished in 5.4s]

Here is the full code:

#Creating my own webcrawler

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import urllib.request


myurl = 'https://nvd.nist.gov/vuln/data-feeds'
myReq = (myurl)

req = urllib.request.Request(
    myurl, 
    data=None, 
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
) 

#opening my connection, grabbing the page
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, 'html.parser')

print(page_soup.h1)

containers = page_soup.findAll('td rowspan="1"',{'class':'x-hidden-focus'})
print(len(containers))

As you can see, I even added a user-agent but I'm still getting this error message. Any help is appreciated!

Answer 1

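I believe the first argument you pass to 'findAll' is the problem, so the issue might have nothing to do with the HTTP request-response cycle. 'findAll' expects a tag name such as 'td' as its first argument; attributes like 'rowspan' belong in the dictionary passed as the second argument.

I queried the URL you're using, and these are all the attribute sets that appear on 'td' elements in the document: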

{'class': ['xml-file-size', 'file-20']}
{'class': ['xml-file-type', 'file-20']}
{'colspan': '2', 'class': ['xml-file-type', 'file-20']}
{'rowspan': '3'}
{'colspan': '2'}
{}

None of them has a 'rowspan' of 1 or the class 'x-hidden-focus', so your query returns an empty list.

Try replacing the second-to-last line with:
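For reference, a listing like the one above can be produced by collecting the 'attrs' dictionary of every 'td' element. This is only a minimal sketch: the HTML string is made-up stand-in markup (not the real NVD page), and I use 'find_all', the current spelling of 'findAll'.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the real page has many more rows.
html = """
<table>
  <tr>
    <td class="xml-file-type file-20" colspan="2">feed</td>
    <td rowspan="3">notes</td>
    <td>plain</td>
    <td>plain again</td>
  </tr>
</table>
"""

page_soup = BeautifulSoup(html, "html.parser")

# Collect each distinct attribute dictionary once, in document order.
seen = []
for td in page_soup.find_all("td"):
    attrs = dict(td.attrs)
    if attrs not in seen:
        seen.append(attrs)

for attrs in seen:
    print(attrs)
```

Note that 'class' comes back as a list because it is a multi-valued attribute, which is why the listing shows values like ['xml-file-type', 'file-20'].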

containers = page_soup.findAll('td', {'colspan': '2', 'class': 'file-20'})

or:

containers = page_soup.findAll('td', {'rowspan': '3'})

or even just:

containers = page_soup.findAll('td')

It's up to you which specific 'td' elements you're looking for.
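To illustrate the difference between the original call and a corrected one, here is a small sketch. The markup is invented for the example (it deliberately contains a cell with rowspan="1" and class "x-hidden-focus", which the real page does not), and 'find_all' is the current name for 'findAll'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not taken from the real page.
html = """
<table>
  <tr>
    <td rowspan="1" class="x-hidden-focus">a</td>
    <td>b</td>
  </tr>
</table>
"""

page_soup = BeautifulSoup(html, "html.parser")

# The first argument is matched against tag names, and no tag is
# literally named 'td rowspan="1"', so nothing is found.
wrong = page_soup.find_all('td rowspan="1"', {"class": "x-hidden-focus"})

# Tag name first, attribute filters in the dictionary.
right = page_soup.find_all("td", {"rowspan": "1", "class": "x-hidden-focus"})

# No filters at all: every td element.
all_cells = page_soup.find_all("td")

print(len(wrong), len(right), len(all_cells))
```

Even when a matching cell exists, the original form finds nothing, because the attribute text is being treated as part of the tag name.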

Also check out the BeautifulSoup documentation to learn about more ways to query a document, including passing functions as arguments.
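As a rough example of passing a function, the sketch below selects every 'td' that spans more than one row or column. The helper name 'is_spanning_cell' and the markup are my own, made up for illustration; the function receives each tag and returns True for the ones to keep.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<table>
  <tr>
    <td colspan="2">wide</td>
    <td rowspan="3">tall</td>
    <td>plain</td>
  </tr>
</table>
"""

page_soup = BeautifulSoup(html, "html.parser")


def is_spanning_cell(tag):
    """Return True for td elements that carry a rowspan or colspan attribute."""
    return tag.name == "td" and (tag.has_attr("rowspan") or tag.has_attr("colspan"))


# find_all calls the function on every tag and keeps those returning True.
cells = page_soup.find_all(is_spanning_cell)
texts = [cell.get_text() for cell in cells]
print(texts)
```

This is handy when the condition is hard to express as a fixed attribute dictionary.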

solution1
0 ACCEPTED 2019-03-19 10:19:32
