I'm currently creating a web crawler to gather data from a website for a school project. The issue is that I'm getting the following output (only from this one webpage):
<h1>You are viewing this page in an unauthorized frame window.</h1>
0
[Finished in 5.4s]
Here is the full code:
#Creating my own webcrawler
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import urllib.request
myurl = 'https://nvd.nist.gov/vuln/data-feeds'
myReq = (myurl)
req = urllib.request.Request(
    myurl,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
#opening my connection, grabbing the page
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, 'html.parser')
print(page_soup.h1)
containers = page_soup.findAll('td rowspan="1"',{'class':'x-hidden-focus'})
print(len(containers))
As you can see, I even added a user-agent but I'm still getting this error message. Any help is appreciated!
I believe the problem is the first argument you pass to 'findAll', so the issue may have nothing to do with the HTTP request-response cycle. That argument should be a tag name only, and 'td rowspan="1"' is not a tag name, so nothing can match it.
I queried the URL you're using, and the only attribute combinations that occur on 'td' elements in the document are:
{'class': ['xml-file-size', 'file-20']}
{'class': ['xml-file-type', 'file-20']}
{'colspan': '2', 'class': ['xml-file-type', 'file-20']}
{'rowspan': '3'}
{'colspan': '2'}
{}
So querying for a 'rowspan' of 1 and the class 'x-hidden-focus' returns an empty list.
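A listing like the one above can be produced by collecting the 'attrs' dictionary of every 'td' element. This is only a minimal sketch: 'sample_html' is a made-up stand-in for the downloaded page (the real markup differs), and 'seen_attrs' is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup differs.
sample_html = """
<table>
  <tr><td rowspan="3">CVE-Modified</td><td class="xml-file-type file-20">GZ</td></tr>
  <tr><td class="xml-file-size file-20">0.25</td><td colspan="2">Updated</td></tr>
  <tr><td>plain</td><td class="xml-file-type file-20">ZIP</td></tr>
</table>
"""

page_soup = BeautifulSoup(sample_html, "html.parser")

# Collect each distinct attribute dictionary once, in document order.
seen_attrs = []
for td in page_soup.find_all("td"):
    if td.attrs not in seen_attrs:
        seen_attrs.append(td.attrs)

for attrs in seen_attrs:
    print(attrs)
```

Running the same loop on the real 'page_soup' shows which attribute values are actually available to filter on.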
Try replacing the second-to-last line with:
containers = page_soup.findAll('td', {'colspan': '2', 'class': 'file-20'})
or:
containers = page_soup.findAll('td', {'rowspan': '3'})
or even just:
containers = page_soup.findAll('td')
It's up to you which specific 'td' elements you're looking for.
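To see the difference between these calls without hitting the network, here is a small sketch run against an inline snippet. The 'sample_html' markup is invented for illustration and only imitates the attribute combinations listed earlier.

```python
from bs4 import BeautifulSoup

# Invented markup that imitates the attribute combinations on the page.
sample_html = """
<table>
  <tr><td rowspan="3">A</td><td colspan="2" class="xml-file-type file-20">B</td></tr>
  <tr><td class="xml-file-size file-20">C</td></tr>
</table>
"""
page_soup = BeautifulSoup(sample_html, "html.parser")

# The tag name and the attribute filter are separate arguments.
bad = page_soup.find_all('td rowspan="1"', {"class": "x-hidden-focus"})
by_rowspan = page_soup.find_all("td", {"rowspan": "3"})
by_class = page_soup.find_all("td", {"class": "file-20"})
all_cells = page_soup.find_all("td")

print(len(bad), len(by_rowspan), len(by_class), len(all_cells))  # prints: 0 1 2 3
```

Note that filtering on 'class' matches any one of an element's classes, which is why 'file-20' alone finds both cells.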
Also check out the documentation to learn about more ways to use BeautifulSoup, such as passing functions as arguments.
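As a rough illustration of passing a function, the sketch below selects cells with a custom predicate. The names 'is_file_cell' and 'sample_html' are hypothetical, and the markup is invented rather than taken from the real page.

```python
from bs4 import BeautifulSoup

# Invented markup; the real page differs.
sample_html = """
<table>
  <tr><td rowspan="3">A</td><td colspan="2" class="xml-file-type file-20">B</td></tr>
  <tr><td class="xml-file-size file-20">C</td></tr>
</table>
"""
page_soup = BeautifulSoup(sample_html, "html.parser")


def is_file_cell(tag):
    """Return True for 'td' tags whose class list contains 'file-20'."""
    return tag.name == "td" and "file-20" in tag.get("class", [])


# find_all calls the function on every tag and keeps those returning True.
cells = page_soup.find_all(is_file_cell)
print([cell.get_text() for cell in cells])
```

A predicate like this is handy when the condition is awkward to express as a plain attribute dictionary.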