BeautifulSoup: Why am I getting an internal server error?
I wanted to scrape the table on this page.
I wrote this code:
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import sys
import requests
import pandas as pd
webpage = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
page = urllib.request.urlopen(webpage)
soup = BeautifulSoup(page,'html.parser')
soup_text = soup.get_text()
print(soup)
The output is an error:
Traceback (most recent call last):
File "scrape_cpad.py", line 9, in <module>
page = urllib.request.urlopen(webpage)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error
I've tried on two different computers and networks. Also, I can see the server is running, because I can visit the page in a browser and view its source code.
I also tried changing the URL from https to http or www.
Could someone show me working code to connect to this page and pull down the table?
P.S. I've seen that there are similar questions, e.g. here and here, but not one that answers my question.
Use the requests module to grab the page. For example:
import requests
from bs4 import BeautifulSoup
url = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
soup = BeautifulSoup(requests.get(url).content ,'html.parser')
for tr in soup.select('tr[data-toggle="modal"]'):
    print(tr.get_text(strip=True, separator=' '))
    print('-' * 120)
Prints:
P-0001 GYE 3 Amyloid Amyloid-beta precursor protein (APP) P05067 No Org Lett. 2008 Jul 3;10(13):2625-8. 18529009 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0002 KFFE 4 Amyloid J Biol Chem. 2002 Nov 8;277(45):43243-6. 12215440 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0003 KVVE 4 Amyloid J Biol Chem. 2002 Nov 8;277(45):43243-6. 12215440 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0004 NNQQ 4 Amyloid Eukaryotic peptide chain release factor GTP-binding subunit (ERF-3) P05453 Nature. 2007 May 24;447(7143):453-7. 17468747 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0005 VKSE 4 Non-amyloid Microtubule-associated protein tau (PHF-tau) P10636 Proc Natl Acad Sci U S A. 2000 May 9;97(10):5129-34. 10805776 AmyLoad
------------------------------------------------------------------------------------------------------------------------
P-0006 AILSS 5 Amyloid Islet amyloid polypeptide (Amylin) P10997 No Proc Natl Acad Sci U S A. 1990 Jul;87(13):5036-40. 2195544 CPAD
------------------------------------------------------------------------------------------------------------------------
...and so on.
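If you want the individual cell values rather than one joined string, you can select the `td` elements inside each matched row. A minimal sketch of the idea, run here against a small inline snippet that mimics the table's structure (the sample HTML is an assumption modelled on the printed rows, not the live page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the structure of the CPAD table rows
html = '''
<table>
  <tr data-toggle="modal"><td>P-0001</td><td>GYE</td><td>3</td><td>Amyloid</td></tr>
  <tr data-toggle="modal"><td>P-0002</td><td>KFFE</td><td>4</td><td>Amyloid</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
# One list of cell strings per row, instead of one space-joined string
rows = [[td.get_text(strip=True) for td in tr.select('td')]
        for tr in soup.select('tr[data-toggle="modal"]')]
print(rows)
# [['P-0001', 'GYE', '3', 'Amyloid'], ['P-0002', 'KFFE', '4', 'Amyloid']]
```

Lists of cells like this are also easy to feed straight into `pandas.DataFrame` if you want the table as a DataFrame, which the question's imports suggest.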
Seems like the server rejects requests that come without a proper User-Agent header.
I tried setting the User-Agent to my browser's, and I managed to make it respond with an HTML page:
import urllib.request

webpage = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
req = urllib.request.Request(webpage)
# spoof the UA header
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0')
page = urllib.request.urlopen(req)
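If you'd rather use requests, the same spoofing can be done by setting the header once on a Session, so every request through it carries the browser-like User-Agent. A sketch (the UA string is just an example browser string):

```python
import requests

# A session whose every request sends a browser-like User-Agent
# instead of the default "python-requests/x.y.z" one.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) '
                  'Gecko/20100101 Firefox/77.0',
})

# session.get('https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1')
# would now send the spoofed header.
print(session.headers['User-Agent'])
```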