BeautifulSoup: Why am I getting an internal server error?
I wanted to scrape the table on this page.
I wrote this code:
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import sys
import requests
import pandas as pd
webpage = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
page = urllib.request.urlopen(webpage)
soup = BeautifulSoup(page,'html.parser')
soup_text = soup.get_text()
print(soup)
The output is an error:
Traceback (most recent call last):
File "scrape_cpad.py", line 9, in <module>
page = urllib.request.urlopen(webpage)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 532, in open
response = meth(req, response)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 570, in error
return self._call_chain(*args)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error
I've tried on two different computers and networks. Also, I can see the server is running, because I can visit the page in a browser and view its source code.
I also tried changing the URL from https to http or www.
Could someone show me working code to connect to this page and pull down the table?
P.S. I've seen that there are similar questions, e.g. here and here, but not one that answers my question.
Use the requests module to grab the page. For example:
import requests
from bs4 import BeautifulSoup
url = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
soup = BeautifulSoup(requests.get(url).content ,'html.parser')
for tr in soup.select('tr[data-toggle="modal"]'):
    print(tr.get_text(strip=True, separator=' '))
    print('-' * 120)
Prints:
P-0001 GYE 3 Amyloid Amyloid-beta precursor protein (APP) P05067 No Org Lett. 2008 Jul 3;10(13):2625-8. 18529009 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0002 KFFE 4 Amyloid J Biol Chem. 2002 Nov 8;277(45):43243-6. 12215440 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0003 KVVE 4 Amyloid J Biol Chem. 2002 Nov 8;277(45):43243-6. 12215440 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0004 NNQQ 4 Amyloid Eukaryotic peptide chain release factor GTP-binding subunit (ERF-3) P05453 Nature. 2007 May 24;447(7143):453-7. 17468747 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0005 VKSE 4 Non-amyloid Microtubule-associated protein tau (PHF-tau) P10636 Proc Natl Acad Sci U S A. 2000 May 9;97(10):5129-34. 10805776 AmyLoad
------------------------------------------------------------------------------------------------------------------------
P-0006 AILSS 5 Amyloid Islet amyloid polypeptide (Amylin) P10997 No Proc Natl Acad Sci U S A. 1990 Jul;87(13):5036-40. 2195544 CPAD
------------------------------------------------------------------------------------------------------------------------
...and so on.
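If you want the individual cell values rather than one joined string, you can select the `td` elements inside each matched row. A minimal sketch of the idea, run here against a small inline snippet that mimics the table's structure (the sample HTML is an assumption modelled on the printed rows, not the live page):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the structure of the CPAD table rows
html = '''
<table>
  <tr data-toggle="modal"><td>P-0001</td><td>GYE</td><td>3</td><td>Amyloid</td></tr>
  <tr data-toggle="modal"><td>P-0002</td><td>KFFE</td><td>4</td><td>Amyloid</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')
# One list of cell strings per row, instead of one space-joined string
rows = [[td.get_text(strip=True) for td in tr.select('td')]
        for tr in soup.select('tr[data-toggle="modal"]')]
print(rows)
# [['P-0001', 'GYE', '3', 'Amyloid'], ['P-0002', 'KFFE', '4', 'Amyloid']]
```

Lists of cells like this are also easy to feed straight into `pandas.DataFrame` if you want the table as a DataFrame, which the question's imports suggest.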
Seems like the server rejects requests that come without a proper User-Agent header.
I tried setting the User-Agent to my browser's, and I managed to make it respond with an HTML page:
import urllib.request

webpage = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
req = urllib.request.Request(webpage)
# spoof the UA header
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0')
page = urllib.request.urlopen(req)
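If you'd rather use requests, the same spoofing can be done by setting the header once on a Session, so every request through it carries the browser-like User-Agent. A sketch (the UA string is just an example browser string):

```python
import requests

# A session whose every request sends a browser-like User-Agent
# instead of the default "python-requests/x.y.z" one.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) '
                  'Gecko/20100101 Firefox/77.0',
})

# session.get('https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1')
# would now send the spoofed header.
print(session.headers['User-Agent'])
```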