
BeautifulSoup: Why am I getting an internal server error?

I wanted to scrape the table on this page.

I wrote this code:

import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup
import sys
import requests
import pandas as pd

webpage = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
page = urllib.request.urlopen(webpage)
soup = BeautifulSoup(page,'html.parser')
soup_text = soup.get_text()
print(soup)

The output is an error:

Traceback (most recent call last):
  File "scrape_cpad.py", line 9, in <module>
    page = urllib.request.urlopen(webpage)
  File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/Users/kela/anaconda/envs/py3/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 500: Internal Server Error

I've tried on two different computers and networks. I can also see that the server is running, because I can visit the page in a browser and view its source code.

I also tried changing the URL from https to http or www.

Could someone show me working code to connect to this page and pull down the table?

P.S. I've seen that there are similar questions, e.g. here and here, but none that answers my question.

Use the requests module to grab the page.

For example:

import requests
from bs4 import BeautifulSoup


url = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for tr in soup.select('tr[data-toggle="modal"]'):
    print(tr.get_text(strip=True, separator=' '))
    print('-' * 120)

Prints:

P-0001 GYE 3 Amyloid Amyloid-beta precursor protein (APP) P05067 No Org Lett. 2008 Jul 3;10(13):2625-8. 18529009 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0002 KFFE 4 Amyloid J Biol Chem. 2002 Nov 8;277(45):43243-6. 12215440 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0003 KVVE 4 Amyloid J Biol Chem. 2002 Nov 8;277(45):43243-6. 12215440 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0004 NNQQ 4 Amyloid Eukaryotic peptide chain release factor GTP-binding subunit (ERF-3) P05453 Nature. 2007 May 24;447(7143):453-7. 17468747 CPAD
------------------------------------------------------------------------------------------------------------------------
P-0005 VKSE 4 Non-amyloid Microtubule-associated protein tau (PHF-tau) P10636 Proc Natl Acad Sci U S A. 2000 May 9;97(10):5129-34. 10805776 AmyLoad
------------------------------------------------------------------------------------------------------------------------
P-0006 AILSS 5 Amyloid Islet amyloid polypeptide (Amylin) P10997 No Proc Natl Acad Sci U S A. 1990 Jul;87(13):5036-40. 2195544 CPAD
------------------------------------------------------------------------------------------------------------------------


...and so on.
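Since the question imports pandas, it may help to see the same selector feeding a structured table instead of printed lines. A minimal sketch using a static HTML snippet that mimics the row structure (the tag layout and columns here are assumptions; the real page has more columns), whose row lists could then be passed to `pd.DataFrame(rows, columns=[...])`:

```python
from bs4 import BeautifulSoup

# Static snippet mimicking the table rows (markup assumed; the real
# page's rows carry more cells than shown here).
html = """
<table>
  <tr data-toggle="modal"><td>P-0001</td><td>GYE</td><td>3</td><td>Amyloid</td></tr>
  <tr data-toggle="modal"><td>P-0002</td><td>KFFE</td><td>4</td><td>Amyloid</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# One list per row, one string per cell.
rows = [[td.get_text(strip=True) for td in tr.select('td')]
        for tr in soup.select('tr[data-toggle="modal"]')]
print(rows)
```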


It seems the server rejects requests that come without a proper User-Agent header.

I tried setting the User-Agent to my browser's, and managed to get it to respond with an HTML page:

import urllib.request

webpage = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
req = urllib.request.Request(webpage)
# spoof the UA header
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0')

page = urllib.request.urlopen(req)
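Equivalently, the header can be supplied through the `headers` keyword of `urllib.request.Request` instead of a separate `add_header` call. A minimal sketch that only builds the request (no network call is made here; `urlopen(req)` would fetch the page):

```python
import urllib.request

webpage = 'https://web.iitm.ac.in/bioinfo2/cpad2/peptides/?page=1'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) '
                         'Gecko/20100101 Firefox/77.0'}

# The Request object carries the spoofed header; urlopen(req) would send it.
req = urllib.request.Request(webpage, headers=headers)
print(req.get_header('User-agent'))
```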
