I need to get news article data. I'm using requests.get from Python, but I get this error: 403 Forbidden

Here is the code:

from requests import get
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}

url = 'https://business.inquirer.net/category/latest-stories/page/10'
response = get(url)
print(response.text[:500])
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

And this is the result I got:

<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>

I have read that adding a header will solve the error, but I tried adding the header I copied from DevTools when I inspected the site, and it doesn't solve my problem. Please help me.

You don't use the headers variable anywhere, so it is never passed with the request. You can pass it with code like this:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

siteurl = "https://business.inquirer.net/category/latest-stories/page/10"
hdr = {'User-Agent': 'Mozilla/5.0'}
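req = Request(siteurl, headers=hdr)  # attach the header to the request
page = urlopen(req)
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly to avoid a bs4 warning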
print(soup)
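
To confirm that the header really travels with the request, and to read the server's reply even when it still answers 403, something like the following may help. It is a minimal sketch using only the standard library; the helper names build_request and fetch_html, the SITE_URL constant, and the 10-second timeout are assumptions made for illustration, not part of the answer above.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# URL taken from the question; the constant name is an assumption.
SITE_URL = "https://business.inquirer.net/category/latest-stories/page/10"


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent="Mozilla/5.0", timeout=10):
    """Fetch the page and return (status, body), also for 4xx/5xx replies."""
    req = build_request(url, user_agent)
    try:
        with urlopen(req, timeout=timeout) as page:
            return page.status, page.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # urlopen raises on a 403, but the error object still holds the body,
        # which is where the server explains why it refused the request.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Calling build_request(SITE_URL).header_items() shows the headers that will be sent without touching the network, and fetch_html(SITE_URL) returns the status code together with the body, so a 403 page can be inspected instead of crashing the script.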

When you try to scrape data from this site using BeautifulSoup, the site doesn't return its data.

When you try:

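from bs4 import BeautifulSoup
from urllib.error import HTTPError
from urllib.request import urlopen

url = "https://business.inquirer.net/category/latest-stories/page/10"

try:
    open_page = urlopen(url)
except HTTPError as err:
    # Python 3 raises on a 403; the error object still carries the response body
    open_page = err
source = BeautifulSoup(open_page, "html.parser")

print(source)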

You will see a line like:

<p>The owner of this website (business.inquirer.net) has banned your access based on your browser's signature (4af0dedd3eebcb40-ua48).</p>

So don't attempt to do it with BeautifulSoup alone. Using Selenium is easier.

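from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
# Selenium 4 takes the driver path through a Service object
service = Service(executable_path=r'your driver path')
driver = webdriver.Chrome(service=service, options=options)
driver.get('https://business.inquirer.net/category/latest-stories/page/10')

# find_elements_by_css_selector was removed in Selenium 4
x = driver.find_elements(By.CSS_SELECTOR, "div[id='ch-ls-head']")

for a in x:
    print(a.text)
driver.close()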

Output:

TAXATION
DOF clarifies: Rice tariffication law takes effect on March 5
FEBRUARY 19, 2019 BY:  BEN O. DE VERA
BANKS
HSBC reports net profit at $12.6B in 2018
FEBRUARY 19, 2019
CURRENCIES
Asian shares gain on hopes for progress on China-US trade
FEBRUARY 19, 2019
ECONOMY
Amro sees higher PH growth in 2019 on easing inflation, infra boost
FEBRUARY 19, 2019 BY:  BEN O. DE VERA
TELECOMMUNICATIONS
Poe to DICT: Stop ‘dilly-dallying’ over 3rd telco project
FEBRUARY 19, 2019 BY:  CHRISTIA MARIE RAMOS
SOCIAL SECURITY
SSS contribution collections grow by P22.19B in 2018
FEBRUARY 18, 2019 BY:  CHRISTIA MARIE RAMOS
STOCKS
World stocks mixed ahead of further China-US trade talks
FEBRUARY 18, 2019
TRADE
Rice tariffication starts on March 3
FEBRUARY 18, 2019 BY:  BEN O. DE VERA
AGRICULTURE/AGRIBUSINESS
NFA-Bohol workers wear black to mourn ‘death of the rice industry’
FEBRUARY 18, 2019 BY:  LEO UDTOHAN
BONDS
Treasury: RTBs to be sold to individual investors online in Q1
FEBRUARY 18, 2019 BY:  BEN O. DE VERA

This simply worked for me:

from bs4 import BeautifulSoup
import urllib.request

response = urllib.request.urlopen('https://business.inquirer.net/category/latest-stories/page/10')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)

Try including a header; many sites block requests without headers:

r = requests.get(url, headers=...)

Check the requests docs for more info: http://docs.python-requests.org/en/master/user/quickstart/
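
Applied to the code in the question, passing the header might look roughly like this. It is a sketch rather than a guaranteed fix, since the site may still reject the request on other grounds. The URL and User-Agent string come from the question; the helper names prepare_get and fetch and the 10-second timeout are assumptions introduced for illustration.

```python
import requests

# Both values are copied from the question.
URL = "https://business.inquirer.net/category/latest-stories/page/10"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
    )
}


def prepare_get(url, headers):
    """Build the GET request without sending it, so its headers can be inspected."""
    return requests.Request("GET", url, headers=headers).prepare()


def fetch(url, headers, timeout=10):
    """Send the request with the headers attached and fail loudly on 4xx/5xx."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on a 403
    return response.text
```

prepare_get(URL, HEADERS).headers shows exactly which headers would go out, which makes it easy to spot the original mistake of defining headers but never passing them. fetch(URL, HEADERS) then performs the real request and raises instead of silently returning the 403 page.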
