
Retrieving HTML from a URL on Heroku

I use Heroku to host my Telegram bot. The purpose of the bot is to retrieve HTML from a web page and convert it to PDF.

After successfully hosting it online, I tried sending a URL to the bot, but it freezes at the moment it sends the GET request.

Code:

logger.info('retrieving HTML = {}'.format(url))
page_html = requests.get(url)

logger.info('retrieved HTML')
logger.info('started HTML parsing')
soup = BeautifulSoup(page_html.text, 'html.parser')

In the Heroku logs I only see retrieving HTML = <URL>, and then the application shows no further sign of activity.

I tried to connect to the dyno (the app itself on Heroku) using the Heroku console (accessible from the Heroku web page), and entered the following code:

import requests
# URL of a recipe
url = 'https://pikabu.ru/story/pirog_quotlen__matushkaquot_5332461'
html = requests.get(url)

Running this code in the Heroku console also hangs: it never finishes, prints no error or message, and I can only stop it with Ctrl+C. I am not sure what the problem could be.

Thank you in advance; any hint or help would be appreciated.

Without seeing your logs or knowing how big the page you want to scrape is, my guess is that you are hitting Heroku's 30-second timeout.

From the Dev Center article on timeouts:

The request must then be processed in the dyno by your application, and a response delivered back to the router, within 30 seconds to avoid the timeout.
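Independently of the router limit, requests.get waits indefinitely when no timeout is passed, which matches a call that never returns. A minimal sketch of bounding the request follows. The function name fetch_html, the 5-second and 20-second values, and the get parameter (there only so the function can be exercised without network access) are assumptions for illustration, not part of the original bot.

```python
import logging
from typing import Optional

import requests

logger = logging.getLogger(__name__)

# Assumed values: up to 5 s to connect and 20 s between received bytes.
# Note that the read timeout is not a cap on the total download time.
DEFAULT_TIMEOUT = (5, 20)


def fetch_html(url: str, timeout=DEFAULT_TIMEOUT, get=requests.get) -> Optional[str]:
    """Return the page HTML, or None if the request fails or times out."""
    logger.info('retrieving HTML = %s', url)
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.Timeout:
        # Raised for both connect and read timeouts.
        logger.error('timed out retrieving %s', url)
        return None
    except requests.exceptions.RequestException as exc:
        # Any other requests failure (DNS, refused connection, HTTP error).
        logger.error('failed to retrieve %s: %s', url, exc)
        return None
    logger.info('retrieved HTML')
    return response.text
```

With a timeout in place, a stalled request should produce a log line instead of a silent hang, which also helps tell a slow target site apart from a platform limit.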

I would check your logs (heroku logs -t -a yourAppName) while running the script and look for H12, which is the timeout error code. Or, if you are using Hobby or higher dynos, you could check the application metrics on the Dashboard.
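If the log stream is noisy, the H12 entries can be filtered out with a small helper. This is only a sketch: it assumes the router writes the error code as code=H12 in each log line, and the function name find_h12 is hypothetical.

```python
import re
from typing import Iterable, Iterator

# Assumption: router error lines carry the code as "code=H12".
H12_PATTERN = re.compile(r'\bcode=H12\b')


def find_h12(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the log lines that report an H12 request timeout."""
    for line in lines:
        if H12_PATTERN.search(line):
            yield line.rstrip('\n')
```

Because it is a generator, it can be fed a live stream (for example sys.stdin when the output of the heroku logs command is piped into a script) and will yield matches as they arrive.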
