
Python Web Scraping - How to scrape a News website 24/7 for new articles?

I couldn't find an answer to my particular question, so I'm sorry if this has been asked already.

I've created a Python program that scrapes the articles posted on news websites and filters them by certain keywords. On average, when I run it once in the evening, it searches through about 2,000 articles from that day. Now I want this program to run in a loop 24/7, looking for new articles in real time (or every 5 minutes). When it finds something matching my keywords, I get notified.
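For readers following along, the loop described above could look roughly like this. It is only a sketch under assumptions: articles are assumed to be dicts with `url` and `title` keys, and `fetch` and `notify` are hypothetical callables standing in for the actual scraper and notification code, which the question does not show.

```python
import time


def find_new_matches(articles, keywords, seen_urls):
    """Return unseen articles whose title contains any keyword.

    Every article passed in is marked as seen, so it is only checked once.
    Assumed article shape: {"url": str, "title": str}.
    """
    lowered = [k.lower() for k in keywords]
    matches = []
    for article in articles:
        url = article["url"]
        if url in seen_urls:
            continue  # already checked on an earlier pass
        seen_urls.add(url)
        title = article["title"].lower()
        if any(k in title for k in lowered):
            matches.append(article)
    return matches


def run_forever(fetch, notify, keywords, interval_seconds=300):
    """Poll every `interval_seconds` (300 s = 5 minutes) and notify on new hits.

    `fetch` and `notify` are hypothetical callables supplied by the caller.
    """
    seen_urls = set()
    while True:
        try:
            for article in find_new_matches(fetch(), keywords, seen_urls):
                notify(article)
        except Exception as exc:
            # One failed poll should not stop a 24/7 process.
            print(f"poll failed: {exc}")
        time.sleep(interval_seconds)
```

Keeping the matching logic in `find_new_matches` separate from the endless loop makes it easy to test without waiting or touching the network.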

So I wanted to know whether you have any good recommendations on hosting. I've heard about AWS Lambda but wanted to get a second opinion. Anything that costs below $250 a month is possible :) Maybe someone has a similar project running or can confirm my idea with AWS. Thanks in advance!

There are basically 2 options that come to mind:

  1. You can provide your own host to run your code 24/7, e.g. an old laptop or PC you're not using, effectively paying only for electricity. This method won't let you scale up later, though (assuming you don't want to buy new hardware).
  2. You can use a public cloud (AWS, GCP, etc.). AWS Lambda or a dedicated EC2 instance are the first that come to mind; they are relatively easy to set up and run code on. Actual costs vary depending on AWS region, instance type, usage time, and other factors (e.g. will you be using S3 as well?), but you could keep them below $250 per month without too much trouble. Small Lambdas and EC2 instances are quite cheap to use, and you could easily scale up if you need more resources.
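To get a feel for the "usage time" factor, here is a small back-of-the-envelope calculation for a job that runs every 5 minutes. The 60-second run duration is purely an assumption for illustration, and no prices are included because they depend on region and configuration.

```python
def runs_per_month(interval_minutes, days=30):
    """Number of scheduled runs in a month of `days` days."""
    return days * 24 * 60 // interval_minutes


def compute_seconds_per_month(interval_minutes, seconds_per_run, days=30):
    """Total execution time per month, given an assumed duration per run."""
    return runs_per_month(interval_minutes, days) * seconds_per_run


print(runs_per_month(5))                 # 8640 runs
print(compute_seconds_per_month(5, 60))  # 518400 seconds, i.e. 144 hours
```

Plugging the resulting figures into the provider's pricing calculator gives a rough monthly estimate to compare against the budget.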

Option 2 is better :)

Great question. Once your script starts, do you ever run new scripts, or can you just leave the terminal running?

In the latter case, you need Amazon EC2, not Lambda. Lambda is for running functions; an EC2 instance is the "cloud computer" you are looking for to "host" and run your program.
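To illustrate the difference: on EC2 you would leave a long-running loop going, while on Lambda each invocation does a single pass and then exits. A minimal sketch of the Lambda shape, assuming the function is invoked on a schedule, might look like this. `fetch_articles` and the keyword list are hypothetical placeholders; only the `lambda_handler(event, context)` signature is the standard Python handler form.

```python
import json


def check_once(fetch, keywords):
    """One pass: return titles of fetched articles matching any keyword."""
    lowered = [k.lower() for k in keywords]
    return [
        article["title"]
        for article in fetch()
        if any(k in article["title"].lower() for k in lowered)
    ]


def fetch_articles():
    """Hypothetical placeholder; a real version would download and parse the site."""
    return []


def lambda_handler(event, context):
    """Entry point called once per invocation; there is no loop and no sleep."""
    matches = check_once(fetch_articles, keywords=["python"])
    return {"statusCode": 200, "body": json.dumps({"matches": matches})}
```

Note that state kept in memory is not guaranteed to survive between invocations, so a Lambda version would need to record already-seen articles in some external storage.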

Look into EC2, and use EBS or EFS for storage. S3 is good for storing images, links, or other objects, but if you are using an EC2 instance (cloud computer), don't need to store your data as objects, and don't need a dedicated MySQL or NoSQL database, just store the info on your EBS or EFS volume. Remember: EBS and EFS are the hard drive of the computer (your EC2 instance), Amazon RDS is the database service, Amazon Aurora lives inside RDS and is for MySQL and PostgreSQL, and S3 is like an image/object drive. For example, if you had an ebook you were going to distribute, you would store it in S3.
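As a rough illustration of "just store the info on the instance's disk", the sketch below keeps already-seen articles in a SQLite file, which would live on the EBS volume attached to the instance. The file name `articles.db` and the table layout are assumptions, not something from the question.

```python
import sqlite3


def open_store(path="articles.db"):
    """Open (or create) a SQLite file on the local disk with a `seen` table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.commit()
    return conn


def record_if_new(conn, url, title):
    """Insert the article; return True only if the URL was not stored before."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO seen (url, title) VALUES (?, ?)", (url, title)
    )
    conn.commit()
    return cur.rowcount == 1
```

Because the data sits in a file rather than in memory, the scraper can be restarted without re-notifying about articles it has already processed.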

You can set up EC2 and EBS for free too. Just use the free tier and pick a t2.micro for the EC2 instance. See how it runs for a few days and then go bigger when necessary.

