
Scrapy - How to run spiders with AWS Lambda functions?

Currently I have two small projects using Scrapy. One project basically scrapes URLs, while the other only scrapes products from the scraped URLs. The directory structure is this:

.
├── requirements.txt
├── .venv
├── url
|   ├── geckodriver
|   ├── scrapy.cfg
|   ├── url
|   |   ├── items.py
|   |   ├── middlewares.py
|   |   ├── pipelines.py
|   |   ├── settings.py
|   |   ├── spiders
|   |   |    ├── store1.py
|   |   |    ├── store2.py
|   |   |    ├── ...
├── product
|   ├── geckodriver
|   ├── scrapy.cfg
|   ├── product
|   |   ├── items.py
|   |   ├── middlewares.py
|   |   ├── ...

When I want to run a spider from the command line, I always have to do it from inside the project directory: ~/search/url$ scrapy crawl store1 or ~/search/product$ scrapy crawl store1.

How can I deploy and run this project using AWS Lambda functions?

Hello, I know I am very late, and I am sorry for that.

This code is part of a script used in a previous project for a client. Just replace spider_class_getting_from_spiders with your spider class.

Thank you.

import sys
import types

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from settings import *
from spiders import *


# Start the Twisted reactor managed by crochet so Scrapy can run
# inside an ordinary synchronous process such as a Lambda handler.
setup()

# The Lambda Python runtime may lack the sqlite3 C extension that
# Scrapy tries to import; stub the modules out so startup does not fail.
# (types.ModuleType replaces the removed imp.new_module.)
sys.modules["sqlite"] = types.ModuleType("sqlite")
sys.modules["sqlite3.dbapi2"] = types.ModuleType("sqlite.dbapi2")


@wait_for(900)  # maximum 15 minutes, the Lambda execution-time limit
def crawl(spider_class_getting_from_spiders):
    """
    wait_for(timeout) is in seconds; change it accordingly.
    This function raises crochet.TimeoutError if more than 900
    seconds elapse without the crawl finishing.
    """
    configure_logging({'LOG_LEVEL': 'ERROR'})
    runner = CrawlerRunner({'DOWNLOADER_MIDDLEWARES': DOWNLOADER_MIDDLEWARES})
    d = runner.crawl(spider_class_getting_from_spiders)
    return d


def lambda_handler(event, context):
    # Replace spider_class_getting_from_spiders with your spider class
    # imported from spiders.py.
    crawl(spider_class_getting_from_spiders)
    # Return the whole event instead of just a response, so the input
    # is passed on to the next state in a Step Functions workflow.
    event['statusCode'] = 200
    return event
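
For a quick local sanity check before packaging, you might invoke the handler directly. This is a minimal sketch, not part of the original answer: it assumes the module above is saved as lambda_function.py (the default file name behind the lambda_function.lambda_handler handler string AWS Lambda expects), that spider_class_getting_from_spiders has already been replaced with a real spider class from spiders.py, and the event payload is only an example.

# local_test.py - smoke-test the handler outside Lambda.
# Assumes the code above lives in lambda_function.py and that
# spider_class_getting_from_spiders was replaced with a real spider
# class imported from spiders.py.
from lambda_function import lambda_handler

if __name__ == "__main__":
    event = {"store": "store1"}  # example input from a Step Function state
    result = lambda_handler(event, context=None)
    print(result)  # should echo the event with statusCode == 200 added

For the actual deployment, Scrapy, Twisted, and crochet have to be bundled with this module, either in the zip archive you upload or in a container image, since none of them ship with the Lambda Python runtime.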
