Scrapy spider not executing close method in docker container
I have a Flask application that runs a Scrapy spider. The application works fine on my development machine, but when I run it in a container, the spider's close method is not executed.
Here is the spider's code:
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from scrapy.exceptions import CloseSpider


class ToScrapeCSSSpider(scrapy.Spider):
    name = "toscrape-css"
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        page_text = response.text
        # raise CloseSpider("Blocked")
        soup = BeautifulSoup(page_text, "lxml")
        if "xml" in page_text[:20].lower():
            sitemap = True
            links = soup.findAll("loc")
            for link in links:
                yield scrapy.Request(url=link.text, callback=self.parse)
        else:
            raise CloseSpider("I want to close it")

    def close(spider, reason):
        print("Closing spider")
        # self.pbar.clear()
        # self.pbar.write('Closing {} spider'.format(spider.name))
        print("Spider closed")
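As an aside on the close hook: the documented shutdown hook in Scrapy is `closed(self, reason)`; the base `scrapy.Spider.close` is a static method that simply forwards to `closed` when the spider defines one. The sketch below is a simplified, dependency-free stand-in for that dispatch (the `Spider` class here is not the real Scrapy base class):

```python
# Simplified stand-in for scrapy.Spider's shutdown dispatch: the
# spider_closed signal ends up invoking close(spider, reason), which
# forwards to the spider's `closed` method when one is defined.
class Spider:
    @staticmethod
    def close(spider, reason):
        closed = getattr(spider, "closed", None)
        if callable(closed):
            return closed(reason)


class ToScrapeCSSSpider(Spider):
    name = "toscrape-css"

    def closed(self, reason):
        # Preferred hook: log or clean up using the shutdown reason.
        return "Closing {} spider: {}".format(self.name, reason)


message = Spider.close(ToScrapeCSSSpider(), "finished")
print(message)  # Closing toscrape-css spider: finished
```

Overriding `close` directly, as in the spider above, also works, which is why the method runs fine on the development machine.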
Here is my Flask application in main.py:
import crochet
crochet.setup()  # initialize crochet

import json
import pandas as pd
from flask import redirect, url_for, request
from scrapy.crawler import CrawlerRunner, CrawlerProcess
import time
from datetime import datetime, timedelta
import grequests
from flask import render_template, jsonify, Flask, redirect, url_for, request, flash
from app2.articles_finder.spiders.test_spider import ToScrapeCSSSpider
from app2 import app2

crawl_runner = CrawlerRunner()  # missing from the original snippet; needed below


@app2.route("/test_docker")
def test_docker():
    scrap_docker()
    return "Ok", 200


@crochet.run_in_reactor
def scrap_docker():
    eventual = crawl_runner.crawl(ToScrapeCSSSpider)
    eventual.addCallback(finished_docker)


def finished_docker(null):
    print("Scrapping is over in docker container")
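One thing worth noting about this pattern: `finished_docker` is only attached as a success callback, so if the crawl's Deferred fails, the error is swallowed silently, which makes debugging inside a container harder. The sketch below illustrates Twisted-style callback/errback chaining with a hypothetical `MiniDeferred` stand-in (not the real `twisted.internet.defer.Deferred`):

```python
# Hypothetical stand-in for a Twisted Deferred: successful results flow
# through callbacks, failures through errbacks. Without an errback, a
# failed crawl would report nothing at all.
class MiniDeferred:
    def __init__(self):
        self.callbacks = []

    def addCallback(self, fn):
        self.callbacks.append(("ok", fn))
        return self

    def addErrback(self, fn):
        self.callbacks.append(("err", fn))
        return self

    def fire(self, result, failed=False):
        for kind, fn in self.callbacks:
            if failed and kind == "err":
                result, failed = fn(result), False
            elif not failed and kind == "ok":
                result = fn(result)
        return result


events = []
d = MiniDeferred()
d.addCallback(lambda _: events.append("Scrapping is over"))
d.addErrback(lambda err: events.append("Crawl failed: {}".format(err)))
d.fire(None)  # success path: only the callback runs
print(events)  # ['Scrapping is over']
```

In the real application, adding `eventual.addErrback(...)` after the `addCallback` call would make crawl failures visible in the container logs.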
Finally, here is my Dockerfile:
FROM phusion/baseimage:0.9.19
# Use baseimage-docker's init system.
CMD ["/sbin/my_init"]
ENV TERM=xterm-256color
ENV SCRAPPER_HOME=/app/links_finder
ENV PYTHON_VERSION="3.6.5"
ENV FRONT_ADDRESS=blabla
# Set the locale
RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8
# Install necessary packages
RUN apt-get update && apt-get install -y \
build-essential
#RUN apt-get update && apt-get install -y \
# build-essential \
# Install core packages
#RUN apt-get update
RUN apt-get install -y build-essential checkinstall software-properties-common llvm cmake wget git nano nasm yasm zip unzip pkg-config \
libreadline-gplv2-dev libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev
# Install Python 3.6.5
RUN wget https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tar.xz \
&& tar xvf Python-${PYTHON_VERSION}.tar.xz \
&& rm Python-${PYTHON_VERSION}.tar.xz \
&& cd Python-${PYTHON_VERSION} \
&& ./configure \
&& make altinstall \
&& cd / \
&& rm -rf Python-${PYTHON_VERSION}
RUN apt-get install -y python3-pip
WORKDIR ${SCRAPPER_HOME}
COPY . ${SCRAPPER_HOME}
RUN ls
#COPY run_gunicorn_app_2.py ${SCRAPPER_HOME}
RUN pip3 install -r requirements2.txt
RUN chmod 777 -R *
# Clean up
RUN apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
#ENTRYPOINT python3 ${SCRAPPER_HOME}/run_gunicorn_app_2.py
EXPOSE 3456
ENTRYPOINT python3 run_gunicorn_app_2.py
#ENTRYPOINT python3 ${SCRAPPER_HOME}/run_gunicorn_app_2.py
And the requirements2.txt file:
tqdm==4.19.4
APScheduler==3.6.1
Flask==1.0.2
Flask-Admin==1.3.0
Flask-Bcrypt==0.7.1
Flask-DebugToolbar==0.10.0
Flask-Login==0.3.2
Flask-Mail==0.9.1
Flask-Script==2.0.5
Flask-SQLAlchemy==2.1
Flask-WTF==0.12
Flask-redis==0.4.0
gunicorn==19.4.5
itsdangerous==0.24
pytz==2016.10
structlog==16.1.0
termcolor==1.1.0
WTForms==2.1
scrapy==1.6.0
grequests==0.4.0
#pandas==0.24
crochet==1.10.0
redis==3.3.8
beautifulsoup4==4.7.1
publicsuffixlist==0.7.1
PyMySQL==0.9.3
Clearly, the close method is not executed at all. Any hints? I have been stuck on this for quite a while, so any clue would be welcome. Thanks!
After a lot of debugging, it turned out that nothing was actually wrong. I just had to add -u after python3 to get the logging output:
ENTRYPOINT python3 -u run_gunicorn_app_2.py
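Some context on why `-u` matters: `print()` writes to stdout, which Python block-buffers when it is not attached to a terminal, as is the case in a container, so short messages like the spider's close log can sit in the buffer and never appear in `docker logs`. An equivalent fix, assuming the same Dockerfile, is to set the standard `PYTHONUNBUFFERED` variable instead of passing `-u`:

```dockerfile
# Equivalent to `python3 -u`: disable Python's stdout/stderr buffering
ENV PYTHONUNBUFFERED=1
ENTRYPOINT python3 run_gunicorn_app_2.py
```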