Scrapy: running code before settings.py is loaded

Question

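I have a web scraper that uses proxies. I have a script that generates a list of 100 working proxies, and I then point settings.py at that list as the proxy source. My problem is that I currently run the script that generates the file by hand and then run the spider.

If I want that script to run before settings.py is "processed", does anyone know where I should put the code? I don't want to run the script manually before running the spider, because I want the project to be self-contained. The relevant setting is:

ROTATING_PROXY_LIST_PATH = 'C:\\Users\\cmdan\\Desktop\\Spiders\\Michael Mitarotonda\\proxies.txt'

Thanks in advance!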

Answer 1

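The documentation explains how to run Scrapy from a script. That lets you perform other work, such as running your proxy script, before the spider starts.

You can either define your spider in this script or import it; both work.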

import scrapy
from scrapy.crawler import CrawlerProcess

# if you want to import your spider
# from project.spiders import myspider

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

# here comes your script, setting the value of
# ROTATING_PROXY_LIST_PATH

process = CrawlerProcess(settings={
    "FEEDS": {
        "items.json": {"format": "json"},
    },
    "ROTATING_PROXY_LIST_PATH": "path-to-file",
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
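
If the spider lives in an existing Scrapy project, the settings dict above will not include what settings.py defines (for example, the middlewares). A minimal sketch of one way to handle this, assuming the script is run from the project root: generate the proxy file first, then load the project settings with get_project_settings() and override only ROTATING_PROXY_LIST_PATH. The names fetch_proxies, write_proxy_file, override_proxy_path, PROXY_FILE and the spider name are hypothetical, and the proxy addresses are placeholders standing in for the output of your own script.

```python
from pathlib import Path

# Hypothetical location for the generated file; adjust to your project.
PROXY_FILE = Path("proxies.txt")


def fetch_proxies():
    """Hypothetical stand-in for the script that collects working proxies."""
    # Placeholder addresses from a documentation-only IP range.
    return ["203.0.113.10:8080", "203.0.113.11:3128"]


def write_proxy_file(proxies, path):
    """Write one proxy per line and return the path that was written."""
    path = Path(path)
    path.write_text("\n".join(proxies) + "\n", encoding="utf-8")
    return path


def override_proxy_path(settings, proxy_path):
    """Point ROTATING_PROXY_LIST_PATH at the freshly generated file."""
    # "cmdline" is Scrapy's highest built-in priority, so this value
    # takes precedence over whatever settings.py contains.
    settings.set("ROTATING_PROXY_LIST_PATH", str(proxy_path), priority="cmdline")
    return settings


def main(spider_name):
    # Imported here so the helpers above work without Scrapy installed.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # 1. Generate the proxy list before any crawler is created.
    proxy_path = write_proxy_file(fetch_proxies(), PROXY_FILE)
    # 2. Load settings.py, then override the proxy path.
    settings = override_proxy_path(get_project_settings(), proxy_path)
    # 3. Run the spider by its name within the project.
    process = CrawlerProcess(settings)
    process.crawl(spider_name)
    process.start()  # blocks until the crawl finishes
```

Calling main("my_spider") (with your own spider's name) from the directory containing scrapy.cfg would then replace the two manual steps with a single command.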

Solution 1
0 · Accepted · 2021-03-23 00:21:38
