I currently have a long running Python script in a screen session on an AWS EC2 instance that executes commands like
from subprocess import call
call('''scrapy crawl my_spider -a year=2005 -a month=may
--set FEED_URI=/home/ubuntu/my_spider/data/2005_may.json
--set FEED_FORMAT=jsonlines''', shell=True)
over all combinations of year and month for the years 2000-2017 and the months October-June. Many of the individual commands have completed, and I can reattach to the screen session and see that it is scraping data properly, but I see no files in /home/ubuntu/my_spider/data.
Will the files appear after the Python script completes, or should I stop it now because something is wrong?
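For reference, the driver loop described above can be sketched like this (a hypothetical reconstruction; the spider name, path, and month list are illustrative):

```python
from itertools import product

# One crawl command per (year, month) pair, each writing its own
# JSON Lines feed file named after the pair.
years = list(range(2000, 2018))                # 2000-2017 inclusive
months = ["october", "november", "december",   # October-June season
          "january", "february", "march",
          "april", "may", "june"]

commands = [
    ["scrapy", "crawl", "my_spider",
     "-a", f"year={year}", "-a", f"month={month}",
     "--set", f"FEED_URI=/home/ubuntu/my_spider/data/{year}_{month}.json",
     "--set", "FEED_FORMAT=jsonlines"]
    for year, month in product(years, months)
]
print(len(commands))  # 18 years x 9 months = 162 crawl commands
```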
When the crawler starts a Spider, FileFeedStorage opens the output file immediately, so if no output files have appeared after startup, something has clearly gone wrong.
Strictly speaking, this doesn't answer the original question, but it still deserves mention. The issue turned out to be that call was not passing the FEED_URI and FEED_FORMAT options through to scrapy. With shell=True, the unescaped newlines inside the triple-quoted string act as command separators: the shell runs the first line (scrapy crawl my_spider -a year=2005 -a month=may) as one complete command and treats the --set lines as separate commands. Since that first line is itself a valid crawl invocation, scrapy ran and exited normally, just without any feed configured, which is most likely why no error was propagated back. Changing it to
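The failure mode is easy to reproduce with a harmless command in place of scrapy. With shell=True, everything after an unescaped newline is executed as a separate shell command, so options on later lines never reach the first program, and the overall return code can still be 0:

```python
from subprocess import run

# The newline ends the first command; "echo second" runs separately.
# The first echo never sees any text from the second line.
result = run('''echo first --set opt=1
echo second''', shell=True, capture_output=True, text=True)
print(result.stdout)      # two lines: one per shell command
print(result.returncode)  # 0 -- no error is surfaced
```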
call(["scrapy", "crawl", "my_spider",
"-a", "year=2005",
"-a", "month=may",
"--set", "FEED_URI=/home/ubuntu/my_spider/data/2005_may.json",
"--set", "FEED_FORMAT=jsonlines"], cwd="/home/ubuntu/my_spider/")
worked, though it should be noted that this is not the recommended practice for running Scrapy from a script (the Scrapy documentation suggests its CrawlerProcess API for that).
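If you already have the command as a single string, the standard library's shlex.split converts it into the argument list that subprocess expects when shell=True is not used (a minimal sketch, reusing the paths from the question):

```python
import shlex

# shlex.split tokenizes a one-line command string the way a POSIX
# shell would, producing a list suitable for subprocess.call/run.
cmd = shlex.split(
    "scrapy crawl my_spider -a year=2005 -a month=may "
    "--set FEED_URI=/home/ubuntu/my_spider/data/2005_may.json "
    "--set FEED_FORMAT=jsonlines"
)
print(cmd)
```

Note that shlex.split will not fix the multiline string above; the newlines still have to be removed or escaped first.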