
How do Scrapy's feed exports work when writing to the local filesystem?

I currently have a long-running Python script in a screen session on an AWS EC2 instance that executes commands like

from subprocess import call 

call('''scrapy crawl my_spider -a year=2005 -a month=may 
--set FEED_URI=/home/ubuntu/my_spider/data/2005_may.json 
--set FEED_FORMAT=jsonlines''', shell=True)

over all combinations of year and month for the years 2000-2017 and the months October through June. Many of the individual commands have completed, and I can reattach to the screen session and see that it's scraping data properly, but I see no files in /home/ubuntu/my_spider/data.

Will the files appear after the Python script completes, or should I stop it now because something is wrong?

FileFeedStorage opens the file as soon as the crawler starts the spider, so if no output file has appeared after startup, something is clearly wrong.
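For reference, here is a minimal sketch of what a local-filesystem feed storage does, paraphrasing the behavior described above rather than Scrapy's exact source. The key point is that the output file is created the moment the spider opens, not when the crawl finishes:

import os

class FileFeedStorage:
    def __init__(self, uri):
        # e.g. /home/ubuntu/my_spider/data/2005_may.json
        self.path = uri

    def open(self, spider):
        # Called when the spider starts: create any missing directories
        # and open the output file immediately, in append-binary mode.
        dirname = os.path.dirname(self.path)
        if dirname and not os.path.exists(dirname):
            os.makedirs(dirname)
        return open(self.path, 'ab')

    def store(self, file):
        # Called when the spider closes: items have already been written
        # through the exporter, so just close the file.
        file.close()

So an empty /home/ubuntu/my_spider/data while spiders are visibly running is itself the diagnostic: the storage backend never opened those paths.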

Strictly speaking, this doesn't answer the original question, but it still deserves mention. The issue turned out to be that with shell=True, the shell treats the embedded newlines of the triple-quoted string as command separators: only the first line, scrapy crawl my_spider -a year=2005 -a month=may, actually ran, while the --set FEED_URI and --set FEED_FORMAT lines were attempted as separate commands and failed, so the scraped data was never written to the specified file. Nothing was propagated back to the script because those errors only went to the shell's stderr and the return code of call was never checked. Changing it to

call(["scrapy", "crawl", "my_spider", 
  "-a", "year=2005", 
  "-a", "month=may", 
  "--set",  "FEED_URI=/home/ubuntu/my_spider/data/2005_may.json",
  "--set", "FEED_FORMAT=jsonlines"], cwd="/home/ubuntu/my_spider/")

worked, but it should be said that this is not suggested practice for running Scrapy from a script.
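For completeness, the documented way to run Scrapy from a script is CrawlerProcess, which runs the crawl in-process instead of shelling out. A minimal sketch, reusing the spider name and paths from the question and assuming the script lives inside the Scrapy project:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set('FEED_URI', '/home/ubuntu/my_spider/data/2005_may.json')
settings.set('FEED_FORMAT', 'jsonlines')

process = CrawlerProcess(settings)
# Spider arguments are passed as keyword arguments, like -a on the CLI.
process.crawl('my_spider', year=2005, month='may')
process.start()  # blocks until the crawl finishes

One caveat: Twisted's reactor can only be started once per process, so to sweep many year/month combinations this way you would need to schedule every crawl before calling start(), or else keep launching one subprocess per crawl as above.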
