I have Visual Studio Code on a Windows machine, where I am building a new Scrapy crawler. The crawler works fine, but I want to debug the code, so I am adding this to my launch.json file:
{
    "name": "Scrapy with Integrated Terminal/Console",
    "type": "python",
    "request": "launch",
    "stopOnEntry": true,
    "pythonPath": "${config:python.pythonPath}",
    "program": "C:/Users/neo/.virtualenvs/Gers-Crawler-77pVkqzP/Scripts/scrapy.exe",
    "cwd": "${workspaceRoot}",
    "args": [
        "crawl",
        "amazon",
        "-o",
        "amazon.json"
    ],
    "console": "integratedTerminal",
    "env": {},
    "envFile": "${workspaceRoot}/.env",
    "debugOptions": [
        "RedirectOutput"
    ]
}
But I am unable to hit any breakpoints. PS: I took the JSON script from here: http://www.stevetrefethen.com/blog/debugging-a-python-scrapy-project-in-vscode
In order to execute the typical scrapy runspider <PYTHON_FILE> command, you must set the following config in your launch.json:
{
    "version": "0.1.0",
    "configurations": [
        {
            "name": "Python: Launch Scrapy Spider",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "args": [
                "runspider",
                "${file}"
            ],
            "console": "integratedTerminal"
        }
    ]
}
Set the breakpoints wherever you want and then debug.
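If you also want the debugged run to write a feed export, the same config can carry extra arguments in its args array. A minimal sketch, assuming an output file named output.json (the file name is illustrative, not from the original answer):

```json
{
    "name": "Python: Launch Scrapy Spider (with output)",
    "type": "python",
    "request": "launch",
    "module": "scrapy",
    "args": [
        "runspider",
        "${file}",
        "-o",
        "output.json"
    ],
    "console": "integratedTerminal"
}
```

Everything after "${file}" is passed straight through to the scrapy CLI, so any runspider flag can be appended the same way.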
Inside your Scrapy project folder, create a runner.py module with the following:
import os
from scrapy.cmdline import execute

os.chdir(os.path.dirname(os.path.realpath(__file__)))

try:
    execute(
        [
            'scrapy',
            'crawl',
            'SPIDER NAME',
            '-o',
            'out.json',
        ]
    )
except SystemExit:
    pass
Place a breakpoint on the line you wish to debug, then run runner.py with the VS Code debugger.
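For that last step, no Scrapy-specific launch configuration is needed: the stock "Python: Current File" entry that the VS Code Python extension generates is enough, since runner.py is an ordinary script. A sketch of that default entry (field values are the extension's defaults, not something from the answer above):

```json
{
    "name": "Python: Current File",
    "type": "python",
    "request": "launch",
    "program": "${file}",
    "console": "integratedTerminal"
}
```

With runner.py open as the active editor tab, starting this configuration runs the crawl under the debugger, so breakpoints set inside the spider are hit.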
Configure your launch.json file like this:
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Crawl with scrapy",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "cwd": "${fileDirname}",
            "args": [
                "crawl",
                "<SPIDER NAME>"
            ],
            "console": "internalConsole"
        }
    ]
}
Click on the tab in VS Code corresponding to your spider, then launch a debug session with that configuration.
I got it working. The simplest way is to make a runner script, runner.py:
import scrapy
from scrapy.crawler import CrawlerProcess
from g4gscraper.spiders.g4gcrawler import G4GSpider
process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'FEED_FORMAT': 'json',
'FEED_URI': 'data.json'
})
process.crawl(G4GSpider)
process.start() # the script will block here until the crawling is finished
Then I set breakpoints in the spider and launched the debugger on this file. Reference: https://doc.scrapy.org/en/latest/topics/practices.html
There is no need to modify launch.json; the default "Python: Current File (Integrated Terminal)" configuration works perfectly. For a Python 3 project, remember to place the runner.py file at the same level as the scrapy.cfg file (which is the project root).
Use the runner.py code from @naqushab's answer above. Note the process.crawl(className) call, where className is the spider class you want to set breakpoints in.
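Assuming a standard Scrapy project layout, the placement described above would look roughly like this (all names except scrapy.cfg and runner.py are illustrative):

```
myproject/
├── scrapy.cfg          <- project root marker
├── runner.py           <- place the runner here, next to scrapy.cfg
└── myproject/
    ├── __init__.py
    ├── settings.py
    └── spiders/
        └── my_spider.py
```

Running runner.py from the project root lets Scrapy find scrapy.cfg and the project settings without any extra configuration.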
You could also try with
{
    "configurations": [
        {
            "name": "Python: Scrapy",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "cwd": "${fileDirname}",
            "args": [
                "crawl",
                "${fileBasenameNoExtension}",
                "--loglevel=ERROR"
            ],
            "console": "integratedTerminal",
            "justMyCode": false
        }
    ]
}
but note that the file name must be the same as the spider's name, since ${fileBasenameNoExtension} is passed to scrapy crawl.
The --loglevel=ERROR flag makes the output less verbose ;)
I took @fmango's code and improved it.
Instead of writing a separate runner file, just paste these lines at the end of the spider module and run the Python debugger. That's all:
if __name__ == '__main__':
    import os
    from scrapy.cmdline import execute

    os.chdir(os.path.dirname(os.path.realpath(__file__)))

    SPIDER_NAME = MySpider.name
    try:
        execute(
            [
                'scrapy',
                'crawl',
                SPIDER_NAME,
                '-s',
                'FEED_EXPORT_ENCODING=utf-8',
            ]
        )
    except SystemExit:
        pass