I have Visual Studio Code on a Windows machine, where I am building a new Scrapy crawler. The crawler works fine, but I want to debug the code, so I am adding this to my launch.json file:
{
    "name": "Scrapy with Integrated Terminal/Console",
    "type": "python",
    "request": "launch",
    "stopOnEntry": true,
    "pythonPath": "${config:python.pythonPath}",
    "program": "C:/Users/neo/.virtualenvs/Gers-Crawler-77pVkqzP/Scripts/scrapy.exe",
    "cwd": "${workspaceRoot}",
    "args": [
        "crawl",
        "amazon",
        "-o",
        "amazon.json"
    ],
    "console": "integratedTerminal",
    "env": {},
    "envFile": "${workspaceRoot}/.env",
    "debugOptions": [
        "RedirectOutput"
    ]
}
But I am unable to hit any breakpoints. PS: I took the JSON script from here: http://www.stevetrefethen.com/blog/debugging-a-python-scrapy-project-in-vscode
In order to execute the typical scrapy runspider <PYTHON_FILE> command, you must set the following config in your launch.json:
{
    "version": "0.1.0",
    "configurations": [
        {
            "name": "Python: Launch Scrapy Spider",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "args": [
                "runspider",
                "${file}"
            ],
            "console": "integratedTerminal"
        }
    ]
}
Set the breakpoints wherever you want and then debug.
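If you also want the debugged run to write a feed export, the same config can carry extra arguments in its args array. A minimal sketch, assuming an output file named output.json (the file name is illustrative, not from the original answer):

```json
{
    "name": "Python: Launch Scrapy Spider (with output)",
    "type": "python",
    "request": "launch",
    "module": "scrapy",
    "args": [
        "runspider",
        "${file}",
        "-o",
        "output.json"
    ],
    "console": "integratedTerminal"
}
```

Everything after "${file}" is passed straight through to the scrapy CLI, so any runspider flag can be appended the same way.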
Inside your Scrapy project folder, create a runner.py module with the following:
import os
from scrapy.cmdline import execute

os.chdir(os.path.dirname(os.path.realpath(__file__)))

try:
    execute(
        [
            'scrapy',
            'crawl',
            'SPIDER NAME',
            '-o',
            'out.json',
        ]
    )
except SystemExit:
    pass
Place a breakpoint on the line you wish to debug, then run runner.py with the VS Code debugger.
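For that last step, no Scrapy-specific launch configuration is needed: the stock "Python: Current File" entry that the VS Code Python extension generates is enough, since runner.py is an ordinary script. A sketch of that default entry (field values are the extension's defaults, not something from the answer above):

```json
{
    "name": "Python: Current File",
    "type": "python",
    "request": "launch",
    "program": "${file}",
    "console": "integratedTerminal"
}
```

With runner.py open as the active editor tab, starting this configuration runs the crawl under the debugger, so breakpoints set inside the spider are hit.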
Configure your launch.json file like this:
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Crawl with scrapy",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "cwd": "${fileDirname}",
            "args": [
                "crawl",
                "<SPIDER NAME>"
            ],
            "console": "internalConsole"
        }
    ]
}
Click on the tab in VS Code corresponding to your spider, then launch a debug session with that configuration.
I got it working. The simplest way is to make a runner script, runner.py:
import scrapy
from scrapy.crawler import CrawlerProcess
from g4gscraper.spiders.g4gcrawler import G4GSpider
process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'FEED_FORMAT': 'json',
'FEED_URI': 'data.json'
})
process.crawl(G4GSpider)
process.start() # the script will block here until the crawling is finished
Then I set breakpoints in the spider and launched the debugger on this file. Reference: https://doc.scrapy.org/en/latest/topics/practices.html
There is no need to modify launch.json; the default "Python: Current File (Integrated Terminal)" configuration works perfectly. For a Python 3 project, remember to place the runner.py file at the same level as the scrapy.cfg file (which is the project root).
Use the runner.py code from @naqushab's answer above. Note the process.crawl(className) call, where className is the spider class you want to set breakpoints in.
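Assuming a standard Scrapy project layout, the placement described above would look roughly like this (all names except scrapy.cfg and runner.py are illustrative):

```
myproject/
├── scrapy.cfg          <- project root marker
├── runner.py           <- place the runner here, next to scrapy.cfg
└── myproject/
    ├── __init__.py
    ├── settings.py
    └── spiders/
        └── my_spider.py
```

Running runner.py from the project root lets Scrapy find scrapy.cfg and the project settings without any extra configuration.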
You could also try with
{
    "configurations": [
        {
            "name": "Python: Scrapy",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "cwd": "${fileDirname}",
            "args": [
                "crawl",
                "${fileBasenameNoExtension}",
                "--loglevel=ERROR"
            ],
            "console": "integratedTerminal",
            "justMyCode": false
        }
    ]
}
but note that the file name must be the same as the spider's name, since ${fileBasenameNoExtension} is passed to scrapy crawl.
The --loglevel=ERROR flag makes the output less verbose ;)
I took @fmango's code and improved it.
Instead of writing a separate runner file, just paste these lines at the end of the spider module and run the Python debugger. That's all:
if __name__ == '__main__':
    import os
    from scrapy.cmdline import execute

    os.chdir(os.path.dirname(os.path.realpath(__file__)))

    SPIDER_NAME = MySpider.name
    try:
        execute(
            [
                'scrapy',
                'crawl',
                SPIDER_NAME,
                '-s',
                'FEED_EXPORT_ENCODING=utf-8',
            ]
        )
    except SystemExit:
        pass