
Debugging Scrapy Project in Visual Studio Code

I have Visual Studio Code on a Windows machine, on which I am building a new Scrapy crawler. The crawler works fine, but I want to debug the code, so I added this to my launch.json file:

{
    "name": "Scrapy with Integrated Terminal/Console",
    "type": "python",
    "request": "launch",
    "stopOnEntry": true,
    "pythonPath": "${config:python.pythonPath}",
    "program": "C:/Users/neo/.virtualenvs/Gers-Crawler-77pVkqzP/Scripts/scrapy.exe",
    "cwd": "${workspaceRoot}",
    "args": [
        "crawl",
        "amazon",
        "-o",
        "amazon.json"
    ],
    "console": "integratedTerminal",
    "env": {},
    "envFile": "${workspaceRoot}/.env",
    "debugOptions": [
        "RedirectOutput"
    ]
}

But I am unable to hit any breakpoints. PS: I took the JSON script from here: http://www.stevetrefethen.com/blog/debugging-a-python-scrapy-project-in-vscode

In order to execute the typical scrapy runspider <PYTHON_FILE> command, you must put the following configuration into your launch.json:

{
    "version": "0.1.0",
    "configurations": [
        {
            "name": "Python: Launch Scrapy Spider",
            "type": "python",
            "request": "launch",
            "module": "scrapy",
            "args": [
                "runspider",
                "${file}"
            ],
            "console": "integratedTerminal"
        }
    ]
}

Set the breakpoints wherever you want and then debug.
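For reference, ${file} resolves to the file currently open in the editor, so any self-contained spider file can be opened and launched with this configuration. A minimal sketch of such a file (a hypothetical books_spider.py, used here only for illustration):

# books_spider.py -- hypothetical standalone spider for `scrapy runspider`
import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["http://books.toscrape.com"]

    def parse(self, response):
        # Put a breakpoint on the next line, open this file, and start the configuration above
        for title in response.css("article.product_pod h3 a::attr(title)").getall():
            yield {"title": title}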

  1. Inside your Scrapy project folder, create a runner.py module with the following:

     import os
     from scrapy.cmdline import execute

     os.chdir(os.path.dirname(os.path.realpath(__file__)))

     try:
         execute(
             [
                 'scrapy',
                 'crawl',
                 'SPIDER NAME',
                 '-o',
                 'out.json',
             ]
         )
     except SystemExit:
         pass
  2. Place a breakpoint in the line you wish to debug

  3. Run runner.py with the VS Code debugger (a sample launch configuration is sketched below)
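
For step 3, the default "Python: Current File" configuration is usually enough. A minimal sketch of such a launch.json entry (the name is arbitrary) looks like this:

{
    "name": "Python: Current File",
    "type": "python",
    "request": "launch",
    "program": "${file}",
    "console": "integratedTerminal"
}

With runner.py open in the editor, select this configuration and start debugging; breakpoints set inside the spider will be hit.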

Configure your launch.json file like this:

"version": "0.2.0",
"configurations": [
    {
        "name": "Crawl with scrapy",
        "type": "python",
        "request": "launch",
        "module": "scrapy",
        "cwd": "${fileDirname}",
        "args": [
            "crawl",
            "<SPIDER NAME>"
        ],
        "console": "internalConsole"
    }
]

Open the tab in VS Code corresponding to your spider, then launch a debug session with this configuration.

I got it working. The simplest way is to create a runner script, runner.py:

import scrapy
from scrapy.crawler import CrawlerProcess

from g4gscraper.spiders.g4gcrawler import G4GSpider

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'data.json'
})

process.crawl(G4GSpider)
process.start() # the script will block here until the crawling is finished

Then I added breakpoints in the spider while I launched debugger on this file. Reference: https://doc.scrapy.org/en/latest/topics/practices.html
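
As a variant (following the same Scrapy practices page linked above), the runner can load your project's settings.py instead of hard-coding settings. A minimal sketch, assuming the same G4GSpider import as above:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from g4gscraper.spiders.g4gcrawler import G4GSpider

# Load the Scrapy project's settings.py instead of passing settings inline
process = CrawlerProcess(get_project_settings())

process.crawl(G4GSpider)
process.start()  # blocks here until the crawl is finished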

You don't need to modify launch.json; the default "Python: Current File (Integrated Terminal)" configuration works perfectly. For a Python 3 project, remember to place the runner.py file at the same level as the scrapy.cfg file (which is the project root).

Use the runner.py code as @naqushab shows above. Note the process.crawl(className) call, where className is the spider class you want to set the breakpoint in.

You could also try this configuration:

{
  "configurations": [
    {
        "name": "Python: Scrapy",
        "type": "python",
        "request": "launch",
        "module": "scrapy",
        "cwd": "${fileDirname}",
        "args": [
            "crawl",
            "${fileBasenameNoExtension}",
            "--loglevel=ERROR"
        ],
        "console": "integratedTerminal",
        "justMyCode": false
    }
  ]
}

Note that the spider file's name must be the same as the spider's name, since ${fileBasenameNoExtension} is what gets passed to scrapy crawl.

The --loglevel=ERROR flag makes the output less verbose ;)
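
For example (a hypothetical spider used only to illustrate the naming), a file named quotes.py whose spider declares name = "quotes" satisfies this requirement, because ${fileBasenameNoExtension} resolves to quotes:

# quotes.py -- the file's base name matches the spider's `name` attribute
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # ${fileBasenameNoExtension} also resolves to "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}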

I applied @fmango's code and improved it.

  1. Instead of writing a separate runner file, just paste these lines of code at the end of the spider file.

  2. Run the Python debugger. That's all.

if __name__ == '__main__':
    import os
    from scrapy.cmdline import execute

    os.chdir(os.path.dirname(os.path.realpath(__file__)))

    SPIDER_NAME = MySpider.name
    try:
        execute(
            [
                'scrapy',
                'crawl',
                SPIDER_NAME,
                '-s',
                'FEED_EXPORT_ENCODING=utf-8',
            ]
        )
    except SystemExit:
        pass
