
Scrapy crawl wrong spider

In "scrapy crawl [spider-name] fault", the OP says:

In the spiders folder of my project I have two spiders named spider1 and spider2. Now when I run the command scrapy crawl spider1 in my root project folder, it calls spider2.py instead of spider1.py. When I delete spider2.py from my project, it calls spider1.py.

I have experienced this exact same behavior and used this exact same solution. The responses to the OP all boil down to deleting all .pyc files.

I have cleaned spider1.pyc, spider2.pyc and __init__.pyc. Now when I run scrapy crawl spider1 in the root folder of my project, it actually runs spider2.py, but a spider1.pyc file is generated instead of spider2.pyc.

I have seen exactly this behavior as well.

But the docs say nothing about these gotchas and workarounds. https://doc.scrapy.org/en/latest/intro/tutorial.html

"name: identifies the Spider. It must be unique within a project, that is, you can't set the same name for different Spiders."

https://doc.scrapy.org/en/1.0/topics/spiders.html#scrapy.spiders.Spider "name: A string which defines the name for this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it's required."

This makes sense: the name is how Scrapy knows which spider to run. But it's not working, so what's missing? Thanks.
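The lookup behavior the docs describe can be sketched as a plain name-to-class registry. The class names and `build_registry` function below are illustrative only, not Scrapy's actual SpiderLoader, but they show why duplicate names are forbidden: the later registration silently wins.

```python
# Hypothetical sketch of name-based spider lookup (not Scrapy's code).

class Spider1:
    name = "spider1"

class Spider2:
    name = "spider1"   # duplicate name: collides with Spider1

def build_registry(spider_classes):
    registry = {}
    for cls in spider_classes:
        registry[cls.name] = cls  # a later entry overwrites an earlier one
    return registry

registry = build_registry([Spider1, Spider2])
# registry["spider1"] now points at Spider2, even if you meant Spider1.
```

This is the same shape of surprise as a stale .pyc shadowing a source file: the name resolves, but to the wrong class.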

EDIT: Ok, so it happened again. This is my traceback:

(aishah) malikarumi@Tetuoan2:~/Projects/aishah/acquire$ scrapy crawl crawl_h4
Traceback (most recent call last):
  File "/home/malikarumi/Projects/aishah/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/cmdline.py", line 141, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/crawler.py", line 238, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/crawler.py", line 129, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/crawler.py", line 325, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/spiderloader.py", line 33, in from_settings
    return cls(settings)
  File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/spiderloader.py", line 20, in __init__
    self._load_all_spiders()
  File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/spiderloader.py", line 28, in _load_all_spiders
    for module in walk_modules(name):
  File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 661, in exec_module
  File "<frozen importlib._bootstrap_external>", line 767, in get_code
  File "<frozen importlib._bootstrap_external>", line 727, in source_to_code
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "/home/malikarumi/Projects/aishah/acquire/acquire/spiders/crawl_h3.py", line 19
    (follow=True, callback='parse_item'),))
           ^
SyntaxError: invalid syntax

PLEASE NOTE: I called crawl_h4. I got crawl_h3. I left crawl_h3 as is, including the syntax error, so I would have something to compare as I refactor. This syntax error is not in crawl_h4.

The settings are unchanged at default. The docs also say "Arguments provided by the command line are the ones that take most precedence, overriding any other options. You can explicitly override one (or more) settings using the -s (or --set) command line option." https://doc.scrapy.org/en/latest/topics/settings.html#topics-settings

I see a reference to frozencopy in the traceback above. The docs talk about using it to make the settings immutable (https://doc.scrapy.org/en/latest/topics/api.html). I don't know what the use case for that is, but I didn't select it, and I'm not sure how to unselect it if that's the problem.

None of your spiders can have syntax errors, even the ones you are not running: Scrapy imports all of the spiders in your project, even when you only want to run one of them. Just because it reports errors in your other spiders does not mean it isn't running the spider you called. I have had similar experiences where Scrapy caught errors in spiders I was not currently trying to run, but it still ran the spider I wanted in the end. Fix your syntax error, then verify that your spider ran in some other way, such as with a print statement or by collecting different data than your other spiders.
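The failure mode in the traceback can be reproduced in miniature. Scrapy's walk_modules imports every module under the spiders package; the sketch below instead compiles every .py file in a directory, which fails the same way: a syntax error in any one file aborts the whole load, even if you only asked for one spider. The helper name and file contents here are illustrative.

```python
import pathlib
import tempfile

def load_all_sources(directory):
    """Compile every .py file in a directory, in name order.

    Like importing every module in a package, this raises SyntaxError
    as soon as any single file fails to parse.
    """
    for path in sorted(pathlib.Path(directory).glob("*.py")):
        compile(path.read_text(), str(path), "exec")

with tempfile.TemporaryDirectory() as d:
    spiders = pathlib.Path(d)
    (spiders / "crawl_h4.py").write_text("name = 'crawl_h4'\n")          # valid
    (spiders / "crawl_h3.py").write_text("(follow=True, callback=,)\n")  # broken

    try:
        load_all_sources(spiders)
        failed = None
    except SyntaxError as err:
        failed = pathlib.Path(err.filename).name
```

Even though only crawl_h4.py is wanted, the load dies on crawl_h3.py — the same thing that happens when `scrapy crawl crawl_h4` tracebacks into crawl_h3.py.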
