
Scrapy crawl wrong spider

In scrapy crawl [spider-name] fault, the OP says:

In the spider folder of my project I have two spiders named spider1 and spider2… Now when I write the command scrapy crawl spider1 in my root project folder, it calls spider2.py instead of spider1.py. When I delete spider2.py from my project, it calls spider1.py.

I have experienced this exact same behavior and used this exact same solution. The responses to the OP all boil down to deleting all .pyc files.
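For reference, a minimal sketch of that cleanup, run from the project root (this simply removes every compiled bytecode file under the current directory; the path assumption is mine):

import pathlib

# Delete all stale .pyc files under the current directory,
# including any inside __pycache__ directories.
for pyc in pathlib.Path('.').rglob('*.pyc'):
    pyc.unlink()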

I have cleaned spider1.pyc, spider2.pyc and __init__.pyc. Now when I run scrapy crawl spider1 in the root folder of my project, it actually runs spider2.py, but a spider1.pyc file is generated instead of spider2.pyc.

I have seen exactly this behavior as well.

But the docs don't say anything about all these gotchas and workarounds: https://doc.scrapy.org/en/latest/intro/tutorial.html

"name: identifies the Spider. It must be unique within a project, that is, you can't set the same name for different Spiders."

https://doc.scrapy.org/en/1.0/topics/spiders.html#scrapy.spiders.Spider "name: A string which defines the name for this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it's required."
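To make the quoted requirement concrete, here is a minimal sketch of two spiders with distinct names, mirroring the OP's spider1/spider2 setup (the class names and URLs are hypothetical):

import scrapy

class Spider1(scrapy.Spider):
    name = 'spider1'  # `scrapy crawl spider1` selects this class by this string
    start_urls = ['http://example.com/a']

    def parse(self, response):
        yield {'url': response.url}

class Spider2(scrapy.Spider):
    name = 'spider2'  # must not collide with 'spider1'
    start_urls = ['http://example.com/b']

    def parse(self, response):
        yield {'url': response.url}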

This makes sense, since it is how Scrapy knows which spider to run, but it's not working, so what's missing? Thanks.

EDIT: OK, so it happened again. This is my traceback:

(aishah) malikarumi@Tetuoan2:~/Projects/aishah/acquire$ scrapy crawl crawl_h4
Traceback (most recent call last):
File "/home/malikarumi/Projects/aishah/bin/scrapy", line 11, in <module>
sys.exit(execute())
File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy /cmdline.py", line 141, in execute
cmd.crawler_process = CrawlerProcess(settings)
File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/crawler.py", line 238, in __init__
super(CrawlerProcess, self).__init__(settings)
File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/crawler.py", line 129, in __init__
self.spider_loader = _get_spider_loader(settings)
File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/crawler.py", line 325, in _get_spider_loader
return loader_cls.from_settings(settings.frozencopy())
File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/spiderloader.py", line 33, in from_settings
return cls(settings)
File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/spiderloader.py", line 20, in __init__
self._load_all_spiders()
File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/spiderloader.py", line 28, in _load_all_spiders
for module in walk_modules(name):
File "/home/malikarumi/Projects/aishah/lib/python3.5/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
submod = import_module(fullpath)
File "/usr/lib/python3.5/importlib/__init__.py", line 126, in  import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 986, in _gcd_import
File "<frozen importlib._bootstrap>", line 969, in _find_and_load
File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 661, in exec_module
File "<frozen importlib._bootstrap_external>", line 767, in get_code
File "<frozen importlib._bootstrap_external>", line 727, in source_to_code
File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
File "/home/malikarumi/Projects/aishah/acquire/acquire/spiders/crawl_h3.py",
line 19  (follow=True, callback='parse_item'),))
               ^
SyntaxError: invalid syntax

PLEASE NOTE: I called crawl_h4; I got crawl_h3. I left crawl_h3 as is, including the syntax error, so I would have something to compare against as I refactor. This syntax error is not in crawl_h4.
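For comparison, the failing line in crawl_h3.py looks like a CrawlSpider rules tuple with the Rule(LinkExtractor(...)) part dropped: a bare (follow=True, callback='parse_item') is not valid Python, because keyword arguments can only appear inside a call. A syntactically valid version would look something like this sketch (the class name and URL are hypothetical, since crawl_h3's full source isn't shown):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CrawlH3Spider(CrawlSpider):
    name = 'crawl_h3'
    start_urls = ['http://example.com']

    # Each entry must be a complete Rule(...) call.
    rules = (
        Rule(LinkExtractor(), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'url': response.url}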

The settings are unchanged from the defaults. The docs also say: "Arguments provided by the command line are the ones that take most precedence, overriding any other options. You can explicitly override one (or more) settings using the -s (or --set) command line option." https://doc.scrapy.org/en/latest/topics/settings.html#topics-settings
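For example, an explicit command-line override looks like this (using the standard LOG_LEVEL setting purely as an illustration):

scrapy crawl crawl_h4 -s LOG_LEVEL=DEBUG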

I see a reference to frozencopy in the traceback above. The docs talk about using this to make the settings immutable: https://doc.scrapy.org/en/latest/topics/api.html. I don't know what the use case for that is, but I didn't select it, and I'm not sure how to unselect it if that's the problem.

None of your spiders can have syntax errors, even if you are not running that spider. I am assuming Scrapy imports all of your spiders even when you only want to run one of them (the traceback above supports this: _load_all_spiders walks every module in the spiders package). Just because it is catching errors in your other spiders does not mean it isn't running the spider you called. I have had similar experiences where Scrapy caught errors in spiders I was not trying to run, but it still ran the spider I wanted in the end. Fix your syntax error, and verify that your spider ran in some other way, such as adding a print statement or collecting different data than your other spiders do.
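As a sketch of that verification idea: drop a log line into the spider you expect to run and confirm it appears in the crawl output (the spider shown here is hypothetical; self.logger is the spider's built-in logger):

import scrapy

class Spider1(scrapy.Spider):
    name = 'spider1'
    start_urls = ['http://example.com']

    def parse(self, response):
        # If this message appears in the output, spider1 really did run.
        self.logger.info('spider %s is handling %s', self.name, response.url)
        yield {'url': response.url}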
