In some Stack Overflow questions I've seen accepted answers where the __init__ method of the scrapy.Spider superclass is overridden by the user-defined spider, for example in "selenium with scrapy for dynamic page".
My question is: what are the risks of doing so? The __init__ of the superclass looks like this:
    class Spider(object_ref):
        """Base class for scrapy spiders. All spiders must inherit from this
        class.
        """

        name = None
        custom_settings = None

        def __init__(self, name=None, **kwargs):
            if name is not None:
                self.name = name
            elif not getattr(self, 'name', None):
                raise ValueError("%s must have a name" % type(self).__name__)
            self.__dict__.update(kwargs)
            if not hasattr(self, 'start_urls'):
                self.start_urls = []
So, if I were to define an __init__ in my spider that inherits from this class and didn't include a call to the superclass's __init__, would I be breaking Scrapy functionality? How do I mitigate that risk? By calling the super's __init__ in my spider? I'm looking for best practices for Scrapy, and also a better understanding of __init__ calls in the context of class inheritance.
None, if you use super().__init__(*args, **kwargs).
Anything else is a risk. You would be copying code from the __init__ method of Spider in a specific Scrapy version, so the only safe upgrade path involves checking how the Spider.__init__ implementation changes in new Scrapy versions and applying those changes to your custom implementation every time you upgrade Scrapy.
If you can implement the same logic while keeping a call to super().__init__(*args, **kwargs), that would be best.
If not, looking for an alternative implementation, or opening a feature request so that Scrapy can accommodate your use case in an upgrade-safe way, would be the better long-term solution.
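For example, a spider that accepts a custom argument while staying upgrade-safe might look like this (a minimal sketch; the spider name and the max_pages argument are made up for illustration):

    import scrapy

    class ProductsSpider(scrapy.Spider):
        name = 'products'

        def __init__(self, max_pages=10, *args, **kwargs):
            # Handle the custom argument ourselves ...
            self.max_pages = int(max_pages)
            # ... and let Spider.__init__ deal with name, start_urls
            # and any remaining keyword arguments, whatever the
            # installed Scrapy version does with them.
            super().__init__(*args, **kwargs)

You can then pass the custom argument from the command line with scrapy crawl products -a max_pages=5, and any future change to Spider.__init__ is picked up automatically.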
If you look at Spider.__init__, it only takes care of self.name and self.start_urls. If you handle these yourself as class attributes, just like the example answer you mentioned, you can skip the __init__ method altogether and it will still work just fine.
In Python, __init__ is just a function that gets called for custom initialization; if you don't define it, it's equivalent to writing def __init__(self): pass.
super().__init__ is good to have for cooperative inheritance, where you have multiple base classes. For spiders it's mostly irrelevant, unless you are writing a ton of related spiders that actually need cooperative inheritance.
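As a hypothetical sketch of what cooperative inheritance would look like here (the JsonExportMixin class and export_path argument are made up for illustration):

    import scrapy

    class JsonExportMixin:
        def __init__(self, *args, export_path='items.json', **kwargs):
            self.export_path = export_path
            # super() follows the MRO, so this ends up calling
            # Spider.__init__ when the mixin is combined with a spider.
            super().__init__(*args, **kwargs)

    class ExportingSpider(JsonExportMixin, scrapy.Spider):
        name = 'exporting'
        start_urls = ['http://quotes.toscrape.com/']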
tl;dr: you can skip it altogether. Just make sure you define name and start_urls either in your __init__ or as class attributes.
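For instance, this minimal sketch defines both as class attributes and never touches __init__ (the spider name is illustrative; quotes.toscrape.com is the Scrapy tutorial sandbox site):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        # Both attributes live at class level, so no __init__ is needed;
        # the inherited Spider.__init__ still runs with its defaults.
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            for text in response.css('div.quote span.text::text').getall():
                yield {'text': text}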
I get it now. Thanks.
To preserve the functionality of a superclass's __init__ while also extending it in your custom subclass, you'd do this: in the subclass __init__ method, add your custom keyword args and end the signature with *args, **kwargs. Then explicitly call super().__init__(*args, **kwargs) in the body of the __init__. Like this:
    class SubClass(SuperClass):
        def __init__(self, custom_1, custom_2, *args, **kwargs):
            # Your code here that handles the custom args
            super().__init__(*args, **kwargs)
The custom arguments will be handled by your custom code, and then *args and **kwargs will be consumed by the superclass's __init__. Be careful to get the order of the __init__ calls right if they depend on each other (see the sketch below).
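As a hypothetical illustration of that ordering dependency: if your custom code reads an attribute that the superclass's __init__ sets, the super call has to come first.

    class SuperClass:
        def __init__(self, name=None, **kwargs):
            self.name = name

    class SubClass(SuperClass):
        def __init__(self, suffix, *args, **kwargs):
            # The super call must run first: self.name does not
            # exist until SuperClass.__init__ sets it.
            super().__init__(*args, **kwargs)
            self.label = '%s-%s' % (self.name, suffix)

    print(SubClass('v1', name='demo').label)  # prints: demo-v1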
A perfect example of this whole pattern is SeleniumRequest in the scrapy-selenium middleware.
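From memory, its __init__ is roughly the following (paraphrased; check the scrapy-selenium repository for the exact current signature):

    from scrapy import Request

    class SeleniumRequest(Request):
        def __init__(self, wait_time=None, wait_until=None,
                     screenshot=False, script=None, *args, **kwargs):
            # Store the Selenium-specific options on the request ...
            self.wait_time = wait_time
            self.wait_until = wait_until
            self.screenshot = screenshot
            self.script = script
            # ... and hand everything else (url, callback, etc.)
            # to the regular Request.__init__.
            super().__init__(*args, **kwargs)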