
What are the risks of overriding a scrapy.spider's __init__ method?

In some Stack Overflow questions I've seen accepted answers where the __init__ method of the scrapy.Spider superclass is overridden by the user-defined spider. For example: selenium with scrapy for dynamic page.

My question is, what are the risks of doing so? The __init__ of the superclass looks like this:

class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """

    name = None
    custom_settings = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

So, if I define an __init__ in my spider that inherits from this class and don't include a call to the superclass's __init__, would I be breaking Scrapy functionality? How do I mitigate that risk: by calling the super's __init__ in my spider? I'm looking for best practices for Scrapy and also a better understanding of __init__ calls in the context of class inheritance.

None, if you use super().__init__(*args, **kwargs).

Anything else is a risk. You are copying code from the __init__ method of Spider in a specific Scrapy version, hence the only safe upgrade path involves checking how the Spider.__init__ implementation changes in new Scrapy versions and applying the changes to your custom implementation as you upgrade Scrapy.

If you can implement the same logic while keeping a call to super().__init__(*args, **kwargs), that would be best.
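For instance, here is a minimal sketch of that approach (the spider name, the category argument, and the URL are made up for illustration): the override adds its own logic but still delegates to the base class first.

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"

    def __init__(self, category="widgets", *args, **kwargs):
        # Let Spider.__init__ handle the name check and the
        # **kwargs attribute updates first.
        super().__init__(*args, **kwargs)
        # Then layer the custom initialization on top.
        self.category = category
        self.start_urls = [f"https://example.com/{category}"]

    def parse(self, response):
        yield {"url": response.url}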

If not, looking for alternative implementations, or opening a feature request so that Scrapy can accommodate your use case in an upgrade-safe way, would be better long-term solutions.

If you look at Spider.__init__, it only takes care of self.name and self.start_urls. If you handle these yourself as class attributes, just like the example answer you mentioned, you can skip the __init__ method altogether and it will still work just fine.
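As a sketch of that (the spider name is a placeholder; quotes.toscrape.com is Scrapy's public demo site):

import scrapy

class QuotesSpider(scrapy.Spider):
    # name and start_urls defined as class attributes; no __init__
    # override is needed, the inherited Spider.__init__ picks both up.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for text in response.css("span.text::text").getall():
            yield {"text": text}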

In Python, __init__ is just a method that gets called for custom initialization, and if you don't define one, the parent class's __init__ (here, Spider.__init__) is used unchanged.

Calling super().__init__ is good to have for cooperative inheritance, where you have multiple base classes. For a single spider it's mostly irrelevant, unless you are writing a ton of related spiders that actually need cooperative inheritance.
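To illustrate what cooperative inheritance means here, a toy example (not Scrapy-specific; all names are invented): each __init__ peels off the keyword arguments it knows about and passes the rest along the method resolution order.

class LoggingMixin:
    def __init__(self, *args, log_prefix="", **kwargs):
        self.log_prefix = log_prefix
        super().__init__(*args, **kwargs)  # continue along the MRO

class RetryMixin:
    def __init__(self, *args, max_retries=3, **kwargs):
        self.max_retries = max_retries
        super().__init__(*args, **kwargs)

class Base:
    def __init__(self, name):
        self.name = name

class Worker(LoggingMixin, RetryMixin, Base):
    pass

w = Worker("w1", log_prefix="[w] ", max_retries=5)
print(w.name, w.log_prefix, w.max_retries)  # w1 [w]  5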

tl;dr: you can skip it altogether. Just make sure you define name and start_urls either in your __init__ or as class attributes.

I get it now. Thanks.

In order to preserve the functionality of a superclass's __init__ while also extending it in your custom subclass, you'd do the following.

In the subclass's __init__ signature you'd add your custom arguments, then end the parameter list with *args, **kwargs. Then explicitly call super().__init__(*args, **kwargs) in the body of the __init__, like this:

class SubClass(SuperClass):
    def __init__(self, custom_1, custom_2, *args, **kwargs):

        # Your code here that handles the custom args

        # Forward whatever is left to the superclass.
        super().__init__(*args, **kwargs)

The custom arguments will be handled by your custom code, then *args and **kwargs will be consumed by the superclass's __init__. Be careful to get the order of the __init__ calls right if they depend on each other.
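Applied to a spider, this same pattern lets arguments passed on the command line with -a flow through **kwargs into Spider.__init__ (the spider name, the tag argument, and the URL below are made up for illustration):

import scrapy

class TagSpider(scrapy.Spider):
    name = "tags"

    def __init__(self, tag="humor", *args, **kwargs):
        # `tag` is consumed here; any other -a arguments stay in
        # **kwargs and are applied by Spider.__init__ via
        # self.__dict__.update(kwargs).
        super().__init__(*args, **kwargs)
        self.start_urls = [f"https://quotes.toscrape.com/tag/{tag}/"]

    def parse(self, response):
        yield {"page": response.url}

You'd run it with something like scrapy crawl tags -a tag=life, and the tag keyword argument ends up in __init__.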

A perfect example of this whole pattern is SeleniumRequest in the scrapy-selenium middleware.
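From memory, its __init__ looks roughly like the sketch below (check the scrapy-selenium repository for the authoritative version): it consumes its Selenium-specific keyword arguments and forwards everything else to scrapy's Request.

from scrapy import Request

class SeleniumRequest(Request):
    """Sketch of the scrapy-selenium request subclass."""

    def __init__(self, wait_time=None, wait_until=None,
                 screenshot=False, script=None, *args, **kwargs):
        # The Selenium-specific options are stored on the request...
        self.wait_time = wait_time
        self.wait_until = wait_until
        self.screenshot = screenshot
        self.script = script
        # ...and everything else (url, callback, headers, ...) is
        # consumed by Request.__init__.
        super().__init__(*args, **kwargs)

In a spider you'd then yield SeleniumRequest(url=some_url, callback=self.parse, wait_time=10), and the url and callback arguments travel through **kwargs to the base Request.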
