Python basic types do not count inheritance

Question

Python's built-in types do not respect inheritance the way I expected. For example:

class MyUnicode(unicode):
    pass

mu = MyUnicode('xxx')

>>> type(mu)
<class '__main__.MyUnicode'> # ok

>>> type(mu + 'x')
<type 'unicode'> # why not <class '__main__.MyUnicode'> ?

>>> type(mu.strip())
<type 'unicode'> # why not <class '__main__.MyUnicode'> ?

Strings are immutable, so those two methods must return new objects. But why did the developers hardcode the unicode return type inside those methods instead of using the subclass? Does it prevent some potential drawbacks that I'm not aware of?

Answer 1

Personally, I would like to see how you could implement unicode to work the way you want without relying on implementation details of the subclass.

Suppose you try this (it is pseudocode, for obvious reasons):

def __add__(self, other):
    return type(self)(concatenate(self.string, other.string))

You are forcing the class to have a single-argument constructor. Why? In addition, mu + u'x' and u'x' + mu would be of different types, and it is also guaranteed to be less efficient.

There is no reason to do additional work only so that some unknown, unlikely subclass behaves in some specific way. I don't think unicode is designed to be subclassed. If you do subclass it and want behavior different from your base class, go ahead and override the relevant methods yourself.
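As a minimal sketch of "override the relevant methods yourself": the example below assumes Python 3, where str plays the role of unicode, and the class name MyStr is only illustrative. It is one possible way to do it, not the way the built-in type works.

```python
class MyStr(str):
    """Illustrative str subclass that re-wraps selected results."""

    def __add__(self, other):
        # str.__add__ returns a plain str, so wrap it back into our type.
        return type(self)(str.__add__(self, other))

    def __radd__(self, other):
        # Reached for `plain_str + MyStr(...)`, since str itself
        # does not handle the subclass specially.
        if not isinstance(other, str):
            return NotImplemented
        return type(self)(str.__add__(other, self))

    def strip(self, chars=None):
        return type(self)(str.strip(self, chars))


mu = MyStr(' xxx ')
print(type(mu + 'x').__name__)
print(type('x' + mu).__name__)
print(type(mu.strip()).__name__)
```

Note that every override calls type(self)(result), so the sketch only works while the subclass keeps a single-argument constructor, which is exactly the constraint criticized above.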

Answer 2

You didn't override .strip and __add__, which by default still return unicode objects and not instances of your class. Here is the source for the strip functionality (2.7.3). As for why the developers decided to return a plain unicode object (a PyUnicodeObject) from a unicode method instead of checking for a subclass first and returning that: I think the idea here is that you are given just enough to do it yourself.

Answer 3

Because Python tends to favor simplicity, it is usually not aware of subclasses when dealing with built-in types.

In the main implementation (written in C), it is easier to cast something to Py_UCS4 or PyListObject or another type defined in C than to dynamically fetch a type and call its constructor (which might not be compatible with the superclass constructor and could crash).

The rule of thumb is to replace all the methods you need if you want to fully replace a built-in class, but I advise against it.

Many times, it is simpler to just make it an attribute of a class and add methods as you need:

class MyUnicode(object):
    def __init__(self, u):
        self.u = unicode(u)
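A slightly fuller sketch of the same composition idea, assuming Python 3 (str instead of unicode). The class name Label and the particular methods it forwards are hypothetical choices for illustration.

```python
class Label:
    """Hypothetical wrapper that holds a str instead of subclassing it."""

    def __init__(self, text):
        self.text = str(text)

    def __add__(self, other):
        # Accept either another Label or a plain str.
        if isinstance(other, Label):
            other = other.text
        if not isinstance(other, str):
            return NotImplemented
        return Label(self.text + other)

    def strip(self, chars=None):
        # Delegate to the wrapped str, then re-wrap the result.
        return Label(self.text.strip(chars))

    def __str__(self):
        return self.text


label = (Label('  hello') + ' world  ').strip()
print(type(label).__name__, str(label))
```

Because the wrapper decides explicitly which operations exist, there is no question about what type a result has.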

Answer 4

After multiple comments from OP, I've decided to rewrite my answer from scratch.

Possible workaround:

class MyUnicodeMetaClass(type):

    autocast_methods = ('__add__', '__radd__', 'format')

    def __init__(cls, name, bases, attrs):   
        super(MyUnicodeMetaClass, cls).__init__(name, bases, attrs)
        for method_name in MyUnicodeMetaClass.autocast_methods:
            try:
                setattr(cls, method_name, cls.autocast_creator(method_name))
            except AttributeError:
                if method_name.startswith('__r'):
                    setattr(cls, method_name, cls.autocast_reverse(method_name))
                else:
                    raise

    def autocast_creator(cls, method_name):
        # Raises AttributeError if unicode lacks the method (e.g. __radd__),
        # which makes __init__ fall back to autocast_reverse.
        unicode().__getattribute__(method_name)
        def autocast_method(self, *args, **kwargs):
            method = unicode(self).__getattribute__(method_name)
            return cls(method(*args, **kwargs))
        return autocast_method

    def autocast_reverse(cls, method_name):
        method_name = method_name.replace('__r', '__', 1)
        def autocast_method(self, *args, **kwargs):
            method = unicode(args[0]).__getattribute__(method_name)
            return cls(method(self, *args[1:], **kwargs))
        return autocast_method


class MyUnicode(unicode):
    __metaclass__ = MyUnicodeMetaClass

a = MyUnicode(u'aaa {0}')
print a, type(a)
# aaa {0} <class '__main__.MyUnicode'>
b = a + u'bbb'
print b, type(b)
# aaa {0}bbb <class '__main__.MyUnicode'>
c = u'ddd' + a
print c, type(c)
# dddaaa {0} <class '__main__.MyUnicode'>
d = a.format(115)
print d, type(d)
# aaa 115 <class '__main__.MyUnicode'>

It might need extending, but the basic skeleton is ready.
What is happening here?
1. A metaclass is used to alter the creation of the MyUnicode class.
2. The simple autocast_creator populates the MyUnicode class with methods that return MyUnicode instead of unicode.
3. The slightly more sophisticated autocast_reverse provides the reverse methods as well (like __radd__, which is needed when the first operand is unicode and the second one is MyUnicode).
This way, you don't have to manually override all the methods - just list them in the autocast_methods tuple.
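For readers on Python 3, the same three steps could be sketched roughly as follows. This is an adaptation under assumptions, not the snippet above: str replaces unicode, the metaclass= keyword replaces __metaclass__, and the names AutocastMeta, _autocast and _autocast_reverse are made up for this example.

```python
class AutocastMeta(type):
    """Metaclass that wraps selected str methods to return the subclass."""

    autocast_methods = ('__add__', '__radd__', 'format')

    def __init__(cls, name, bases, attrs):
        super().__init__(name, bases, attrs)
        for method_name in AutocastMeta.autocast_methods:
            if hasattr(str, method_name):
                setattr(cls, method_name, cls._autocast(method_name))
            elif method_name.startswith('__r'):
                # str has no __radd__, so build it from __add__.
                setattr(cls, method_name, cls._autocast_reverse(method_name))
            else:
                raise AttributeError(method_name)

    def _autocast(cls, method_name):
        base_method = getattr(str, method_name)

        def autocast_method(self, *args, **kwargs):
            return cls(base_method(str(self), *args, **kwargs))
        return autocast_method

    def _autocast_reverse(cls, method_name):
        forward = getattr(str, method_name.replace('__r', '__', 1))

        def autocast_method(self, other):
            # Operands are swapped: `other` was on the left-hand side.
            return cls(forward(str(other), str(self)))
        return autocast_method


class MyStr(str, metaclass=AutocastMeta):
    pass


a = MyStr('aaa {0}')
b = a + 'bbb'
c = 'ddd' + a
d = a.format(115)
print(type(b).__name__, type(c).__name__, type(d).__name__)
```

As with the original, the wrappers assume that the subclass can be constructed from a single string argument.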

Background information:

Inheritance:
Object-oriented programming is intended to reflect the real world as well as possible.
Inheritance is no exception here.

In the real world, a group of elephants is always a group of animals. No doubt about it.
A group of animals might be a group of elephants, but it can also be a group of various animals, and there may be no elephant in it at all.
This is because any group of elephants is a special kind of group of animals.
So it can be reflected in a computer program by defining ElephantGroup as a subclass of AnimalGroup.
How can ElephantGroup extend AnimalGroup? For example, by defining a new field ivory_weight and a new method toot().
Consider a simple operation such as ElephantGroup() + AnimalGroup().
What is the expected class of the result? AnimalGroup - no magic here, and rats, dolphins etc. won't become elephants. Rats and dolphins don't provide ivory and they can't toot, so forcing them to do so is not expected behavior.
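The analogy can be written down as a small sketch (Python 3 syntax). Only the class names, ivory_weight and toot() come from the text above; the constructor arguments and the animals list are assumptions made for the example.

```python
class AnimalGroup:
    def __init__(self, animals):
        self.animals = list(animals)

    def __add__(self, other):
        if not isinstance(other, AnimalGroup):
            return NotImplemented
        # The result is deliberately the base class: a mixed group
        # cannot be assumed to consist of elephants only.
        return AnimalGroup(self.animals + other.animals)


class ElephantGroup(AnimalGroup):
    def __init__(self, animals, ivory_weight=0):
        super().__init__(animals)
        self.ivory_weight = ivory_weight

    def toot(self):
        return 'toot!'


mixed = ElephantGroup(['elephant'], ivory_weight=12) + AnimalGroup(['rat', 'dolphin'])
print(type(mixed).__name__, mixed.animals)
print(hasattr(mixed, 'toot'))
```

Had __add__ used type(self)(...) instead, the call would also have lost ivory_weight, since the base class has no idea what extra arguments the subclass constructor expects.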

Let's get back to MyUnicode and unicode.
Machines aren't intelligent in the sense presented in The Matrix or Terminator.
The Python interpreter doesn't understand the purpose of MyUnicode.
Consider a class extending unicode or str (it doesn't really matter here) that is named EmailAddress and is intended to hold an e-mail address. No surprise :)
Now we have a code snippet:

a = EmailAddress(u'example@example.com')
b = u'/!\n'
c = a + b
d = b + a

Still expecting c and d to be instances of EmailAddress? (Or MyUnicode?)
If you answered yes, then please tell me:
1. What if EmailAddress.__init__(...) contains carefully crafted logic checking whether an argument is likely to be a valid e-mail address, and raises an exception if it isn't?
2. How can the interpreter know that it can safely initialize a MyUnicode instance with any unicode instance without executing __init__? Please also remember that this is Python, and __init__ can even be changed dynamically at runtime.
Remember - we can always cast an instance to any of its ancestors. The reverse operation can't be done implicitly. It would be a mess if the interpreter implicitly cast instances to subclasses.
The strip() method follows the same rules - characters are removed from the original and a reference to a new unicode instance is returned (unless nothing needs to be stripped from an exact unicode object, in which case CPython may return the original object itself).
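To make the EmailAddress argument concrete, here is a hedged sketch assuming Python 3 (str instead of unicode). The validation pattern is deliberately naive and invented for this example, and the check lives in __new__ rather than __init__ because that is the usual place to validate an immutable type; the helper is_valid is hypothetical as well.

```python
import re


class EmailAddress(str):
    """Hypothetical str subclass that validates its value on creation."""

    # Deliberately naive pattern, for illustration only.
    _pattern = re.compile(r'[^@\s]+@[^@\s/]+')

    def __new__(cls, value):
        if not cls._pattern.fullmatch(value):
            raise ValueError('not an e-mail address: %r' % (value,))
        return super().__new__(cls, value)


def is_valid(value):
    """Return True if value would be accepted by EmailAddress."""
    try:
        EmailAddress(value)
    except ValueError:
        return False
    return True


a = EmailAddress('example@example.com')
b = '/!\n'
c = a + b  # plain str: nothing was validated, nothing was promised
d = b + a  # plain str as well

print(type(c).__name__, type(d).__name__)
print(is_valid(c), is_valid(d))
```

If concatenation tried to build an EmailAddress from c or d, it would have to either raise or bypass the validation, and neither is something the base type can decide on behalf of the subclass.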

Referring to instance class:

In one of your comments you said that cls refers to the class of the instance that started the execution chain...
cls is a naming convention used in metaclasses, classmethods and the __new__() method to indicate that we don't have an instance - we only have a class.
In fact, we can't access an instance in any of these cases - __new__(), however, should return a new instance in most cases.

I guess you were thinking of the identifier.__class__ attribute. It has nothing to do with the execution chain either. It points to the actual class of the instance referred to by identifier.
Why do you expect the methods of unicode and str to use it to create subclass instances?
Implicitly casting something to its subclass is not expected behavior. I know, one of the operands in your code is MyUnicode, but the other one is unicode - even in strip(), the default argument effectively stands for a unicode containing whitespace characters.
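The difference between cls and __class__ can be illustrated with a short sketch, assuming Python 3 and using str in place of unicode. The names MyStr, Loud and from_parts are hypothetical.

```python
class MyStr(str):
    def __new__(cls, value=''):
        # cls is the class being instantiated; no instance exists yet.
        return super().__new__(cls, value)

    @classmethod
    def from_parts(cls, *parts):
        # cls is whichever class the method was called on.
        return cls(''.join(parts))


class Loud(MyStr):
    pass


s = Loud.from_parts('a', 'b')
print(s.__class__.__name__)      # the actual class of the instance
print(type(s.upper()).__name__)  # inherited str methods do not consult it
```

Here cls is supplied by the call itself, while s.__class__ is simply a fact about the finished object that the built-in methods never look up when building their results.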

Some unicode implementation details:

Python unicode and string types are immutable. Immutable means that any modifying operation on them returns another instance of unicode or string respectively.
Another instance does not always mean a brand-new one, because the interpreter is free to reuse an existing equal object. This is an implementation detail, not a language guarantee.
What does that mean? See the code:

a = u'aaa'
b = u'aaa'

What happened here?
A unicode instance was created to initialize a.
For b, the interpreter may reuse that same object (CPython can share equal literals that are compiled together) or create a second, equal one.
If the object is reused, only its reference counter is incremented.

Now that we know this, consider this code:

a = u'aaa'
b = MyUnicode(u'aa')
c = b + u'a'

What exactly is stored in the c variable? A reference to a plain unicode object that is equal to the one referenced by a. Whether it is the very same object is not guaranteed, and for a string built at runtime it usually is not.
Why would changing c not affect a, even if they did share an object? Because unicode is immutable and the underlying object remains unchanged.
If the next line were c = c + u'b', then c would get a reference to a new/other instance, and the object it referenced before would have its reference counter decreased.
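A small sketch of what can and cannot be relied on here, assuming Python 3 (str instead of unicode) and an illustrative MyStr subclass. Equality and the result type are well defined; object identity is left unasserted on purpose, because it depends on the interpreter.

```python
class MyStr(str):
    pass


a = 'aaa'
b = MyStr('aa')
c = b + 'a'

print(c == a)          # equal contents
print(type(c) is str)  # plain str, not MyStr

# Identity is an implementation detail; do not rely on either outcome.
same_object = c is a
```

Code that needs to compare strings should therefore use ==, and code that needs the subclass back has to construct it explicitly.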

Conclusions:

The Python unicode and str classes are consistent and predictable.
There are some types that are hard to derive from, due to optimization, special purposes or implementation details.
unicode and str are not really intended to be subclassed, although it can be achieved, for instance with a metaclass as in my snippet.

As always I'm looking for any constructive criticism and comments.

Good luck!

4 answers

solution1
2 2013-07-03 14:08:13

solution2
1 2013-07-03 14:07:50

solution3
0 2013-07-03 14:25:03

solution4
0 2013-07-03 14:27:10