Python's built-in types do not respect inheritance the way I expected. For example:
class MyUnicode(unicode):
    pass
mu = MyUnicode('xxx')
>>> type(mu)
<class '__main__.MyUnicode'> # ok
>>> type(mu + 'x')
<type 'unicode'> # why not <class '__main__.MyUnicode'> ?
>>> type(mu.strip())
<type 'unicode'> # why not <class '__main__.MyUnicode'> ?
Strings are immutable, so those two methods must return new objects. But why did the developers hardcode the unicode return type inside those methods instead of using the subclass? Does it prevent some potential drawbacks I'm not aware of?
Personally, I would like to see how you would implement unicode to work the way you want without relying on implementation details of the subclass.
Suppose you try this (it's pseudocode, for obvious reasons):
def __add__(self, other):
    return type(self)(concatenate(self.string, other.string))
You are forcing the class to have a single-argument constructor. Why? In addition, mu + u'x' and u'x' + mu would be of different types, and the approach is guaranteed to be less efficient.
There is no reason to do additional work just so that some unknown, unlikely subclass behaves in some specific way. I don't think unicode is designed to be subclassed. If you do subclass it and want behavior different from your base class, go ahead and override the relevant methods yourself.
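A minimal sketch of overriding the relevant methods by hand. The `text_type` alias is my own shim so the snippet runs on both Python 2 (`unicode`) and Python 3 (`str`), and the sketch assumes the subclass keeps a single-argument constructor:

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3: str is the unicode type


class MyUnicode(text_type):
    """Subclass that re-wraps the results of a few chosen methods."""

    def __add__(self, other):
        # Delegate to the base implementation, then wrap the plain result.
        return type(self)(text_type.__add__(self, other))

    def __radd__(self, other):
        # Used for `plain + mu`; the base type defines no __radd__ itself.
        if not isinstance(other, text_type):
            return NotImplemented
        return type(self)(text_type.__add__(other, self))

    def strip(self, *args):
        return type(self)(text_type.strip(self, *args))


mu = MyUnicode(u'  xxx ')
left = mu + u'x'         # MyUnicode
right = u'x' + mu        # MyUnicode, via __radd__
stripped = mu.strip()    # MyUnicode
upper = mu.upper()       # still the plain base type: not overridden
```

Every method you do not override keeps returning the base type, which is exactly the work the base class declines to do for you.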
You didn't override .strip and __add__, which by default still return unicode objects and not instances of your class. Here is the source for the strip functionality (2.7.3). As for why the developers decided to return a plain unicode object from a unicode method instead of checking for a subclass first and returning that: I think the idea is that you are given just enough to do it yourself.
Because Python tends to favor simplicity, it is usually not aware of subclasses when dealing with built-in types.
In the main implementation (written in C), it is easier to cast something to PyUnicodeObject or PyListObject or another type defined in C than to fetch a type dynamically and call its constructor (which might not be compatible with the superclass constructor and could crash).
The rule of thumb is to replace all the methods you need if you want to fully replace a built-in class, but I advise against it.
Many times, it is simpler to just make it an attribute of a class and add methods as you need:
class MyUnicode(object):
    def __init__(self, u):
        self.u = unicode(u)
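To make the composition idea a bit more concrete, here is a small sketch that adds a couple of methods to such a wrapper. The class name `WrappedText`, the methods chosen, and the `text_type` shim (so it runs on Python 2 and 3) are my own assumptions, not part of the original answer:

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3


class WrappedText(object):
    """Holds a text value as an attribute instead of inheriting from it."""

    def __init__(self, u):
        self.u = text_type(u)

    def __add__(self, other):
        # Accept either another wrapper or a plain text value.
        other_text = getattr(other, 'u', other)
        return WrappedText(self.u + other_text)

    def strip(self, chars=None):
        return WrappedText(self.u.strip(chars))


w = WrappedText(u' abc ').strip() + u'd'
```

Because the wrapper controls every method it exposes, there is no question about which type comes back; the cost is that it is not accepted where a real text object is required.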
After multiple comments from OP, I've decided to rewrite my answer from scratch.
Possible workaround:
class MyUnicodeMetaClass(type):
    # Names of the unicode methods whose results should be re-wrapped.
    autocast_methods = ('__add__', '__radd__', 'format')

    def __init__(cls, name, bases, attrs):
        super(MyUnicodeMetaClass, cls).__init__(name, bases, attrs)
        for method_name in MyUnicodeMetaClass.autocast_methods:
            try:
                setattr(cls, method_name, cls.autocast_creator(method_name))
            except AttributeError:
                # unicode defines no reverse methods such as __radd__,
                # so build them from the forward method instead.
                if method_name.startswith('__r'):
                    setattr(cls, method_name, cls.autocast_reverse(method_name))
                else:
                    raise

    def autocast_creator(cls, method_name):
        # Probe once so that a missing method raises AttributeError here.
        unicode().__getattribute__(method_name)

        def autocast_method(self, *args, **kwargs):
            method = unicode(self).__getattribute__(method_name)
            return cls(method(*args, **kwargs))
        return autocast_method

    def autocast_reverse(cls, method_name):
        method_name = method_name.replace('__r', '__', 1)

        def autocast_method(self, *args, **kwargs):
            method = unicode(args[0]).__getattribute__(method_name)
            return cls(method(self, *args[1:], **kwargs))
        return autocast_method


class MyUnicode(unicode):
    __metaclass__ = MyUnicodeMetaClass

a = MyUnicode(u'aaa {0}')
print a, type(a)
# aaa {0} <class '__main__.MyUnicode'>
b = a + u'bbb'
print b, type(b)
# aaa {0}bbb <class '__main__.MyUnicode'>
c = u'ddd' + a
print c, type(c)
# dddaaa {0} <class '__main__.MyUnicode'>
d = a.format(115)
print d, type(d)
# aaa 115 <class '__main__.MyUnicode'>
It might require extending but the base skeleton is ready.
What is happening here?
1. A metaclass is used to alter the creation of the MyUnicode class.
2. The simple autocast_creator is used to populate the MyUnicode class with methods that should return MyUnicode instead of unicode.
3. The slightly more sophisticated autocast_reverse is used to provide reverse methods as well (like __radd__, which is needed when the first operand is unicode and the second one is MyUnicode).
This way, you don't have to manually override all the methods; just list them in the autocast_methods tuple.
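The `__metaclass__` attribute is Python 2 syntax. As a hedged alternative that runs on both Python 2 and 3, the same "list the method names once" idea can be sketched with a class decorator instead of a metaclass. The names `autocast`, `_make_wrapper` and `text_type` are hypothetical, and reverse methods such as `__radd__` are left out of this sketch because the base type does not define them:

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3


def _make_wrapper(name):
    # Look the method up on the base type once, at decoration time.
    base_method = getattr(text_type, name)

    def wrapper(self, *args, **kwargs):
        # Call the base implementation, then re-wrap in the caller's class.
        return type(self)(base_method(self, *args, **kwargs))

    wrapper.__name__ = name
    return wrapper


def autocast(*method_names):
    """Class decorator: re-wrap the results of the named text methods."""
    def decorate(cls):
        for name in method_names:
            setattr(cls, name, _make_wrapper(name))
        return cls
    return decorate


@autocast('__add__', 'format', 'strip')
class MyUnicode(text_type):
    pass


a = MyUnicode(u'aaa {0}')
b = a + u'bbb'
d = a.format(115)
e = u'ddd' + a  # no __radd__ wrapper here, so this stays a plain text object
```

Like the metaclass version, this assumes the subclass can be constructed from a single text argument.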
Background information:
Inheritance:
Object-oriented programming is intended to reflect the real world as well as possible. Inheritance is no exception here.
In the real world a group of elephants is always a group of animals. No doubt about it.
But a group of animals might be a group of elephants, it might be a group of various animals, and there might be no elephant in the group at all.
This is because any group of elephants is a special kind of group of animals.
So it can be reflected in a computer program by defining ElephantGroup as a subclass of AnimalGroup.
How can ElephantGroup extend AnimalGroup? For example, by defining a new field ivory_weight and a new method toot().
Consider a simple operation such as ElephantGroup() + AnimalGroup().
What is the class of the expected result? AnimalGroup: no magic here, and rats, dolphins etc. won't become elephants. Rats and dolphins don't provide ivory and they can't toot, so forcing them to do so is not expected behavior.
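The analogy can be sketched in a few lines. The class bodies, the `animals` attribute and the constructor signatures below are my own illustration; only the names ElephantGroup, AnimalGroup, ivory_weight and toot() come from the text above:

```python
class AnimalGroup(object):
    def __init__(self, animals):
        self.animals = list(animals)

    def __add__(self, other):
        if not isinstance(other, AnimalGroup):
            return NotImplemented
        # The sum is only known to be a group of animals, so return the base type.
        return AnimalGroup(self.animals + other.animals)


class ElephantGroup(AnimalGroup):
    def __init__(self, animals, ivory_weight=0):
        super(ElephantGroup, self).__init__(animals)
        self.ivory_weight = ivory_weight

    def toot(self):
        return 'toot!'


mixed = ElephantGroup(['elephant'], ivory_weight=40) + AnimalGroup(['rat', 'dolphin'])
```

Note that `type(self)(...)` would not even work here without guessing an ivory_weight for the rats and dolphins.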
Let's get back to MyUnicode and unicode.
Machines aren't intelligent in the sense presented in The Matrix or Terminator.
The Python interpreter doesn't understand what the purpose of MyUnicode is.
Consider a class extending unicode or str (it doesn't really matter here) that is named EmailAddress and is intended to hold an e-mail address. No surprise :)
Now we have this code snippet:
a = EmailAddress(u'example@example.com')
b = u'/!\n'
c = a + b
d = b + a
Still expecting c and d to be instances of EmailAddress? (Or MyUnicode?)
If you answered yes, then please tell me:
1. What if EmailAddress.__init__(...) contains well-crafted logic checking whether an argument is likely to be a valid e-mail address, and raises an exception if it isn't?
2. How can the interpreter know that it can safely initialize a MyUnicode instance with any unicode instance without executing __init__? Please also remember this is Python, and __init__ can even be changed dynamically at runtime.
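A sketch of the first point. The validation rule below is deliberately naive and entirely hypothetical, and `text_type` is a shim so the snippet runs on Python 2 and 3:

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3


class EmailAddress(text_type):
    def __init__(self, value):
        # Hypothetical, deliberately naive validity check.
        if u'@' not in value or u'\n' in value:
            raise ValueError('not an e-mail address: %r' % (value,))


a = EmailAddress(u'example@example.com')
b = u'/!\n'
c = a + b  # plain text object: the base __add__ never calls EmailAddress

# If __add__ tried to re-wrap its result, this is what would happen:
try:
    EmailAddress(c)
    rewrap_failed = False
except ValueError:
    rewrap_failed = True
```

Returning the base type is the only result the base class can produce without running, or silently skipping, the subclass's own checks.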
Remember: we can always cast an instance to any of its ancestors. The reverse operation can't be done implicitly. It would be a mess if the interpreter implicitly cast instances of object to instances of object's subclasses.
The strip() method follows the same rules: the unwanted characters are removed and a reference to a new unicode instance is returned (unless the exact same unicode object already exists, in which case a reference to the existing one is returned).
Referring to instance class:
In one of your comments you said cls refers to the class of the instance that started the execution chain...
cls is a naming convention used in metaclasses, classmethods and in the __new__() method to indicate that we don't have an instance, only a class.
In fact we can't access an instance in any of these cases; __new__(), however, should return a new instance in most cases.
I guess you were thinking of the identifier.__class__ attribute. It has nothing to do with the execution chain either. It points to the actual class of the instance referred to by identifier.
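A short sketch of the difference, using hypothetical classes `Base` and `Child`:

```python
class Base(object):
    def __new__(cls):
        # No instance exists yet; cls is the class being instantiated.
        return super(Base, cls).__new__(cls)

    @classmethod
    def describe(cls):
        # cls is whichever class the method was called on.
        return cls.__name__

    def actual_class(self):
        # __class__ is looked up on an existing instance.
        return self.__class__


class Child(Base):
    pass


child = Child()
```

Here `Child.describe()` and `child.actual_class()` both lead to Child, but for different reasons: the first because the call was made on Child, the second because the instance really is a Child.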
Why do you expect the methods of unicode and str to use it to create subclass instances?
Implicitly casting something to its subclass is not expected behavior. I know, one of the operands in your code is MyUnicode, but the other one is unicode. Even in strip(), the default argument is a unicode containing whitespace characters.
Some unicode implementation details:
Python's unicode and str types are immutable and unique (explanation is coming). Immutable means that any modifying operation on them returns another instance of unicode or str respectively.
Another instance would mean a new instance, but as I said, these types are unique.
What does that mean? See the code:
a = u'aaa'
b = u'aaa'
What happened here?
A new instance of unicode was created to initialize a.
For b, an existing unicode object was found, so no new instance was created.
Instead, the reference counter of the unicode object holding u'aaa' was incremented. (CPython can reuse equal literals like this; it is an optimization rather than a language guarantee.)
Now that we know this, consider this code:
a = u'aaa'
b = MyUnicode(u'aa')
c = b + u'a'
What exactly is stored in the c variable? A reference to a plain unicode object equal to the one referenced by a (whether it is literally the same object is an implementation detail, so don't rely on it).
Why won't changing c affect a? Because unicode is immutable and the underlying object remains unchanged.
If the next line were c = c + u'b', then c would get a reference to a new/other instance, and the object referenced by a would be left untouched.
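A quick way to check the parts of this that are guaranteed. The `text_type` shim is my own addition so the snippet runs on Python 2 and 3; the sketch compares values with `==` and deliberately makes no claim about `c is a`, since object identity here depends on the implementation:

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3


class MyUnicode(text_type):
    pass


a = u'aaa'
b = MyUnicode(u'aa')
c = b + u'a'

same_value = (c == a)               # equal contents
is_plain = type(c) is text_type     # result is the base type, not MyUnicode

c = c + u'b'                        # rebinds c to another object
a_unchanged = (a == u'aaa')         # a still refers to the original value
```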
Conclusions:
The Python unicode and str classes are consistent and predictable.
There are some types that are hard to derive from, due to optimization, special purposes or implementation details.
unicode and str are not really intended to be subclassed, although it can be achieved, for instance with a metaclass as in my snippet.
As always I'm looking for any constructive criticism and comments.
Good luck!