Cython implementation no faster than pure python

Question

For an exercise I've written a XOR doubly-linked list

%%cython

from cpython.object cimport PyObject
from cpython.ref cimport Py_XINCREF, Py_XDECREF
from libc.stdint cimport uintptr_t

cdef class Node:
    cdef uintptr_t _prev_xor_next
    cdef object val

    def __init__(self, object val, uintptr_t prev_xor_next=0):
        self._prev_xor_next=prev_xor_next
        self.val=val

    @property
    def prev_xor_next(self):
        return self._prev_xor_next
    @prev_xor_next.setter
    def prev_xor_next(self, uintptr_t p):
        self._prev_xor_next=p

    def __repr__(self):
        return str(self.val)


cdef class CurrentNode(Node):
    cdef uintptr_t _node, _prev_ptr
    def __init__(self, uintptr_t node, uintptr_t prev_ptr=0):
        self._node = node
        self._prev_ptr= prev_ptr

    @property
    def val(self):
        return self.node.val
    @property
    def node(self):
        ret=<PyObject *> self._node
        return <Node> ret
    @property
    def prev_ptr(self):
        return self._prev_ptr

    cdef CurrentNode forward(self):
        if self.node.prev_xor_next!=self._prev_ptr:
            return CurrentNode(self.node.prev_xor_next^self._prev_ptr, self._node)

    cdef CurrentNode backward(self):
        if self._prev_ptr:
            pp=<PyObject*>self._prev_ptr
            return CurrentNode(self._prev_ptr, self._node^(<Node> pp).prev_xor_next)

    def __repr__(self):
        return str(self.node)

cdef class XORList:
    cdef PyObject* first
    cdef PyObject* last
    cdef int length

    def __init__(self):
        self.length=0
    @property
    def head(self):
        return (<Node> self.first)

    @property
    def tail(self):
        return (<Node> self.last)

    cdef append(self, object val):
        self.length+=1
        #empty list
        if not self.first:
            t=Node(val)
            tp=(<PyObject*> t)
            self.first=tp
            Py_XINCREF(tp)
            self.last=tp
            Py_XINCREF(tp)

        #not empty
        else:
            new_node=Node(val, <uintptr_t> self.last)
            new_ptr=<PyObject*> new_node
            cur_last=<Node>self.last
            cur_last.prev_xor_next=cur_last.prev_xor_next^(<uintptr_t> new_ptr)
            Py_XINCREF(new_ptr)
            self.last=new_ptr
            Py_XINCREF(new_ptr)

    cpdef reverse(self):
        temp=self.last
        self.last=self.first
        self.first=temp

    def __repr__(self):
        return str(list(iter_XORList(self)))
    def __len__(self):
        return self.length

def iter_XORList(l):
    head=<PyObject*>l.head
    cur=CurrentNode(<uintptr_t> head)
    while cur:
        yield cur
        cur=cur.forward()

import time

start=time.time()
cdef XORList l=XORList()
for i in range(100000):
    l.append(i)
print('time xor ', time.time()-start)

start=time.time()
l1=[]
for i in range(100000):
    l1.append(i)
print('time regular ', time.time()-start)

using the builtin list above I consistently get ~10x worse performance on the cython linked list.

time xor  0.10768294334411621
time regular  0.010972023010253906

When I profile the loop for the xorlist I get:

         700003 function calls in 1.184 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.184    1.184 <string>:1(<module>)
        1    0.039    0.039    1.184    1.184 _cython_magic_14cf45d2116440f3df600718d58e4f95.pyx:108(list_check)
   100000    0.025    0.000    0.025    0.000 _cython_magic_14cf45d2116440f3df600718d58e4f95.pyx:11(__init__)
    99999    0.019    0.000    0.019    0.000 _cython_magic_14cf45d2116440f3df600718d58e4f95.pyx:16(__get__)
    99999    0.018    0.000    0.018    0.000 _cython_magic_14cf45d2116440f3df600718d58e4f95.pyx:19(__set__)
        1    0.000    0.000    0.000    0.000 _cython_magic_14cf45d2116440f3df600718d58e4f95.pyx:60(__init__)
   100000    0.937    0.000    0.999    0.000 _cython_magic_14cf45d2116440f3df600718d58e4f95.pyx:70(append)
   100000    0.113    0.000    1.146    0.000 line_profiler.py:111(wrapper)
        1    0.000    0.000    1.184    1.184 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
   100000    0.018    0.000    0.018    0.000 {method 'disable_by_count' of '_line_profiler.LineProfiler' objects}
   100000    0.015    0.000    0.015    0.000 {method 'enable_by_count' of '_line_profiler.LineProfiler' objects}

So, ignoring the calls to append it seems most of the time is spent in the special methods.

This brings me to my questions:

how can I speed this up
I thought extension types in Cython are implemented underneath via structs so what is causing the initializations of them to take so long

I also tried another custom implementation of an oridnary doubly-linked list in pure python and the timings of it and the cython xorlist are similar within 10% difference on my machine.

Answer 1

The three culprits from your profiling look to be Node's __init__ (which is unavoidable here), and __get__ and __set__ for the prev_xor_next property. My view is that you don't want the prev_xor_next property (or if you do it should be read-only) since it makes what should be a Cython internal accessible in Python.

Whether you delete the property or not, you are working in Cython here so you can write directly to the underlying C attribute _prev_xor_next . You may need to set cdef Node cur_last at the start of append (and maybe in other functions) to ensure that Cython knows the type of cur_last - I think it should be able to work it out but if you get AttributeErrors at runtime then this is what you need to do.

This change gives me a 30% speed increase (ie it's still slower than a regular list, but it's a noticeable improvement).

I'll outline a more drastic change that I possibly should have suggested on your first question about this problem. This really is a vague outline so no effort has been made to get it to work...

Node is entirely internal to your XORList class: it should not be used from Python and the lifetime of all the Nodes in an XORList is tied directly to the list. Therefore they should be destructed on the destruction of their owning XORList (or if the list shrinks, etc) and so do not need to be reference counted. Therefore Node should be a C struct rather than a Python object:
```
 cdef struct Node: uintptr_t prev_xor_next PyObject* val # with associated constructor- and destructor-like functions: cdef Node* make_node(object val, uintptr_t prev_xor_next): cdef Node* n = <Node*>malloc(sizeof(Node)) n.val = <PyObject*>val Py_XINCREF(n.val) n.prev_xor_next = prev_xor_next return n cdef void destroy_node(Node* n): Py_XDECREF(n.val) free(n)
```
XORList needs a __dealloc__ function that loops through the list calling destroy_node on each Node (it needs a __dealloc__ function anyway in your version too!)
CurrentNode needs to remain a Cython class, since this is your "iterator" interface. It can obviously no longer inherit from Node . I'd change it to:
```
 cdef class XORListIterator: cdef Node* current_node cdef XORList our_list
```
the point of the attribute our_list is to ensure that the XORList is kept alive at least as long as the CurrentNode - if you end up with an iterator for an XORList that no longer exists that the current_node attribute will be invalid. current_node is not owned by XORListIterator so no need for a destructor.

The danger with this scheme I think is making sure that if any changes to the XORList don't completely invalidate any existing XORListIterators to the point where you get crashes. I suspect this would also be an issue with your current version.

I suspect the built-in list will still remain competitive, since it is a well-written, efficient structure. Remember that list.append is usually a simple Py_INCREF , with an occasional array reallocation and copy. Yours always involves creation of a new Python object (the Node ) as well as some associated reference counting.

My alternative scheme avoids a lot of reference counting (both in terms of computational time and "you having to think about it" time), so I'd expect it to be much closer. It retain the disadvantage of a small memory allocation each append , which is unavoidable for a linked-list structure.

Addendum : to address the comment about "the convenience of a Cython class". In my view the two advantages of using a Cython class vs a struct are:

You get something fairly close to a struct, but don't have to worry about C pointers and the reference counting is taken care of. It's pretty clear that for this problem you're doing odd things to pointers and having to handle reference counting explicitly, so I don't think this is applies to you.
You can use it from Python - you aren't just restricted to Cython. In this case I think it's entirely an implementation detail of the XORList that shouldn't be exposed to Python users.

Therefore I think the main reasons to use Cython classes specifically don't apply to your problem. (For a lot of code the advantages do apply, of course!)

It's probably also worth adding that constructing Cython classes is probably one of their slower features - to support possible inheritance the construction process is quite "indirect". You've managed to create a benchmark that turns out to be almost all constructing - I'd guess it's a slightly skewed benchmark and the real case might not be that bad.

Cython implementation no faster than pure python

Question

1 answers

solution1
1 ACCPTED 2019-10-30 08:40:12

Cython implementation no faster than pure python

Question

1 answers

solution1 1 ACCPTED 2019-10-30 08:40:12

solution1
1 ACCPTED 2019-10-30 08:40:12