Same memory address for different strings in Cython

Question

I wrote a tree object in Cython that has many nodes, each containing a single Unicode character. I wanted to test whether the character gets interned if I use Py_UNICODE or str as the variable type. I'm trying to test this by creating multiple instances of the node class and getting the memory address of the character for each, but somehow I end up with the same memory address, even when the instances contain different characters. Here is my code:

from libc.stdint cimport uintptr_t

cdef class Node():
    cdef:
        public str character
        public unsigned int count
        public Node lo, eq, hi

    def __init__(self, str character):
        self.character = character

    def memory(self):
        return <uintptr_t>&self.character[0]

I am trying to compare the memory locations from Python like so:

a = Node("a")
a2 = Node("a")
b = Node("b")
print(a.memory(), a2.memory(), b.memory())

But the memory addresses that get printed are all the same. What am I doing wrong?

Answer 1

Obviously, what you are doing is not what you think you are doing.

self.character[0] doesn't return the address/reference of the first character (as would be the case for an array, for example), but a Py_UCS4 value (i.e. an unsigned 32-bit integer), which is copied to a (local, temporary) variable on the stack.

In your function, <uintptr_t>&self.character[0] gets you the address of that local variable on the stack, which happens to be the same every time because the stack layout is the same on every call to memory.

To make it clearer, compare this with a char * c_string, where &c_string[0] gives you the address of the first character in c_string.

Compare:

%%cython
from libc.stdint cimport uintptr_t

cdef char *c_string = "name";
def get_addresses_from_chars():
    for i in range(4):
        print(<uintptr_t>&c_string[i])

cdef str py_string="name";
def get_addresses_from_pystr():
    for i in range(4):
        print(<uintptr_t>&py_string[i])

And now:

>>> get_addresses_from_chars() # works  - different addresses every time
# ...7752
# ...7753
# ...7754
# ...7755
>>> get_addresses_from_pystr() # works differently - the same address.
# ...0672 
# ...0672
# ...0672
# ...0672

You can see it this way: c_string[...] is a C-level (cdef) operation, but py_string[...] is a Python-level operation and thus, by construction, cannot return an address.
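The same contrast can be reproduced from plain Python with ctypes. This is only a rough sketch: the names buf, base and element_address are made up for illustration, and a ctypes char array stands in for the char * from the Cython example. Elements of the C array live at consecutive addresses, while indexing a Python str hands back a value, not a location inside the string.

```python
import ctypes

# A mutable C char array holding b"name" plus a trailing NUL byte
# (a stand-in for the `char *c_string` in the Cython example).
buf = ctypes.create_string_buffer(b"name")
base = ctypes.addressof(buf)


def element_address(i):
    # View element i in place (shares memory with buf, no copy)
    # and ask ctypes for the address of that view.
    return ctypes.addressof(ctypes.c_char.from_buffer(buf, i))


c_addresses = [element_address(i) for i in range(4)]
print(c_addresses)  # consecutive addresses starting at `base`

# Indexing a Python str yields a value (a str of length 1),
# never a pointer into the string's internal buffer.
py_string = "name"
first = py_string[0]
print(type(first), first)
```

There is no Python-level equivalent of taking &py_string[i]; only objects have identities, and a character inside a str is not an object of its own.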

To influence the stack layout, you could use a recursive function:

def memory(self, level):
    if level==0 :
        return <uintptr_t>&self.character[0]
    else:
        return self.memory(level-1)

Now calling it with a.memory(0), a.memory(1) and so on will give you different addresses (unless tail-call optimization kicks in; I don't believe it will, but you could disable optimization (-O0) just to be sure). This is because, depending on the level (the recursion depth), the local variable whose address is returned sits in a different place on the stack.

To see whether Unicode objects are interned, it is enough to use id, which yields the address of the object (this is a CPython implementation detail), so you don't need Cython at all:

>>> id(a.character) == id(a2.character)
# True
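For strings that are not single characters, the same id-based check can be combined with sys.intern. The following is a minimal sketch; the helpers same_object and build are hypothetical names used only for illustration. Two equal strings built at runtime are not guaranteed to be the same object, but their interned versions are.

```python
import sys


def same_object(x, y):
    # In CPython, id() is the object's address, so equal ids for two
    # live objects mean they are one and the same object.
    return id(x) == id(y)


def build(parts):
    # Build the string at runtime so it is not a compile-time constant.
    return "".join(parts)


s1 = build(["cyt", "hon"])
s2 = build(["cyt", "hon"])

print(s1 == s2)             # equal contents
print(same_object(s1, s2))  # not guaranteed either way; usually False in CPython

# sys.intern returns the canonical copy for a given string value.
i1 = sys.intern(s1)
i2 = sys.intern(s2)
print(same_object(i1, i2))  # True: both refer to the interned copy
```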

or in Cython, doing the same thing id does (a little bit faster):

%%cython
from libc.stdint cimport uintptr_t
from cpython cimport PyObject
...
    def memory(self):
        # cast from object to PyObject, so the address can be used
        return <uintptr_t>(<PyObject*>self.character)

You need to cast the object to PyObject * so that Cython will let you use the address of the variable.

And now:

 >>> ...
 >>> print(a.memory(), a2.memory(), b.memory())
 # ...5800 ...5800 ...5000

If you want to get the address of the first code point in the Unicode object (which is not the same as the address of the string object), you can use <Py_UNICODE *>self.character, which Cython will replace with a call to PyUnicode_AsUnicode, e.g.:

%%cython
...   
def memory(self):
    return <uintptr_t>(<Py_UNICODE*>self.character), id(self.character)

and now:

>>> ...
>>> print(a.memory(), a2.memory(), b.memory())
# (...768, ...800) (...768, ...800) (...144, ...000)

i.e. "a" is interned and has a different address than "b", and the code-point buffer has a different address than the object containing it (as one would expect).
