I have a unicode string f. I want to memset it to 0. print f should display null (\\0)
I am using ctypes.memset to achieve this -
> >>> f
> u'abc'
> >>> print ("%s" % type(f))
> <type 'unicode'>
> >>> import ctypes
> **>>> ctypes.memset(id(f)+50,0,6)**
> **4363962530
> >>> f
> u'abc'
> >>> print f
> abc**
Why did the memory location not get memset in case of unicode string? It works perfectly for an str object.
Thanks for help.
First, this is almost certainly a very bad idea. Python expects strings to be immutable. There's a reason that even the C API won't let you change their contents after they're flagged ready. If you're just doing this to play around with the interpreter's implementation, that can be fun and instructive, but if you're doing it for any real-life purpose, you're probably doing something wrong.
In particular, if you're doing it for "security", what you almost certainly really want to do is to not create a unicode
in the first place, but instead create, say, a bytearray
with the UTF-16 or UTF-32 encoding of your string, which can be zeroed out in a way that's safe, portable, and a lot easier.
Anyway, there's no reason to expect that two completely different types should store their buffers at the same offset.
In CPython 2.x, a str
is a PyStringObject
:
typedef struct {
PyObject_VAR_HEAD
long ob_shash;
int ob_sstate;
char ob_sval[1];
} PyStringObject;
That ob_sval
is the buffer; the offset should be 36 on 64-bit builds and (I think) 24 on 32-bit builds.
In a comment, you say:
I read it somewhere and also the offset for a string type is 37 in my system which is what sys.getsizeof('') shows -> >>> sys.getsizeof('') 37
The offset for a string buffer is actually 36, not 37. And the fact that it's even that close is just a coincidence of the way str
is implemented. (Hopefully you can understand why by looking at the struct
definition—if not, you definitely shouldn't be writing code like this.) There's no reason to expect the same trick to work for some other type without looking at its implementation.
A unicode
is a PyUnicodeObject
:
typedef struct {
PyObject_HEAD
Py_ssize_t length; /* Length of raw Unicode data in buffer */
Py_UNICODE *str; /* Raw Unicode buffer */
long hash; /* Hash value; -1 if not set */
PyObject *defenc; /* (Default) Encoded version as Python
string, or NULL; this is used for
implementing the buffer protocol */
} PyUnicodeObject;
Its buffer is not even inside the object itself; that str
member is a pointer to the buffer (which is not guaranteed to be right after the struct). Its offset should be 24 on 64-bit builds, and (I think) 20 on 32-bit builds. So, to do the equivalent, you'd need to read the pointer there, then follow it to find the location to memset.
If you're using a narrow-Unicode build, it should look like this:
>>> ctypes.POINTER(ctypes.c_uint16 * len(g)).from_address(id(g)+24).contents[:]
[97, 98, 99]
That's the ctypes translation of finding (uint16_t *)(((char *)g)+24)
and reading the array that starts at *that
and ends at *(that+len(g))
, which is what you'd have to do if you were writing C code and didn't have access to the unicodeobject.h
header.
(In the the test I just quoted, g
is at 0x10a598090, while its src
points to 0x10a3b09e0
, so the buffer is not immediately after the struct, or anywhere near it; it's about 2MB before it.)
For a wide-Unicode build, the same thing with c_uint32
.
So, that should show you what you want to memset
.
And you should also see a serious implication for your attempt at "security" here. (If I have to point it out, that's yet another indication that you should not be writing this code.)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.