Efficient way to convert string to ctypes.c_ubyte array in Python

I have a string of 20 bytes, and I would like to convert it to a ctypes.c_ubyte array for bit field manipulation purposes.

 import ctypes
 str_bytes = '01234567890123456789'
 byte_arr = bytearray(str_bytes)
 raw_bytes = (ctypes.c_ubyte*20)(*(byte_arr))

Is there a way to avoid a deep copy from str to bytearray just for the sake of the cast?

Alternatively, is it possible to convert a string to a bytearray without a deep copy (with techniques like memoryview)?

I am using Python 2.7.

Performance results:

Using eryksun's and Brian Larsen's suggestions, here are the benchmarks, run in a vbox VM with Ubuntu 12.04 and Python 2.7.

  • method1 uses the approach from my original post
  • method2 uses ctypes from_buffer_copy
  • method3 uses ctypes cast/POINTER
  • method4 uses numpy

Results (1,000,000 iterations each):

  • method1 takes 3.87 sec
  • method2 takes 0.42 sec
  • method3 takes 1.44 sec
  • method4 takes 8.79 sec

Code:

import ctypes
import time
import numpy

str_bytes = '01234567890123456789'

def method1():
    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        byte_arr = bytearray(str_bytes)
        result = (ctypes.c_ubyte*20)(*(byte_arr))

    t1 = time.clock()
    print(t1-t0)

    return result

def method2():

    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        result = (ctypes.c_ubyte * 20).from_buffer_copy(str_bytes)

    t1 = time.clock()
    print(t1-t0)

    return result

def method3():

    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        result = ctypes.cast(str_bytes, ctypes.POINTER(ctypes.c_ubyte * 20))[0]

    t1 = time.clock()
    print(t1-t0)

    return result

def method4():

    result = ''
    t0 = time.clock()
    for x in xrange(0,1000000):     
        arr = numpy.asarray(str_bytes)
        result = arr.ctypes.data_as(ctypes.POINTER(ctypes.c_ubyte*len(str_bytes)))

    t1 = time.clock()
    print(t1-t0)

    return result

print(method1())
print(method2())
print(method3())
print(method4())

I don't think that's working the way you think. bytearray creates a copy of the string. Then the interpreter unpacks the bytearray sequence into a starargs tuple and merges this into another new tuple that has the other args (even though there are none in this case). Finally, the c_ubyte array initializer loops over the args tuple to set the elements of the c_ubyte array. That's a lot of work, and a lot of copying, to go through just to initialize the array.

Instead you can use the from_buffer_copy method, assuming the string is a bytestring with the buffer interface (not unicode):

import ctypes    
str_bytes = '01234567890123456789'
raw_bytes = (ctypes.c_ubyte * 20).from_buffer_copy(str_bytes)

That still has to copy the string, but it's only done once, and much more efficiently. As was stated in the comments, a Python string is immutable and could be interned or used as a dict key. Its immutability should be respected, even if ctypes lets you violate this in practice:

>>> from ctypes import *
>>> s = '01234567890123456789'
>>> b = cast(s, POINTER(c_ubyte * 20))[0]
>>> b[0] = 97
>>> s
'a1234567890123456789'
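By contrast, the array returned by from_buffer_copy owns its own memory, so writing to it leaves the source string untouched. A minimal sketch, written for Python 3, where the byte string is a bytes object (an assumption; the question targets Python 2.7, and the names src and raw_bytes are only illustrative):

```python
import ctypes

# Assumption: Python 3, so the 20-byte string is a bytes literal.
src = b'01234567890123456789'
arr_t = ctypes.c_ubyte * len(src)

# from_buffer_copy copies the bytes once into memory owned by the array.
raw_bytes = arr_t.from_buffer_copy(src)

# Mutating the array only touches the copy.
raw_bytes[0] = ord('a')

print(bytes(raw_bytes))   # contents of the modified copy
print(src)                # the original bytes object is unchanged
```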

Edit

I need to emphasize that I am not recommending using ctypes to modify an immutable CPython string. If you have to, then at the very least check sys.getrefcount beforehand to ensure that the reference count is 2 or less (the call adds 1). Otherwise, you will eventually be surprised by string interning for names (e.g. "sys") and code object constants. Python is free to reuse immutable objects as it sees fit. If you step outside of the language to mutate an 'immutable' object, you've broken the contract.

For example, if you modify an already-hashed string, the cached hash is no longer correct for the contents. That breaks it for use as a dict key. Neither another string with the new contents nor one with the original contents will match the key in the dict. The former has a different hash, and the latter has a different value. Then the only way to get at the dict item is by using the mutated string that has the incorrect hash. Continuing from the previous example:

>>> s
'a1234567890123456789'
>>> d = {s: 1}
>>> d[s]
1

>>> d['a1234567890123456789']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'a1234567890123456789'

>>> d['01234567890123456789']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '01234567890123456789'

Now consider the mess if the key is an interned string that's reused in dozens of places.
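If the data can live in a mutable buffer from the start, sharing memory with a ctypes array is legitimate and needs no copy at the ctypes step. The following is a sketch under that assumption, written for Python 3 (the names buf and shared are illustrative). Note that creating the bytearray from an existing string still copies, so this only helps when the bytes are read or built directly into a bytearray.

```python
import ctypes

# Assumption: the data already lives in a mutable bytearray.
buf = bytearray(b'01234567890123456789')
arr_t = ctypes.c_ubyte * len(buf)

# from_buffer requires a writable buffer and shares its memory (no copy).
shared = arr_t.from_buffer(buf)

shared[0] = ord('a')      # write through the ctypes array...
print(buf)                # ...and the bytearray reflects the change

buf[1] = ord('b')         # writes through the bytearray are visible too
print(shared[1])
```

While the ctypes array exists, the bytearray cannot be resized, because its buffer is exported. Passing an immutable bytes object to from_buffer raises a TypeError, which is the safe behaviour that the cast trick above bypasses.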


For performance analysis it's typical to use the timeit module. Prior to 3.3, timeit.default_timer varies by platform. On POSIX systems it's time.time, and on Windows it's time.clock.

import timeit

setup = r'''
import ctypes, numpy
str_bytes = '01234567890123456789'
arr_t = ctypes.c_ubyte * 20
'''

methods = [
  'arr_t(*bytearray(str_bytes))',
  'arr_t.from_buffer_copy(str_bytes)',
  'ctypes.cast(str_bytes, ctypes.POINTER(arr_t))[0]',
  'numpy.asarray(str_bytes).ctypes.data_as('
      'ctypes.POINTER(arr_t))[0]',
]

test = lambda m: min(timeit.repeat(m, setup))

>>> tabs = [test(m) for m in methods]
>>> trel = [t / tabs[0] for t in tabs]
>>> trel
[1.0, 0.060573711879182784, 0.261847116395079, 1.5389279092185282]
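For readers on Python 3, the same comparison can be sketched as follows. This is an adaptation rather than the original benchmark: it assumes a bytes literal in place of the Python 2 str, drops the numpy variant, and uses a smaller iteration count (the helper name best_time and its defaults are illustrative). Relative timings will differ between machines and interpreter versions, so none are quoted here.

```python
import timeit

# Assumption: Python 3, so str_bytes is a bytes literal.
setup = r'''
import ctypes
str_bytes = b'01234567890123456789'
arr_t = ctypes.c_ubyte * 20
'''

methods = [
    'arr_t(*bytearray(str_bytes))',                      # original approach
    'arr_t.from_buffer_copy(str_bytes)',                 # single copy
    'ctypes.cast(str_bytes, ctypes.POINTER(arr_t))[0]',  # no copy, unsafe to mutate
]

def best_time(stmt, number=20000, repeat=3):
    # Take the minimum over several runs to reduce scheduling noise.
    return min(timeit.repeat(stmt, setup, number=number, repeat=repeat))

tabs = [best_time(m) for m in methods]
trel = [t / tabs[0] for t in tabs]   # times relative to the first method

for stmt, rel in zip(methods, trel):
    print('%-52s %.3f' % (stmt, rel))
```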

Here is another solution for you to benchmark (I would be very interested in the results).

Using numpy might add some simplicity, depending on what the whole code looks like.

import numpy as np
import ctypes
str_bytes = '01234567890123456789'
arr = np.asarray(str_bytes)
aa = arr.ctypes.data_as(ctypes.POINTER(ctypes.c_ubyte*len(str_bytes)))
for v in aa.contents: print v
48
49
50
51
52
53
54
55
56
57
48
49
50
51
52
53
54
55
56
57
