Three-way comparing strings in Python 3

Say you want to optimize a (byte) string comparison-intensive algorithm implemented in Python. Since a central code path contains this sequence of statements

if s < t:
    # less than ...
elif t < s:
    # greater than ...
else:
    # equal ...

it would be great to optimize it to something like

r = bytes_compare(s, t)
if r < 0:
    # less than ...
elif r > 0:
    # greater than ...
else:
    # equal ...

where the (hypothetical) bytes_compare() would ideally just call the three-way comparison C function memcmp(), which is usually quite well optimized. This would reduce the number of string comparisons by up to half (one instead of two whenever s is not less than t). A very feasible optimization unless the strings are ultra short.

But how to get there with Python 3?

PS:

Python 3 has removed the three-way comparison global function cmp() and the magic method __cmp__(). And even in Python 2, the bytes class didn't have a __cmp__() member.
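For reference, the expression commonly used in place of the removed cmp() can be sketched as follows. The function name three_way is just an illustrative choice. Note that it still performs two string comparisons per call, so it only restores the interface and does not deliver the speedup asked for here.

```python
def three_way(s, t):
    """Return -1, 0 or 1 depending on how s orders relative to t.

    Pure-Python stand-in for the removed cmp(); it evaluates both
    s > t and s < t, so it does NOT save a comparison.
    """
    return (s > t) - (s < t)


print(three_way(b"bar", b"foo"))   # s sorts before t
print(three_way(b"foo", b"foo"))   # equal
print(three_way(b"foo", b"barx"))  # s sorts after t
```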

With the ctypes package it's straightforward to call memcmp(), but the foreign function call overhead with ctypes is prohibitively high.
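A minimal sketch of that ctypes variant might look like this. It assumes a POSIX system where the C library can be located via ctypes.util.find_library("c") (or through the running process when that returns None). The wrapper name bytes_compare is the hypothetical one from above, and the length handling is one possible way to break ties between a string and its prefix.

```python
import ctypes
import ctypes.util

# Assumption: POSIX platform; CDLL(None) falls back to the symbols
# already loaded into the current process, which include libc.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.memcmp.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t)
libc.memcmp.restype = ctypes.c_int


def bytes_compare(s, t):
    """Three-way compare two bytes objects via memcmp().

    Returns a negative, zero or positive int; only the sign is meaningful.
    """
    r = libc.memcmp(s, t, min(len(s), len(t)))
    if r == 0:
        # Common prefix is identical: the shorter string sorts first.
        r = (len(s) > len(t)) - (len(s) < len(t))
    return r
```

Each call goes through argument conversion in ctypes, which is where the overhead mentioned above comes from.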

Python 3 (including 3.6) simply doesn't include any three-way comparison support for strings. Although the internal implementations of the rich comparison operators __lt__(), __eq__() etc. do call memcmp() (in the C implementation of bytes, cf. Objects/bytesobject.c), there is no internal three-way comparison function that could be leveraged.

Thus, writing a C extension that provides a three-way comparison function by calling memcmp() is the next best thing:

#include <Python.h>
#include <string.h>  /* memcmp() */

/* Three-way compare two bytes objects.
 * Returns a Python int that is negative, zero or positive. */
static PyObject* cmp(PyObject* self, PyObject* args) {
    PyObject *a = 0, *b = 0;
    if (!PyArg_UnpackTuple(args, "cmp", 2, 2, &a, &b))
        return 0;
    if (!PyBytes_Check(a) || !PyBytes_Check(b)) {
        PyErr_SetString(PyExc_TypeError, "only bytes() strings supported");
        return 0;
    }
    Py_ssize_t n = PyBytes_GET_SIZE(a), m = PyBytes_GET_SIZE(b);
    /* Types were checked above, so the unchecked macros are safe here. */
    char *s = PyBytes_AS_STRING(a), *t = PyBytes_AS_STRING(b);
    int r = 0;
    if (n == m) {
        r = memcmp(s, t, n);
    } else if (n < m) {
        /* a is shorter: if it is a prefix of b, it sorts first. */
        r = memcmp(s, t, n);
        if (!r)
            r = -1;
    } else {
        /* b is shorter: if it is a prefix of a, a sorts last. */
        r = memcmp(s, t, m);
        if (!r)
            r = 1;
    }
    return PyLong_FromLong(r);
}
static PyMethodDef bytes_util_methods[] = {
    { "cmp", cmp, METH_VARARGS, "Three way compare 2 bytes() objects." },
    {0,0,0,0} };
static struct PyModuleDef bytes_util_def = {
    PyModuleDef_HEAD_INIT, "bytes_util", "Three way comparison for strings.",
    -1, bytes_util_methods };
PyMODINIT_FUNC PyInit_bytes_util(void) {
    /* No Py_Initialize() here: the interpreter is already running
     * when an extension module gets imported. */
    return PyModule_Create(&bytes_util_def);
}

Compile with:

gcc -Wall -O3 -fPIC -shared bytes_util.c -o bytes_util.so -I/usr/include/python3.6m

Test:

>>> import bytes_util
>>> bytes_util.cmp(b'foo', b'barx')
265725

In contrast to calling memcmp() via the ctypes package, this foreign call has about the same overhead as the builtin bytes comparison operators (as they are also implemented in C in the standard Python implementation). Note that only the sign of the result is meaningful: memcmp() guarantees a negative, zero or positive value, not a particular magnitude.
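To gain some confidence in such an extension, its results can be checked against the builtin operators. The following is a small sketch of such a check. The helper names sign, reference_compare and check_three_way and the sample list are illustrative assumptions, and the harness accepts any callable so that it also runs without the compiled module.

```python
import itertools


def sign(r):
    """Collapse any int to -1, 0 or 1 (memcmp() results vary in magnitude)."""
    return (r > 0) - (r < 0)


def reference_compare(s, t):
    """Ground truth derived from the builtin bytes operators."""
    return (s > t) - (s < t)


def check_three_way(compare, samples):
    """Return True if compare() agrees in sign with the builtin
    operators for every ordered pair drawn from samples."""
    for s, t in itertools.product(samples, repeat=2):
        if sign(compare(s, t)) != reference_compare(s, t):
            return False
    return True


# Includes the empty string, prefixes, and bytes >= 0x80 to catch
# signed-char mistakes.
SAMPLES = [b"", b"a", b"ab", b"b", b"bar", b"barx", b"foo", b"\x00", b"\xff"]

# With the extension built, one would call:
#   import bytes_util
#   check_three_way(bytes_util.cmp, SAMPLES)
print(check_three_way(reference_compare, SAMPLES))
```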
