Fast way to get the number of differing indexes in two strings

Question

I want to count the indexes at which two strings differ.

Things that are fixed:

Each string contains only 0 or 1 at every index, i.e. the strings are binary representations of numbers.

Both strings have the same length.

For this problem I wrote the following function in Python:

def foo(a,b):
    result = 0
    for x,y in zip(a,b):
        if x != y:
            result += 1
    return result

But these strings are huge, so the function above takes too much time. Is there anything I can do to make it much faster?

This is how I did the same in C++. It is quite fast now, but I still can't understand how to do the packing into short integers and everything else suggested by @Yves Daoust:

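size_t diff(long long int n1, long long int n2)
{
    // XOR leaves a 1 bit at every position where the inputs differ
    long long int c = n1 ^ n2;
    // size the bitset for long long (not int) so no high bits are dropped
    bitset<sizeof(long long int) * CHAR_BIT> bits(c);
    string s = bits.to_string();

    return std::count(s.begin(), s.end(), '1');
}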

Answer 1

I'll walk through the options here, but basically you are calculating the Hamming distance between two numbers. There are dedicated libraries that can make this really, really fast, but let's focus on the pure Python options first.

Your approach, zipping

In Python 2, zip() builds one big list first, then lets you loop over it. You could use itertools.izip() instead, and make it a generator expression:

from itertools import izip

def foo(a, b):
    return sum(x != y for x, y in izip(a, b))

This produces only one pair at a time, avoiding having to create a large list of tuples first.

The Python boolean type is a subclass of int, where True == 1 and False == 0, letting you sum them:

>>> True + True
2
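On Python 3 the same idea needs no import, because the built-in zip() is already lazy and itertools.izip no longer exists. A minimal sketch, where count_differences is an illustrative name that does not appear in the original answer:

```python
def count_differences(a, b):
    """Count the positions at which two equal-length strings differ."""
    # zip() yields one pair at a time on Python 3; each comparison is a
    # bool, and sum() adds them up as 1 (True) or 0 (False).
    return sum(x != y for x, y in zip(a, b))


s1 = "0100010010"
s2 = "0011100010"
differences = count_differences(s1, s2)
```

Note that zip() stops at the shorter input, so this relies on the question's guarantee that both strings have the same length.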

Using integers instead

However, you probably want to rethink your input data. It's much more efficient to use integers to represent your binary data, because integers can be operated on directly. Doing the conversion inline, then counting the number of 1s in the XOR result, looks like this:

def foo(a, b):
    return format(int(a, 2) ^ int(b, 2), 'b').count('1')

but not having to convert a and b to integers in the first place would be much more efficient.
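If a recent interpreter is available, the round trip through a string can be skipped: Python 3.10 added the int.bit_count() method, which returns the number of 1 bits directly. The sketch below assumes non-negative inputs and falls back to counting characters on older versions; hamming_xor is an illustrative name.

```python
def hamming_xor(a, b):
    """Hamming distance between two '0'/'1' strings of equal length."""
    x = int(a, 2) ^ int(b, 2)  # 1 bits mark the differing positions
    if hasattr(x, "bit_count"):
        # Python 3.10+: count the set bits without building a string
        return x.bit_count()
    # Older interpreters: count the '1' characters in the binary text
    return bin(x).count("1")


distance_xor = hamming_xor("0100010010", "0011100010")
```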

Time comparisons:

>>> from itertools import izip
>>> import timeit
>>> s1 = "0100010010"
>>> s2 = "0011100010"
>>> def foo_zipped(a, b): return sum(x != y for x, y in izip(a, b))
... 
>>> def foo_xor(a, b): return format(int(a, 2) ^ int(b, 2), 'b').count('1')
... 
>>> timeit.timeit('f(s1, s2)', 'from __main__ import s1, s2, foo_zipped as f')
1.7872788906097412
>>> timeit.timeit('f(s1, s2)', 'from __main__ import s1, s2, foo_xor as f')
1.3399651050567627
>>> s1 = s1 * 1000
>>> s2 = s2 * 1000
>>> timeit.timeit('f(s1, s2)', 'from __main__ import s1, s2, foo_zipped as f', number=1000)
1.0649528503417969
>>> timeit.timeit('f(s1, s2)', 'from __main__ import s1, s2, foo_xor as f', number=1000)
0.0779869556427002

The XOR approach is faster by orders of magnitude as the inputs get larger, and that is including the cost of converting the inputs to int first.
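To repeat this comparison on Python 3, something like the following could be used. It is only a sketch: time_both is a hypothetical helper, the globals argument of timeit.timeit() requires Python 3.5 or later, and the absolute numbers will differ from machine to machine, so none are shown.

```python
import timeit


def foo_zipped(a, b):
    return sum(x != y for x, y in zip(a, b))


def foo_xor(a, b):
    return format(int(a, 2) ^ int(b, 2), "b").count("1")


def time_both(s1, s2, number=100):
    """Return the seconds each approach takes for `number` calls."""
    env = {"s1": s1, "s2": s2, "foo_zipped": foo_zipped, "foo_xor": foo_xor}
    return {
        name: timeit.timeit(name + "(s1, s2)", globals=env, number=number)
        for name in ("foo_zipped", "foo_xor")
    }


s1 = "0100010010" * 1000
s2 = "0011100010" * 1000
# Check that both approaches agree before comparing their speed
assert foo_zipped(s1, s2) == foo_xor(s1, s2)
timings = time_both(s1, s2)
```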

Dedicated libraries for bitcounting

The bit counting (format(integer, 'b').count('1')) is pretty fast, but can be made faster still if you install the gmpy extension library (a Python wrapper around the GMP library) and use the gmpy.popcount() function:

def foo(a, b):
    return gmpy.popcount(int(a, 2) ^ int(b, 2))

gmpy.popcount() is about 20 times faster on my machine than the str.count() method. Again, not having to convert a and b to integers to begin with would remove another bottleneck, but even with the conversion the per-call time drops by about a third:

>>> import gmpy
>>> def foo_xor_gmpy(a, b): return gmpy.popcount(int(a, 2) ^ int(b, 2))
... 
>>> timeit.timeit('f(s1, s2)', 'from __main__ import s1, s2, foo_xor as f', number=10000)
0.7225301265716553
>>> timeit.timeit('f(s1, s2)', 'from __main__ import s1, s2, foo_xor_gmpy as f', number=10000)
0.47731995582580566

To illustrate the difference when a and b are integers to begin with:

>>> si1, si2 = int(s1, 2), int(s2, 2)
>>> def foo_xor_int(a, b): return format(a ^ b, 'b').count('1')
... 
>>> def foo_xor_gmpy_int(a, b): return gmpy.popcount(a ^ b)
... 
>>> timeit.timeit('f(si1, si2)', 'from __main__ import si1, si2, foo_xor_int as f', number=100000)
3.0529568195343018
>>> timeit.timeit('f(si1, si2)', 'from __main__ import si1, si2, foo_xor_gmpy_int as f', number=100000)
0.15820622444152832

Dedicated libraries for Hamming distances

The gmpy library actually includes a gmpy.hamdist() function, which calculates this exact number (the number of 1 bits in the XOR of the two integers) directly:

def foo_gmpy_hamdist(a, b):
    return gmpy.hamdist(int(a, 2), int(b, 2))

which will blow your socks off entirely if you use integers to begin with:

def foo_gmpy_hamdist_int(a, b):
    return gmpy.hamdist(a, b)

Comparisons:

>>> def foo_gmpy_hamdist(a, b):
...     return gmpy.hamdist(int(a, 2), int(b, 2))
... 
>>> def foo_gmpy_hamdist_int(a, b):
...     return gmpy.hamdist(a, b)
... 
>>> timeit.timeit('f(s1, s2)', 'from __main__ import s1, s2, foo_xor as f', number=100000)
7.479684114456177
>>> timeit.timeit('f(s1, s2)', 'from __main__ import s1, s2, foo_gmpy_hamdist as f', number=100000)
4.340585947036743
>>> timeit.timeit('f(si1, si2)', 'from __main__ import si1, si2, foo_gmpy_hamdist_int as f', number=100000)
0.22896099090576172

That's 100,000 computations of the Hamming distance between two numbers of more than 3,000 decimal digits (10,000 bits) each.
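The examples above use the original gmpy package. Its successor, gmpy2, also provides hamdist() and popcount(). The sketch below assumes gmpy2 may or may not be installed and falls back to pure Python if it is missing; hamming_ints is an illustrative name and the inputs are assumed to be non-negative integers.

```python
try:
    import gmpy2  # third-party package; assumed to be installed separately
except ImportError:
    gmpy2 = None


def hamming_ints(a, b):
    """Number of bit positions in which non-negative integers a and b differ."""
    if gmpy2 is not None:
        # Convert explicitly to mpz so hamdist() receives GMP integers
        return int(gmpy2.hamdist(gmpy2.mpz(a), gmpy2.mpz(b)))
    # Pure Python fallback: count the 1 bits of the XOR
    return bin(a ^ b).count("1")


si1 = int("0100010010", 2)
si2 = int("0011100010", 2)
distance_ints = hamming_ints(si1, si2)
```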

Another package that can calculate the distance is Distance, which supports calculating the Hamming distance between strings directly.

Make sure you use the --with-c switch to have it compile the C optimisations; when installing with pip, use bin/pip install Distance --install-option --with-c, for example.

Benchmarking this against the XOR-with-bitcount approach again:

>>> import distance
>>> def foo_distance_hamming(a, b):
...     return distance.hamming(a, b)
... 
>>> timeit.timeit('f(s1, s2)', 'from __main__ import s1, s2, foo_xor as f', number=100000)
7.229060173034668
>>> timeit.timeit('f(s1, s2)', 'from __main__ import s1, s2, foo_distance_hamming as f', number=100000)
0.7701470851898193

It uses the naive approach: zip over both input strings and count the number of differences. Since it does this in C it is still a lot faster, about 10 times as fast. The gmpy.hamdist() function still beats it when you use integers, however.

Answer 2

Untested, but how about:

sum(x!=y for x,y in zip(a,b))

Answer 3

If the strings represent binary numbers, you can convert to integers and use bitwise operators:

def foo(s1, s2):
    # return sum(map(int, format(int(s1, 2) ^ int(s2, 2), 'b'))) # one-liner
    a = int(s1, 2) # convert string to integer 
    b = int(s2, 2)
    c = a ^ b # use xor to get differences
    s = format(c, 'b') # convert back to string of zeroes and ones
    return sum(map(int, s)) # sum all ones (count of differences)

s1 = "0100010010"
s2 = "0011100010"
     # 12345

assert foo(s1, s2) == 5

Answer 4

Pack your strings as short integers (16 bits). After XORing, look each result up in a precomputed table of 65536 entries that gives the number of 1 bits per short.
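As a rough illustration of the packing and lookup-table idea, here is a pure Python sketch. The names POPCOUNT_16, pack16 and hamming_packed are made up for this example, and in a compiled language the same table would be indexed from an array of 16-bit values.

```python
# Precompute the number of 1 bits for every possible 16-bit value.
# Each entry is at most 16, so the table fits in a bytes object.
POPCOUNT_16 = bytes(bin(i).count("1") for i in range(1 << 16))


def pack16(bits):
    """Pack a '0'/'1' string into a list of integers of up to 16 bits each."""
    return [int(bits[i:i + 16], 2) for i in range(0, len(bits), 16)]


def hamming_packed(words_a, words_b):
    """Sum the table entries for the XOR of each pair of packed words."""
    return sum(POPCOUNT_16[x ^ y] for x, y in zip(words_a, words_b))


packed1 = pack16("0100010010" * 1000)
packed2 = pack16("0011100010" * 1000)
distance_packed = hamming_packed(packed1, packed2)
```

The packing step is the expensive part, so this only pays off when the data is packed once and compared many times.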

If pre-packing is not an option, switch to C++ with inline AVX2 intrinsics. They will allow you to load 32 characters in a single instruction, perform the comparisons, then pack the 32 results into 32 bits (if I am right).
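This is not AVX2, but the same "compare many characters at once" idea can be approximated in Python without pre-packing. The characters '0' and '1' are the bytes 0x30 and 0x31, which differ only in their lowest bit, so XORing the raw bytes of the two strings leaves exactly one 1 bit per differing character. A sketch, assuming ASCII input containing only '0' and '1' (hamming_bytes is an illustrative name):

```python
def hamming_bytes(a, b):
    """Count differing characters in two equal-length '0'/'1' byte strings."""
    # Treat each whole byte string as one big integer; XOR then yields
    # 0x00 for equal characters and 0x01 for differing ones.
    x = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(x).count("1")


raw1 = "0100010010".encode("ascii")
raw2 = "0011100010".encode("ascii")
distance_bytes = hamming_bytes(raw1, raw2)
```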
