
Why is string comparison so fast in Python?

I became curious to understand the internals of how string comparison works in Python when I was solving the following example algorithm problem:

Given two strings, return the length of the longest common prefix.

Solution 1: charByChar

My intuition told me that the optimal solution would be to start with one cursor at the beginning of both words and iterate forward until the prefixes no longer match. Something like:

def charByChar(smaller, bigger):
  assert len(smaller) <= len(bigger)
  for p in range(len(smaller)):
    if smaller[p] != bigger[p]:
      return p
  return len(smaller)

To simplify the code, the function assumes that the length of the first string, smaller, is always smaller than or equal to the length of the second string, bigger.
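
For example (with made-up inputs):

print(charByChar("flower", "flowchart"))   # 4: the common prefix is "flow"
print(charByChar("abc", "abcde"))          # 3: smaller is a full prefix of bigger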

Solution 2: binarySearch

Another method is to bisect the two strings to create two prefix substrings. If the prefixes are equal, we know that the common prefix is at least as long as the midpoint. Otherwise the common prefix is shorter than the midpoint. We can then recurse to find the prefix length.

Aka binary search.

def binarySearch(smaller, bigger):
  assert len(smaller) <= len(bigger)
  lo = 0
  hi = len(smaller)

  # binary search for prefix
  while lo < hi:
    # +1 for even lengths
    mid = ((hi - lo + 1) // 2) + lo

    if smaller[:mid] == bigger[:mid]:
      # prefixes equal
      lo = mid
    else:
      # prefixes not equal
      hi = mid - 1

  return lo
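
As a quick sanity check (made-up inputs), the two solutions should agree:

for a, b in [("flower", "flowchart"), ("abc", "abcde"), ("x", "y")]:
    assert charByChar(a, b) == binarySearch(a, b)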

At first I assumed that binarySearch would be slower because string comparison would compare all characters several times rather than just the prefix characters, as in charByChar.

Surprisingly, binarySearch turned out to be much faster after some preliminary benchmarking.

Figure A

[image: lcp_fixed_suffix]

The figure above shows how performance is affected as prefix length is increased. Suffix length remains constant at 50 characters.

This graph shows two things:

  1. As expected, both algorithms perform linearly worse as prefix length increases.
  2. Performance of charByChar degrades at a much faster rate.

Why is binarySearch so much better? I think it is because:

  1. The string comparison in binarySearch is presumably optimized by the interpreter / CPU behind the scenes.
  2. charByChar actually creates new strings for each character accessed, and this produces significant overhead.

To validate this I benchmarked the performance of comparing and slicing a string, labelled cmp and slice respectively below.
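
The exact harness behind the figures isn't shown; a minimal timeit sketch of this kind of measurement (sizes and repeat counts are illustrative) could look like:

import timeit

n = 10_000_000
s1 = "a" * n
s2 = "a" * n          # equal strings force cmp to look at every character

# cmp: compare two length-n strings once per run
cmp_t = timeit.timeit("s1 == s2", globals={"s1": s1, "s2": s2}, number=100)

# slice: build a length-n prefix copy once per run
slice_t = timeit.timeit("s1[:n]", globals={"s1": s1, "n": n}, number=100)

print(f"cmp:   {cmp_t:.4f} s")
print(f"slice: {slice_t:.4f} s")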

Figure B

[image: cmp]

This graph shows two important things:

  1. As expected, the cost of comparing and slicing increases linearly with length.
  2. Relative to overall algorithm performance (Figure A), the cost of comparing and slicing grows very slowly with length. Note that both figures go up to strings of 1 billion characters. Therefore, the cost of comparing 1 character 1 billion times is much, much greater than the cost of comparing 1 billion characters once. But this still doesn't answer why...

CPython

In an effort to find out how the CPython interpreter optimizes string comparison, I generated the bytecode for the following function.

In [9]: def slice_cmp(a, b): return a[0] == b[0]

In [10]: dis.dis(slice_cmp)
            0 LOAD_FAST                0 (a)
            2 LOAD_CONST               1 (0)
            4 BINARY_SUBSCR
            6 LOAD_FAST                1 (b)
            8 LOAD_CONST               1 (0)
           10 BINARY_SUBSCR
           12 COMPARE_OP               2 (==)
           14 RETURN_VALUE

I poked around the CPython code and found the following two pieces of code, but I'm not sure this is where string comparison occurs.

The question

  • Where in CPython does string comparison occur?
  • Is there a CPU optimization? Is there a special x86 instruction which does string comparison? How can I see what assembly instructions are generated by CPython? You may assume I am using the latest Python 3, Intel Core i5, OS X 10.11.6.
  • Why is comparing a long string so much faster than comparing each of its characters?

Bonus question: When is charByChar more performant?

If the prefix is sufficiently small in comparison to the length of the rest of the string, at some point the cost of creating substrings in charByChar becomes less than the cost of comparing the substrings in binarySearch.
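
One rough way to probe that crossover empirically (a hypothetical harness, not the one used for the figures; it assumes charByChar and binarySearch are defined in the same script) is to time both functions for a fixed short prefix and a long suffix:

import timeit

setup = """
from __main__ import charByChar, binarySearch
prefixLen, suffixLen = 10, 1_000_000          # illustrative sizes
smaller = "a" * prefixLen + "x" + "b" * suffixLen
bigger  = "a" * prefixLen + "y" + "c" * suffixLen
"""

for fn in ("charByChar", "binarySearch"):
    t = timeit.timeit(f"{fn}(smaller, bigger)", setup=setup, number=100)
    print(fn, round(t, 4))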

To describe this relationship I delved into runtime analysis.

Runtime analysis

To simplify the equations below, let's assume that smaller and bigger are the same size, and I will refer to them as s1 and s2.

charByChar

charByChar(s1, s2) = costOfOneChar * prefixLen

where

costOfOneChar = cmp(1) + slice(s1Len, 1) + slice(s2Len, 1)

where cmp(1) is the cost of comparing two strings of length 1.

slice is the cost of accessing a char, the equivalent of charAt(i). Python has immutable strings, and accessing a char actually creates a new string of length 1. slice(string_len, slice_len) is the cost of slicing a string of length string_len to a slice of size slice_len.
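
A quick illustration of that in the interpreter:

s = "abcdef"

c = s[0]                  # indexing a str yields a brand-new str of length 1
print(type(c), len(c))    # <class 'str'> 1

p = s[:3]                 # slicing also allocates and copies a new string
print(p, p is s)          # abc False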

So

charByChar(s1, s2) = O((cmp(1) + slice(s1Len, 1)) * prefixLen)

binarySearch

binarySearch(s1, s2) = costOfHalfOfEachString * log_2(s1Len)

log_2 is the number of times to divide the strings in half until reaching a string of length 1, where

costOfHalfOfEachString = slice(s1Len, s1Len / 2) + slice(s2Len, s1Len / 2) + cmp(s1Len / 2)

So the big-O of binarySearch will grow according to

binarySearch(s1, s2) = O((slice(s2Len, s1Len) + cmp(s1Len)) * log_2(s1Len))

Based on our previous analysis of the costs of cmp and slice, if we assume that costOfHalfOfEachString is approximately the same as costOfComparingOneChar, then we can refer to them both as x.

charByChar(s1, s2) = O(x * prefixLen)
binarySearch(s1, s2) = O(x * log_2(s1Len))

If we equate them:

O(charByChar(s1, s2)) = O(binarySearch(s1, s2))
x * prefixLen = x * log_2(s1Len)
prefixLen = log_2(s1Len)
2 ** prefixLen = s1Len

So O(charByChar(s1, s2)) > O(binarySearch(s1, s2)) when

2 ** prefixLen > s1Len

and the two should be roughly equal at the boundary 2 ** prefixLen = s1Len.

So, plugging in the above formula, I regenerated the tests for Figure A, but with strings of total length 2 ** prefixLen, expecting the performance of the two algorithms to be roughly equal.

[image]

However, charByChar clearly performs much better. With a bit of trial and error, the performance of the two algorithms is roughly equal when s1Len = 200 * prefixLen.

[image]

Why is the relationship 200x?

TL;DR: a slice compare is some Python overhead + a highly-optimized memcmp (unless there's UTF-8 processing?). Ideally, use slice compares to find the first mismatch to within less than 128 bytes or something, then loop a char at a time.

Or, if it's an option and the problem is important, make a modified copy of an asm-optimized memcmp that returns the position of the first difference instead of equal/not-equal; it will run as fast as a single == of the whole strings. Python has ways to call native C / asm functions in libraries.
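
As one illustration of calling into libc from Python, a minimal ctypes sketch (platform-dependent; note that the standard memcmp only reports equal / greater / less, not the position of the first difference, which is exactly the limitation described here):

import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))   # libSystem on OS X, glibc on Linux
libc.memcmp.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t)
libc.memcmp.restype = ctypes.c_int

a = b"x" * 1_000_000 + b"a"
b = b"x" * 1_000_000 + b"b"

# Non-zero result: the buffers differ somewhere, but memcmp won't say where.
print(libc.memcmp(a, b, min(len(a), len(b))))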

It's a frustrating limitation that the CPU can do this blazingly fast, but Python doesn't (AFAIK) give you access to an optimized compare loop that tells you the mismatch position instead of just equal / greater / less.


It's totally normal that interpreter overhead dominates the cost of the real work in a simple Python loop with CPython. Building an algorithm out of optimized building blocks is worth it even if it means doing more total work. This is why NumPy is good, but looping over a matrix element-by-element is terrible. The speed difference might be something like a factor of 20 to 100 for CPython vs. a compiled C (asm) loop comparing one byte at a time (made-up numbers, but probably right to within an order of magnitude).

Comparing blocks of memory for equality is probably one of the biggest mismatches between Python loops and operating on a whole list / slice. It's a common problem with highly-optimized solutions (e.g. most libc implementations (including OS X's) have a manually-vectorized hand-coded asm memcmp that uses SIMD to compare 16 or 32 bytes in parallel, and runs much faster than a byte-at-a-time loop in C or assembly). So there's another factor of 16 to 32 (if memory bandwidth isn't a bottleneck) multiplying the factor-of-20-to-100 speed difference between Python and C loops. Or, depending on how optimized your memcmp is, maybe "only" 6 or 8 bytes per cycle.

With data hot in L2 or L1d cache for medium-sized buffers, it's reasonable to expect 16 or 32 bytes per cycle for memcmp on a Haswell or later CPU. (i3/i5/i7 naming started with Nehalem; i5 alone is not sufficient to tell us much about your CPU.)

I'm not sure if either or both of your comparisons are having to process UTF-8 and check for equivalency classes or different ways to encode the same character. The worst case is if your Python char-at-a-time loop has to check for potentially multi-byte characters but your slice compare can just use memcmp.


Writing an efficient version in Python:

We're just totally fighting against the language to get efficiency: your problem is almost exactly the same as the C standard library function memcmp, except you want the position of the first difference instead of a - / 0 / + result telling you which string is greater. The search loop is identical; it's just a difference in what the function does after finding the result.

Your binary search is not the best way to use a fast compare building block. A slice compare still has O(n) cost, not O(1), just with a much smaller constant factor. You can and should avoid re-comparing the starts of the buffers repeatedly by using slices to compare large chunks until you find a mismatch, then going back over that last chunk with a smaller chunk size.

# I don't actually know Python well; consider this a sketch
# and leave an edit if I got something wrong :P
char_at_a_time = charByChar          # the char-by-char loop from the question

def chunked_prefix(smaller, bigger):
    assert len(smaller) <= len(bigger)
    chunksize = min(8192, len(smaller))
    # possibly round chunksize down to the next lowest power of 2?
    start = 0
    while chunksize and start + chunksize <= len(smaller):
        if smaller[start:start+chunksize] == bigger[start:start+chunksize]:
            start += chunksize                    # whole chunk matches: skip past it
        elif chunksize <= 128:
            # the mismatch is inside a small chunk: finish char-at-a-time
            return start + char_at_a_time(smaller[start:start+chunksize],
                                          bigger[start:start+chunksize])
        else:
            chunksize //= 8                       # re-scan the same start with smaller chunks

    # tail: covers lengths that aren't a multiple of chunksize and a
    # difference only in the last characters
    return start + char_at_a_time(smaller[start:], bigger[start:])
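
A quick check of the sketch against the earlier char-by-char function (made-up inputs):

smaller = "x" * 100_000 + "a" + "y" * 50
bigger  = "x" * 100_000 + "b" + "z" * 50
print(chunked_prefix(smaller, bigger))                                  # 100000
print(chunked_prefix(smaller, bigger) == charByChar(smaller, bigger))   # True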

I chose 8192 because your CPU has a 32kiB L1d cache, so the total cache footprint of two 8k slices is 16k, half your L1d. When the loop finds a mismatch, it will re-scan the last 8kiB in 1k chunks, and these compares will loop over data that's still hot in L1d. (Note that if == found a mismatch, it probably only touched data up to that point, not the whole 8k. But HW prefetch will keep going somewhat beyond that.)

A factor of 8 should be a good balance between using large slices to localize quickly vs. not needing many passes over the same data. This is a tunable parameter of course, along with chunk size. (The bigger the mismatch between Python and asm, the smaller this factor should be, to reduce Python loop iterations.)

Hopefully 8k is big enough to hide the Python loop / slice overhead; hardware prefetching should still be working during the Python overhead between memcmp calls from the interpreter, so we don't need the granularity to be huge. But for really big strings, if 8k doesn't saturate memory bandwidth then maybe make it 64k (your L2 cache is 256kiB; i5 does tell us that much).

How exactly is memcmp so fast:

I am running this on Intel Core i5 but I would imagine I would get the same results on most modern CPUs.

Even in C, memcmp is faster than a byte-at-a-time compare loop (see "Why is memcmp so much faster than a for loop check?"), because even C compilers aren't great at (or totally incapable of) auto-vectorizing search loops.

Even without hardware SIMD support, an optimized memcmp could check 4 or 8 bytes at a time (word size / register width), even on a simple CPU without 16-byte or 32-byte SIMD.

But most modern CPUs, and all x86-64 CPUs, have SIMD instructions. SSE2 is baseline for x86-64, and available as an extension in 32-bit mode.

An SSE2 or AVX2 memcmp can use pcmpeqb / pmovmskb to compare 16 or 32 bytes in parallel. (I'm not going to go into detail about how to write memcmp in x86 asm or with C intrinsics. Google that separately, and/or look up those asm instructions in an x86 instruction-set reference like http://felixcloutier.com/x86/index.html. See also the x86 tag wiki for asm and performance links, e.g. "Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?", which has some info about single-core memory bandwidth limitations.)

I found an old version from 2005 of Apple's x86-64 memcmp (in AT&T-syntax assembly language) on their open-source web site. It could definitely be better; for large buffers it should align one source pointer and only use movdqu for the other one, allowing movdqu then pcmpeqb with a memory operand instead of 2x movdqu, even if the strings are misaligned relative to each other. xorl $0xFFFF,%eax / jnz is also not optimal on CPUs where cmp/jcc can macro-fuse but xor / jcc can't.

Unrolling to check a whole 64-byte cache line at once would also hide loop overhead. (This is the same idea as using a large chunk and then looping back over it when you find a hit.) Glibc's AVX2-movbe version does this, using vpand to combine compare results in the main large-buffer loop, with the final combine being a vptest that also sets flags from the result. (Smaller code-size but no fewer uops than vpand / vpmovmskb / cmp / jcc; but no downside, and maybe lower latency to reduce branch-mispredict penalties on loop exit.) Glibc does dynamic CPU dispatching at dynamic link time; it picks this version on CPUs that support it.

Hopefully Apple's memcmp is better these days; I don't see source for it at all in the most recent Libc directory, though. Hopefully they dispatch at runtime to an AVX2 version for Haswell and later CPUs.

The LLoopOverChunks loop in the version I linked would only run at 1 iteration (16 bytes from each input) per ~2.5 cycles on Haswell; 10 fused-domain uops. But that's still much faster than the ~1 byte per cycle of a naive C loop, and a Python loop is much, much worse than that.

Glibc's L(loop_4x_vec): loop is 18 fused-domain uops, and can thus run at just slightly less than 32 bytes (from each input) per clock cycle when data is hot in L1d cache. Otherwise it will bottleneck on L2 bandwidth. It could have been 17 uops if they hadn't used an extra instruction inside the loop to decrement a separate loop counter, and had instead calculated an end-pointer outside the loop.


Finding instructions / hot spots in the Python interpreter's own code

How could I drill down to find the C instructions and CPU instructions that my code calls?

On Linux you could run perf record python ... then perf report -Mintel to see which functions the CPU was spending the most time in, and which instructions in those functions were the hottest. You'll get results something like I posted here: Why is float() faster than int()? (Drill down into any function to see the actual machine instructions that ran, shown as assembly language, because perf has a disassembler built in.)

For a more nuanced view that samples the call-graph on each event, see linux perf: how to interpret and find hotspots.

(When you're looking to actually optimize a program, you want to know which function calls are expensive so you can try to avoid them in the first place. Profiling for just "self" time will find hot spots, but you won't always know which different callers caused a given loop to run most of its iterations. See Mike Dunlavey's answer on that perf question.)

But for this specific case, profiling the interpreter running a slice-compare version over big strings should hopefully find the memcmp loop where I think it will be spending most of its time. (Or for the char-at-a-time version, find the interpreter code that's "hot".)

Then you can directly see what asm instructions are in the loop. From the function names, assuming your binary has any symbols, you can probably find the source. Or if you have a version of Python built with debug info, you can get to the source directly from the profile info. (Not a debug build with optimization disabled, just one with full symbols.)

This is both implementation-dependent and hardware-dependent. Without knowing your target machine and specific distribution, I couldn't say for sure. However, I strongly suspect that the underlying hardware, like most, has memory block instructions. Among other things, these can compare a pair of arbitrarily long strings (up to addressing limits) in a parallel and pipelined fashion. For instance, they may compare 8-byte slices at one slice per clock cycle. This is a lot faster than fiddling with byte-level indices.
