
How to speed up this Python code?

I've got the following tiny Python method that is by far the performance hotspot (according to my profiler, >95% of execution time is spent here) in a much larger program:

def topScore(self, seq):
    ret = -1e9999
    logProbs = self.logProbs  # save indirection
    l = len(logProbs)
    for i in xrange(len(seq) - l + 1):
        score = 0.0
        for j in xrange(l):
            score += logProbs[j][seq[j + i]]
        ret = max(ret, score)

    return ret

The code is being run in the Jython implementation of Python, not CPython, if that matters. seq is a DNA sequence string, on the order of 1,000 elements. logProbs is a list of dictionaries, one for each position. The goal is to find the maximum score of any contiguous length-l subsequence of seq, where l is on the order of 10-20 elements.

I realize all this looping is inefficient due to interpretation overhead and would be a heck of a lot faster in a statically compiled/JIT'd language. However, I'm not willing to switch languages. First, I need a JVM language for the libraries I'm using, and this constrains my choices. Second, I don't want to translate this code wholesale into a lower-level JVM language. However, I'm willing to rewrite this hotspot in something else if necessary, though I have no clue how to interface it or what the overhead would be.

In addition to the single-threaded slowness of this method, I also can't get the program to scale much past 4 CPUs in terms of parallelization. Given that it spends almost all its time in the 10-line hotspot I've posted, I can't figure out what the bottleneck could be here.

It is slow because of the nested loops: the work is O(N*L) for a sequence of length N and L positions, which approaches O(N*N) as L grows.

The maximum subsequence algorithm may help you improve this.

If topScore is called repeatedly for the same seq, you could memoize its value.

E.g. http://code.activestate.com/recipes/52201/
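
A minimal sketch of that memoization idea, assuming seq is a hashable string and that logProbs does not change between calls (otherwise the cache would have to be cleared). The class name Scorer, the _cache attribute and the call counter are hypothetical, added only for illustration. It uses range so it runs on Python 3; on Python 2 or Jython, xrange would be the usual choice.

```python
class Scorer(object):
    def __init__(self, log_probs):
        self.logProbs = log_probs
        self._cache = {}          # hypothetical: seq string -> best score
        self.uncached_calls = 0   # only here to show the cache working

    def _top_score_uncached(self, seq):
        # Same nested loop as the question's topScore.
        self.uncached_calls += 1
        log_probs = self.logProbs
        l = len(log_probs)
        best = float('-inf')
        for i in range(len(seq) - l + 1):
            score = 0.0
            for j in range(l):
                score += log_probs[j][seq[j + i]]
            if score > best:
                best = score
        return best

    def topScore(self, seq):
        # Compute each distinct sequence once, then answer from the dict.
        try:
            return self._cache[seq]
        except KeyError:
            result = self._cache[seq] = self._top_score_uncached(seq)
            return result
```

This only pays off if the same sequences really do recur; for sequences that are all distinct it adds a dictionary lookup and nothing else.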

I don't have any idea what I'm doing, but maybe this can help speed up your algorithm:

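import collections

ret = -1e9999
logProbs = self.logProbs  # save indirection
l = len(logProbs)

# running total for each window start i; float default matches the float scores
scores = collections.defaultdict(float)

for j in xrange(l):
    prob = logProbs[j]  # fetch this position's dictionary once per column
    for i in xrange(len(seq) - l + 1):
        scores[i] += prob[seq[j + i]]

ret = max(ret, max(scores.values()))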

What about precomputing xrange(l) outside the for i loop?

Nothing jumps out as being slow. I might rewrite the inner loop like this:

score = sum(logProbs[j][seq[j+i]] for j in xrange(l))

or even:

seqmatch = zip(seq[i:i+l], logProbs)
score = sum(posscores[base] for base, posscores in seqmatch)

but I don't know that either would save much time.

It might be marginally quicker to store DNA bases as integers 0-3, and look up the scores from a tuple instead of a dictionary. There'll be a performance hit on translating letters to numbers, but that only has to be done once.
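
A rough sketch of that integer encoding, assuming the alphabet is exactly A, C, G and T (no ambiguity codes such as N) and that every dictionary in logProbs has all four keys. The names BASE_CODES, encode, to_tuples and top_score are made up for this example. Whether it is actually faster under Jython would need to be measured.

```python
BASE_CODES = {'A': 0, 'C': 1, 'G': 2, 'T': 3}  # assumed alphabet

def encode(seq):
    """Translate a DNA string to a list of ints 0-3 (done once per sequence)."""
    return [BASE_CODES[base] for base in seq]

def to_tuples(log_probs):
    """Turn each position's dict into a tuple indexed by base code (done once)."""
    return [tuple(d[base] for base in 'ACGT') for d in log_probs]

def top_score(score_table, codes):
    """Best window score, using tuple indexing instead of dict lookups."""
    l = len(score_table)
    best = float('-inf')
    for i in range(len(codes) - l + 1):
        score = 0.0
        for j in range(l):
            score += score_table[j][codes[i + j]]
        if score > best:
            best = score
    return best
```

The two conversions belong outside the hot path: build the tuple table when logProbs is set, and encode each sequence once when it is read.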

Definitely use numpy and store logProbs as a 2D array instead of a list of dictionaries. Also store seq as a 1D array of (short) integers as suggested above. This will help if you don't have to do these conversions every time you call the function (doing these conversions inside the function won't save you much). You can then eliminate the second loop:

import numpy as np
...
print np.shape(self.logProbs) # (20, 4)
print np.shape(seq) # (1000,)
...
def topScore(self, seq):
    ret = -1e9999
    logProbs = self.logProbs  # save indirection
    l = len(logProbs)
    rows = np.arange(l)  # one row index per position
    for i in xrange(len(seq) - l + 1):
        # pair row j with column seq[i + j], then sum the l selected entries
        score = np.sum(logProbs[rows, seq[i:i+l]])
        ret = max(ret, score)

    return ret

What you do after that depends on which of these 2 data elements changes the most often:

If logProbs generally stays the same and you want to run many DNA sequences through it, then consider stacking your DNA sequences as a 2D array. numpy can loop through the 2D array very quickly, so if you have 200 DNA sequences to process, it will only take a little longer than a single one.
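
One possible way to sketch the stacking idea, assuming all sequences have the same length N (at least as long as the number of positions), are already encoded as integers 0-3, and that logProbs is an (L, 4) float array. The function name top_scores is hypothetical. It loops only over the L positions and lets numpy handle every sequence and every window at once.

```python
import numpy as np

def top_scores(log_probs, seqs):
    """Best window score for each row of seqs.

    log_probs: float array of shape (L, 4)
    seqs:      int array of shape (S, N) with base codes 0-3, N >= L
    """
    L = log_probs.shape[0]
    n_windows = seqs.shape[1] - L + 1
    totals = np.zeros((seqs.shape[0], n_windows))
    for j in range(L):
        # Column block j..j+n_windows holds, for every window start i,
        # the base at offset j; indexing row j of log_probs with it
        # yields an (S, n_windows) array of per-position scores.
        totals += log_probs[j][seqs[:, j:j + n_windows]]
    return totals.max(axis=1)
```

For example, with 200 sequences of length 1,000 stacked into a (200, 1000) array, the function returns an array of 200 best scores.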

Finally, if you really need a speedup, use scipy.weave. This is a very easy way to write a few lines of fast C to accelerate your loops. However, I recommend scipy >0.8.

You can try hoisting more than just self.logProbs outside the loops:

def topScore(self, seq):
    ret = -1e9999
    logProbs = self.logProbs  # save indirection
    l = len(logProbs)
    lrange = range(l)
    for i in xrange(len(seq) - l + 1):
        score = 0.0
        for j in lrange:
            score += logProbs[j][seq[j + i]]
        if score > ret: ret = score # avoid lookup and function call

    return ret

I doubt it will make a significant difference, but you could try changing:

    for j in xrange(l):
        score += logProbs[j][seq[j + i]]

to

    for j, lP in enumerate(logProbs):
        score += lP[seq[j + i]]

or even hoisting that enumeration outside the seq loop.
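
Hoisting the enumeration might look roughly like this. It is written as a standalone function (the name top_score is hypothetical) rather than a method so that it can be run on its own, and it uses range so it works on Python 3; under Jython, xrange would be the usual choice.

```python
def top_score(log_probs, seq):
    # Build the (index, dict) pairs once and reuse them for every window,
    # instead of calling enumerate() again on each pass of the outer loop.
    indexed = list(enumerate(log_probs))
    best = float('-inf')
    for i in range(len(seq) - len(log_probs) + 1):
        score = 0.0
        for j, lp in indexed:
            score += lp[seq[i + j]]
        if score > best:  # plain comparison, no max() call
            best = score
    return best
```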
