Improving the performance of a sliding-window fragment function in Python 3

I have a script in Python 3.6.8 that reads through a very large text file, where each line is an ASCII string drawn from the alphabet {a,b,c,d,e,f}.

For each line, I have a function that fragments the string using a sliding window of size k and then increments a fragment counter dictionary, fragment_dict, by 1 for each fragment seen.

The same fragment_dict is used for the entire file, and it is initialized with all 5^k possible fragments mapped to zero.

I also ignore any fragment that contains the character c. Note that c is uncommon, and most lines will not contain it at all.
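The question does not show how fragment_dict is built. One possible way is sketched below, assuming the 5^k count refers to the five letters other than c (a, b, d, e, f). The helper name make_fragment_dict is hypothetical and not part of the original script.

```python
import itertools


def make_fragment_dict(k, alphabet="abdef"):
    """Map every length-k string over `alphabet` to a zero count.

    Assumption: 'c' is left out of the alphabet because fragments
    containing it are never counted.
    """
    return {"".join(chars): 0 for chars in itertools.product(alphabet, repeat=k)}


fragment_dict = make_fragment_dict(2)
print(len(fragment_dict))  # 25, i.e. 5**2
```

Pre-filling the dictionary this way means the `+= 1` in the counting loop never raises a KeyError for a fragment without c, at the cost of holding all 5^k keys in memory.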

def fragment_string(mystr, fragment_dict, k):
    for i in range(len(mystr) - k + 1):

        fragment = mystr[i:i+k]
        if 'c' in fragment:
            continue

        fragment_dict[fragment] += 1

Because my file is so large, I would like to optimize the performance of the above function as much as possible. Could anyone suggest optimizations to make this function faster?

I'm worried I may be bottlenecked by the speed of Python loops, in which case I would need to consider dropping down into C/Cython.

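NumPy may help speed up your code:

import collections
import numpy as np

x = np.array([ord(ch) - ord('a') for ch in mystr])  # 'a'..'f' -> 0..5
weights = 6 ** np.arange(k)  # exact integer powers [1, 6, 36, ...]
fragment_dict = collections.Counter(np.convolve(x, weights, mode='valid').tolist())

The idea is to represent each length-k segment as a k-digit number. Because the six letters map to the digits 0-5, the base has to be 6; in base 5 the digit 5 would make different fragments collide (for k = 2, "ba" and "af" would both encode to 5). Converting the list of digits into these numbers is then equivalent to applying a convolution with [1, 6, 36, 216, ...] as the filter. Note that the keys of the resulting Counter are integer codes rather than strings, and that fragments containing c are still counted, so both need to be handled afterwards.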
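The convolution idea can be sketched end to end as follows. This is only an illustration under stated assumptions, not a benchmarked solution: it uses base 6 because the six letters map to the digits 0-5, it masks out windows containing c with a second convolution, and it decodes the integer codes back to strings. The names count_fragments and decode are hypothetical.

```python
import collections
import numpy as np

ALPHABET = "abcdef"   # six symbols, so each fragment is a base-6 number
BASE = len(ALPHABET)


def decode(code, k):
    """Turn a base-6 fragment code back into its k-letter string."""
    letters = []
    for _ in range(k):
        code, digit = divmod(code, BASE)
        letters.append(ALPHABET[digit])
    return "".join(reversed(letters))


def count_fragments(mystr, k):
    """Count the length-k fragments of mystr that do not contain 'c'."""
    if len(mystr) < k:
        # np.convolve(mode="valid") would swap its arguments here.
        return collections.Counter()
    # Map 'a'..'f' to 0..5 without a Python-level loop.
    raw = np.frombuffer(mystr.encode("ascii"), dtype=np.uint8)
    x = raw.astype(np.int64) - ord("a")
    weights = BASE ** np.arange(k, dtype=np.int64)
    # np.convolve flips the weights, so the first character of each
    # window becomes the most significant base-6 digit.
    codes = np.convolve(x, weights, mode="valid")
    # A window contains 'c' exactly when its count of 'c' is non-zero.
    is_c = (x == ALPHABET.index("c")).astype(np.int64)
    has_c = np.convolve(is_c, np.ones(k, dtype=np.int64), mode="valid") > 0
    values, counts = np.unique(codes[~has_c], return_counts=True)
    return collections.Counter(
        {decode(int(v), k): int(n) for v, n in zip(values, counts)}
    )


print(count_fragments("abdcab", 2))  # counts: ab -> 2, bd -> 1
```

Counting with np.unique keeps the tallying inside NumPy, so only the distinct codes are decoded in Python. The per-line result can be merged into a file-wide Counter with its update method. Whether this beats the plain loop depends on line length and k, so it is worth timing on real data.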

