A better way to read in the substrings of a text without a loop (Python)

I am reading lines from a file, traversing each overlapping substring of size k in a loop, and then processing those substrings. What would be a better (more efficient and more elegant) way to read in the substrings? How can I build the list without the loop?

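countFromSb = {}    # substring -> number of occurrences (assumed empty at the start)
linesProcessed = 0

for line in lines[1::4]:  # every 4th line, starting at index 1
    startIdx = 0
    # slide a window of width k across the line, one character at a time
    while startIdx + k <= len(line):
        substring = line[startIdx:startIdx + k]
        countFromSb[substring] = countFromSb.get(substring, 0) + 1
        startIdx += 1
    linesProcessed += 1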

It can be made more elegant by using a collections.Counter instance:

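from collections import Counter

countFromSb = Counter()
# ...
n = -1  # so that lines_processed ends up 0 when no lines are selected
for n, line in enumerate(lines[1::4]):
    # count every overlapping substring of length k in this line
    countFromSb.update(line[i:i+k] for i in range(1+len(line)-k))
lines_processed = n + 1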
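To see the Counter version end to end, here is a minimal sketch on made-up input. The function name `count_substrings` and the sample lines are assumptions for illustration only; they are not part of the original code.

```python
from collections import Counter


def count_substrings(lines, k):
    """Count overlapping length-k substrings in every 4th line, starting at index 1.

    Returns the Counter together with the number of lines that were processed.
    (Hypothetical helper name, used only for this example.)
    """
    counts = Counter()
    n = -1  # stays -1 if no line is selected, so the returned count is 0
    for n, line in enumerate(lines[1::4]):
        # range() is empty when the line is shorter than k, so short lines add nothing
        counts.update(line[i:i + k] for i in range(1 + len(line) - k))
    return counts, n + 1


# Made-up sample data: only the lines at indices 1 and 5 are selected by [1::4].
sample_lines = ["@r1", "ACGTAC", "+", "IIIIII", "@r2", "CGTA", "+", "IIII"]
counts, lines_processed = count_substrings(sample_lines, 3)
```

If the lines come straight from a file, they may still carry a trailing newline, which would then end up inside the last substrings; stripping each line first avoids that.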

You can't iterate over the fixed-size slices of a sequence any faster than O(N), because every one of the slices has to be visited at least once. In terms of asymptotic complexity, your current approach is already as efficient as it gets.
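The O(N) bound follows from the number of windows: a line of length N contains N - k + 1 overlapping substrings of length k (or none if the line is shorter than k). A small sketch, where `window_count` is a hypothetical helper introduced only for this illustration:

```python
def window_count(n, k):
    """Number of overlapping length-k windows in a sequence of length n."""
    return max(0, n - k + 1)


line = "ACGTAC"
k = 3

# Each window has to be produced once, so the work grows linearly with len(line).
windows = [line[i:i + k] for i in range(window_count(len(line), k))]
```

A list comprehension or generator expression hides the loop syntactically, but it still visits each window once.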

In terms of elegance, you could abstract the iteration into its own function, which keeps your current scope less cluttered with one-letter variable names:

def iter_slices(s, size):
    for i in range(len(s)-size+1):
        yield s[i:i+size]

for line in lines[1::4]:
    for substring in iter_slices(line, k):
        countFromSb[substring] = countFromSb.get(substring, 0) + 1
    linesProcessed += 1
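As a quick check of how `iter_slices` behaves, here is a small usage sketch. The input strings are made up for illustration:

```python
def iter_slices(s, size):
    """Yield every overlapping slice of length `size` from `s`, left to right."""
    for i in range(len(s) - size + 1):
        yield s[i:i + size]


# Consecutive windows overlap by size - 1 characters.
pairs = list(iter_slices("abcde", 2))

# When the string is shorter than the window, range() is empty and nothing is yielded.
too_short = list(iter_slices("ab", 5))
```

Because `iter_slices` is a generator, the slices are produced lazily, so a long line does not need all of its substrings in memory at once.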

This can also be combined with Gribouillis' suggestion to use a Counter, eliminating the for blocks entirely:

countFromSb = Counter(substring for line in lines[1::4] for substring in iter_slices(line, k))
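Note that the one-line version no longer maintains `linesProcessed`. A minimal self-contained sketch of one way to keep that count is to slice the selected lines once and take the length of that list. The sample data and the name `selected` are assumptions for illustration, and `itertools.chain.from_iterable` is used here simply as an alternative to the nested generator expression:

```python
from collections import Counter
from itertools import chain


def iter_slices(s, size):
    """Yield every overlapping slice of length `size` from `s`."""
    for i in range(len(s) - size + 1):
        yield s[i:i + size]


# Made-up sample input; in the question, `lines` comes from a file.
lines = ["@r1", "ACGTAC", "+", "IIIIII", "@r2", "CGTA", "+", "IIII"]
k = 3

selected = lines[1::4]  # hypothetical name for the lines that get processed

# Flatten the per-line generators into one stream and count everything in one pass.
countFromSb = Counter(chain.from_iterable(iter_slices(line, k) for line in selected))

# The slice is already a list, so the number of processed lines is just its length.
linesProcessed = len(selected)
```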
