Optimizing string replacement in Python

Question

I have a simple problem. I have some text files in which words have been split at the end of a line (hyphenated). Something like this:

toward an emotionless evalu-
ation of objectively gained

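I would like to get rid of the hyphens and join the words again. This can be done simply and quickly with the replace() function. In some cases, however, there are a few extra newlines after the hyphen. Like this: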

end up as a first rate con-


tribution, but that was not

Instead of stacking several calls to replace(), I switched to regular expressions and used re.sub(r'-\n+', '', text):

import re

def replace_hyphens(text):
    return re.sub(r'-\n+', '', text)
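As a quick check, the regular expression can be run against the two samples shown earlier. This is only a sketch: the name HYPHEN_BREAK is made up for illustration, and the raw-string pattern r'-\n+' is assumed to be equivalent to the pattern used in the question.

```python
import re

# Hypothetical name; the pattern matches a hyphen followed by one or more newlines.
HYPHEN_BREAK = re.compile(r'-\n+')

def replace_hyphens(text):
    return HYPHEN_BREAK.sub('', text)

sample_1 = "toward an emotionless evalu-\nation of objectively gained"
sample_2 = "end up as a first rate con-\n\n\ntribution, but that was not"

print(replace_hyphens(sample_1))  # toward an emotionless evaluation of objectively gained
print(replace_hyphens(sample_2))  # end up as a first rate contribution, but that was not
```

A hyphen that is not followed by a newline, as in "well-known", is left alone by this pattern.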

This works well, but I wondered how to achieve the same result with a function coded directly in Python. This is what I came up with:

def join_hyphens(text):
    processed = ''
    i = 0
    while i < len(text):
        if text[i] == '-':
            while text[i+1] == '\n':
                i += 1
            i += 1
        processed += text[i]
        i += 1
    return processed

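But of course, its performance is terrible compared to the regular expression. Timing both over 100 iterations on a fairly long string gives these results: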

join_hyphens done in 2.398ms
replace_hyphens done in 0.021ms
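For reference, a comparison like this can be reproduced with the standard timeit module. The harness below is a sketch, not the one that produced the figures above: the test string, the helper name time_ms, and the choice to report the mean time per call are all assumptions.

```python
import re
import timeit

def replace_hyphens(text):
    return re.sub(r'-\n+', '', text)

def join_hyphens(text):
    # Copied unchanged from the question; it assumes that some other
    # character always follows a hyphen and its trailing newlines.
    processed = ''
    i = 0
    while i < len(text):
        if text[i] == '-':
            while text[i+1] == '\n':
                i += 1
            i += 1
        processed += text[i]
        i += 1
    return processed

# Hypothetical input: the sample from the question repeated many times.
text = "end up as a first rate con-\n\n\ntribution, but that was not\n" * 200

def time_ms(func, iterations=100):
    """Return the mean time per call in milliseconds."""
    total = timeit.timeit(lambda: func(text), number=iterations)
    return total / iterations * 1000

for func in (join_hyphens, replace_hyphens):
    print("%s done in %.3fms" % (func.__name__, time_ms(func)))
```

The absolute numbers depend on the machine, the Python version, and the input, so only the ratio between the two functions is meaningful.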

What is the best way to improve performance while still using native Python code?

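Edit: Switching to a list, as suggested, improves performance significantly, but it still performs poorly compared to the regular expression: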

def join_hyphens(text):
    processed = []
    i = 0
    while i < len(text):
        if text[i] == '-':
            while text[i+1] == '\n':
                i += 1
            i += 1
        processed.append(text[i])
        i += 1
    return ''.join(processed)

which gives:

    join_hyphens done in 1.769ms
    replace_hyphens done in 0.020ms

Answer 1

processed += text[i]

is very slow once processed becomes large. Strings are immutable, so in-place addition is just an illusion. It is not done in place.
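A small sketch can make the "illusion" visible. The variable names are made up for illustration; the point is that += on a string rebinds the name to a new object, while the same operator on a list mutates the existing object.

```python
original = "con"
alias = original          # both names refer to the same string object

original += "tribution"   # builds a new string and rebinds the name

print(original)  # contribution
print(alias)     # con (the old string object was never modified)

# A list, by contrast, really is modified in place.
pieces = ["con"]
same_list = pieces
pieces += ["tribution"]
print(same_list)  # ['con', 'tribution']
```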

There are several alternatives. A simple one is to build a list and then use str.join:

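def join_hyphens(text):
    processed = []
    i = 0
    n = len(text)
    while i < n:
        if text[i] == '-' and i + 1 < n and text[i+1] == '\n':
            # Skip the hyphen and every newline that follows it.
            i += 1
            while i < n and text[i] == '\n':
                i += 1
        else:
            processed.append(text[i])
            i += 1
    return "".join(processed)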

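join computes the space needed for the resulting string up front, allocates it (in one go), and joins the strings. Everything is done in Python's compiled core, so it is very fast.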

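(Unfortunately, the native Python loops in your code are what make it slow. Regular expressions run in compiled code with no native Python loops, which explains why they are faster. str.join is very useful in other contexts, but several of the other answers here solve the current problem faster.)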

Answer 2

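A bit late to the party, but anyway... Everything in the Python standard library counts as native Python, since it should be available on any Python system, and that includes the re module.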

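However, if you insist on doing it in Python alone, then instead of iterating over the characters one by one you can use native text search to skip over large swaths of text. This should improve performance considerably and in some cases may even beat regex. Of course, concatenating via "".join() is also preferable, as others have said: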

def join_hyphens(text):
    pieces = []  # a simple chunk buffer store
    head = 0  # our current search head
    finder = text.find  # optimize lookup for str.find
    add_piece = pieces.append  # optimize lookup for list.append
    while True:
        index = finder("-\n", head)  # find the next hyphen followed by a newline
        if index >= 0:  # check if a hyphen was found
            add_piece(text[head:index])  # add the current chunk
            head = index + 2  # move the search head past the "-\n"
            # skip extra newlines; slicing avoids an IndexError at the end of the text
            while text[head:head + 1] == "\n":
                head += 1
        else:
            add_piece(text[head:])  # add the last chunk
            break
    return "".join(pieces)  # join the chunks and return them

And to test it:

text = """end up as a first rate con-


tribution, but that was not"""

print(join_hyphens(text))  # end up as a first rate contribution, but that was not
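The same idea of letting built-in string methods do the scanning can also be expressed with str.split. The function below is a hypothetical variant, not part of the answer above, and it has not been timed here.

```python
import re

def join_hyphens_split(text):
    # Hypothetical variant: split on "-\n", then drop the newlines
    # that lead every piece except the first.
    pieces = text.split("-\n")
    return pieces[0] + "".join(piece.lstrip("\n") for piece in pieces[1:])

text = "end up as a first rate con-\n\n\ntribution, but that was not"
print(join_hyphens_split(text))  # end up as a first rate contribution, but that was not

# Cross-check against the regular expression from the question.
assert join_hyphens_split(text) == re.sub(r"-\n+", "", text)
```

Only newlines that directly follow a "-\n" are stripped, so blank lines elsewhere in the text are preserved.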

Answer 3

Building a string with += makes it O(n**2). Making a list of pieces and joining them is O(n), and is faster for any substantial text.

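def join_hyphens(text):
    processed = []
    i = 0
    n = len(text)
    while i < n:
        if text[i] == '-' and i + 1 < n and text[i+1] == '\n':
            # Skip the hyphen and every newline that follows it.
            i += 1
            while i < n and text[i] == '\n':
                i += 1
        else:
            processed.append(text[i])
            i += 1
    return ''.join(processed)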

Edit: there was no sample, so this is untested. But it is a standard idiom. EDIT2: corrected a syntax error.
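The gap between the two complexity classes can be illustrated with a rough cost model that counts copied characters. This is a sketch with hypothetical function names, not a measurement: it models the worst case for +=, and CPython can sometimes extend a string in place, so real timings may be better than the model suggests.

```python
def chars_copied_by_concat(n):
    # Worst case: appending the k-th character copies the k - 1
    # characters already collected plus the new one.
    return sum(range(1, n + 1))  # equals n * (n + 1) / 2

def chars_copied_by_join(n):
    # Rough model: one pass to append n pieces, one pass for join to copy them.
    return 2 * n

for n in (1000, 10000, 100000):
    print(n, chars_copied_by_concat(n), chars_copied_by_join(n))
```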

Answer 4

Try:

def join_hyphens(text):
    while "-\n\n" in text:
        text = text.replace("-\n\n", "-\n")

    return text.replace("-\n", "")

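This will still create multiple strings, but fewer than your approach: it creates one copy of the string for each extra newline in the longest run following a hyphen, plus one final copy to remove every -\n.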
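To make the number of copies concrete, the sketch below instruments the same loop with a counter. The function name and the sample text are made up for illustration.

```python
def join_hyphens_counting(text):
    # Same logic as above, plus a counter for the copies that
    # str.replace creates (hypothetical helper for illustration).
    copies = 0
    while "-\n\n" in text:
        text = text.replace("-\n\n", "-\n")
        copies += 1
    text = text.replace("-\n", "")
    copies += 1
    return text, copies

text = "con-\n\n\ntribution and evalu-\nation"
result, copies = join_hyphens_counting(text)
print(result)  # contribution and evaluation
print(copies)  # 3
```

The longest run here has two extra newlines after the hyphen, so the loop runs twice and the final replace adds the third copy.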

Answer 5

Another option:

def join_hyphens(text):
    return "\n".join([t for t in text.split("\n") if t]).replace("-\n", "")

Split the text on \n, then use a list comprehension to drop the empty lines. Then join it back together with \n and do the replacement.

This is fast, but it has the side effect of removing all blank lines.
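The side effect is easy to see on a text that contains a paragraph break. The input below is a made-up example, and the regular expression from the question is included for comparison.

```python
import re

def join_hyphens(text):
    return "\n".join([t for t in text.split("\n") if t]).replace("-\n", "")

def replace_hyphens(text):
    return re.sub(r"-\n+", "", text)

# Hypothetical input: two paragraphs separated by a blank line.
text = "first para-\n\ngraph ends here.\n\nSecond paragraph."

print(repr(join_hyphens(text)))
# 'first paragraph ends here.\nSecond paragraph.'
print(repr(replace_hyphens(text)))
# 'first paragraph ends here.\n\nSecond paragraph.'
```

Both versions rejoin the hyphenated word, but only the regular expression keeps the blank line between the paragraphs.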

Update: timing results

First, build a random data set:

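import string

import numpy as np

p1 = 0.25  # probability that a line ends with a hyphen
p2 = 0.25  # probability of keeping each candidate extra newline
NLines = 100
text = "\n".join(
    [
        " ".join(
            [
                "".join(
                    [
                        np.random.choice(list(string.ascii_letters))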
                        for _ in range(np.random.randint(1,10))
                    ]
                ) 
                for _ in range(np.random.randint(1,10))
            ]
        )
        + ("-" if np.random.random() < p1 else "") 
        + "".join(["\n" for _ in range(np.random.randint(1,4)) if np.random.random() < p2])
        for _ in range(NLines)
    ]
) + "this is the last line"

Results:

%%timeit
replace_hyphens(text)
#100000 loops, best of 3: 8.1 µs per loop

%%timeit
join_hyphens(text)
#1000 loops, best of 3: 601 µs per loop

%%timeit
join_hyphens_pault(text)
#100000 loops, best of 3: 17.7 µs per loop

%%timeit
join_hyphens_terry(text)
#1000 loops, best of 3: 661 µs per loop

%%timeit
join_hyphens_jean(text)
#1000 loops, best of 3: 653 µs per loop

%%timeit
join_hyphens_patrick(text)
#100000 loops, best of 3: 10.1 µs per loop

%%timeit
join_hyphens_zwer(text)
#100000 loops, best of 3: 14.4 µs per loop

Answer 6

I think part of the poor performance comes from constantly creating new strings, because strings are immutable in Python. So when you do

processed += text[i]

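a new string of size len(processed) + 1 is allocated. To be faster you want to avoid these allocations, so turn the string into a list of characters and mutate that instead. Ideally, you would compute the required space and prefill the output list to avoid unnecessary allocations.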
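One way to read the prefill suggestion is sketched below. It assumes the input length is an acceptable upper bound for the output size, since removing characters can never make the text longer. The function name is hypothetical, and whether this is actually faster than appending to a list has not been measured here.

```python
def join_hyphens_prealloc(text):
    n = len(text)
    # The result can never be longer than the input, so n slots suffice.
    out = [''] * n
    i = 0  # read position in text
    j = 0  # write position in out
    while i < n:
        if text[i] == '-' and i + 1 < n and text[i + 1] == '\n':
            i += 1  # skip the hyphen
            while i < n and text[i] == '\n':
                i += 1  # skip the newlines after it
        else:
            out[j] = text[i]
            j += 1
            i += 1
    # Unused slots still hold '', so joining the whole list is harmless.
    return ''.join(out)

print(join_hyphens_prealloc("con-\n\n\ntribution"))  # contribution
```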

6 answers

Answer 1: 5, 2018-04-04 19:18:47
Answer 2: 5, accepted, 2018-04-04 19:33:52
Answer 3: 4, 2018-04-04 19:18:16
Answer 4: 3, 2018-04-04 19:19:00
Answer 5: 3, 2018-04-04 19:22:51
Answer 6: 1, 2018-04-04 19:18:41
