
Fastest way to substitute a set of characters in a string

I'm working with a string of bytes (anywhere between 10 KB and 3 MB) and I need to filter out approximately 16 byte values, replacing each with another byte.

At the moment I have a function a bit like this:

BYTE_REPLACE = {
  52: 7, # first number is the byte I want to replace
  53: 12, # while the second number is the byte I want to replace it WITH
}
def filter(st):
  for b in BYTE_REPLACE:
    st = st.replace(chr(b),chr(BYTE_REPLACE[b]))
  return st

(Byte list paraphrased for the sake of this question)

Using map resulted in an execution time of ~0.33 s, while this version is about 10x faster at ~0.03 s (both measured on a huge string, larger than 1.5 MB compressed).

While any further performance gain would probably be negligible, is there a better way of doing this?

(I am aware that it would be far better to store the filtered string. That isn't an option, though: I'm fooling with a Minecraft Classic server's level format and have to filter out bytes that certain clients don't support.)

Use str.translate:

Python 3.x

def subs(st):
    return st.translate(BYTE_REPLACE)

Example usage:

>>> subs('4567')
'\x07\x0c67'
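Since the question is about raw bytes, on Python 3 the data would more likely be a bytes object than a str. A minimal sketch of the same idea for that case, assuming the BYTE_REPLACE mapping from the question (the names TABLE and subs_bytes are made up for illustration):

```python
BYTE_REPLACE = {52: 7, 53: 12}

# bytes.maketrans takes two equal-length bytes objects: the source bytes
# and the bytes they should become. Dict keys and values iterate in the
# same order, so the pairs stay aligned.
TABLE = bytes.maketrans(bytes(BYTE_REPLACE.keys()), bytes(BYTE_REPLACE.values()))

def subs_bytes(data):
    # One pass over the data; bytes not in the table are left unchanged.
    return data.translate(TABLE)

print(subs_bytes(b"4567"))  # b'\x07\x0c67'
```

The table is built once at import time, so each call only pays for the translate pass itself.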

Python 2.x

str.translate (Python 2)

import string
k, v = zip(*BYTE_REPLACE.iteritems())
k, v = ''.join(map(chr, k)), ''.join(map(chr, v))
tbl = string.maketrans(k, v)
def subs(st):
    return st.translate(tbl)

Look up the translate() method on strings. It lets you do any number of 1-byte transformations in a single pass over the string. Use the string.maketrans() function to build the translation table. If you usually have 16 pairs, this should run roughly 16 times faster than doing 16 separate 1-byte replacements.

In your current design, str.replace() is called on the string once for each pair, n times in total. While it's most likely an efficient algorithm, on a 3 MB string it might slow down.

If the string is already in memory by the time this function is called, I'd wager that the most efficient way would be:
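One detail to be aware of when switching: chained replace calls see the output of earlier calls, whereas translate maps each original byte exactly once. A small Python 3 sketch of the difference; the function names and the "chained" mapping are hypothetical, chosen only to show the effect:

```python
def replace_loop(data, mapping):
    # One full pass over the data per pair, as in the question.
    for src, dst in mapping.items():
        data = data.replace(bytes([src]), bytes([dst]))
    return data

def translate_once(data, mapping):
    # A single pass using a 256-entry translation table.
    table = bytes.maketrans(bytes(mapping.keys()), bytes(mapping.values()))
    return data.translate(table)

plain = {52: 7, 53: 12}     # mapping from the question
chained = {52: 53, 53: 12}  # hypothetical: a target that is also a source

sample = b"4567"

# With no overlap between sources and targets, both approaches agree.
assert replace_loop(sample, plain) == translate_once(sample, plain)

# With overlap, the loop rewrites bytes it has already replaced.
print(replace_loop(sample, chained))    # b'\x0c\x0c67'
print(translate_once(sample, chained))  # b'5\x0c67'
```

For the mapping in the question the two give the same result, so this only matters if a replacement byte is itself in the set being replaced.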

BYTE_REPLACE = {
  52: 7, # first number is the byte I want to replace
  53: 12, # while the second number is the byte I want to replace it WITH
}
def filter(st):
  st = list(st) # Convert string to list to edit in place :/
  for i, s in enumerate(st): # iterate through list
    if ord(s) in BYTE_REPLACE: # membership test on the dict itself
      st[i] = chr(BYTE_REPLACE[ord(s)]) # write the replacement back into the list
  return "".join(st) # return string

There is a large operation to create a new list at the start, and another to convert back to a string, but since Python strings are immutable, your design makes a new string for each replacement.

This is all based on conjecture, and could be wrong. You'd want to test it with your actual data.
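On Python 3, a bytearray is a mutable byte sequence, so the same edit-in-place idea can be written without the list and join round trip. A sketch under that assumption (filter_in_place is a made-up name); it still loops over every byte in Python, so it is not necessarily faster than the alternatives:

```python
BYTE_REPLACE = {52: 7, 53: 12}

def filter_in_place(buf):
    # buf is a bytearray; iterating it yields ints, so no ord()/chr() is needed.
    for i, b in enumerate(buf):
        new = BYTE_REPLACE.get(b)
        if new is not None:
            buf[i] = new  # overwrite the byte without allocating a new buffer
    return buf

data = bytearray(b"4567")
filter_in_place(data)
print(data)  # bytearray(b'\x07\x0c67')
```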
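A rough way to run that test on Python 3 is timeit. The sketch below uses random bytes as a stand-in for the real level data, which is an assumption; the function names are illustrative, and timings will vary by machine and input:

```python
import os
import timeit

BYTE_REPLACE = {52: 7, 53: 12}
TABLE = bytes.maketrans(bytes(BYTE_REPLACE.keys()), bytes(BYTE_REPLACE.values()))

def with_replace(data):
    # One pass over the data per pair.
    for src, dst in BYTE_REPLACE.items():
        data = data.replace(bytes([src]), bytes([dst]))
    return data

def with_translate(data):
    # One pass over the data in total.
    return data.translate(TABLE)

payload = os.urandom(1_000_000)  # stand-in for the real data

# Check that both approaches agree before comparing their speed.
assert with_replace(payload) == with_translate(payload)

runs = 5
for fn in (with_replace, with_translate):
    seconds = timeit.timeit(lambda: fn(payload), number=runs)
    print(f"{fn.__name__}: {seconds / runs:.4f} s per call")
```

Swapping payload for an actual level file would give numbers that reflect the real byte distribution.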
