How do I split a string and rejoin it without creating an intermediate list in Python?

Say I have something like the following:

dest = "\n".join( [line for line in src.split("\n") if line[:1]!="#"] )

(i.e. strip any lines starting with # from the multi-line string src)

src is very large, so I'm assuming .split() will create a large intermediate list. I can change the list comprehension to a generator expression, but is there some kind of "xsplit" I can use to only work on one line at a time? Is my assumption correct? What's the most (memory) efficient way to handle this?

Clarification: This arose due to my code running out of memory. I know there are ways to entirely rewrite my code to work around that, but the question is about Python: Is there a version of split() (or an equivalent idiom) that behaves like a generator and hence doesn't make an additional working copy of src?

from StringIO import StringIO  # Python 2; on Python 3: from io import StringIO

buffer = StringIO(src)
dest = "".join(line for line in buffer if line[:1]!="#")

Of course, this really makes the most sense if you use StringIO throughout. It works mostly the same as files. You can seek, read, write, iterate (as shown), etc.
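For reference, a minimal Python 3 sketch of the same idea, assuming io.StringIO is available. The function name strip_comment_lines is made up for illustration and is not part of the answer above.

```python
from io import StringIO  # Python 3 location of StringIO (an assumption about your version)


def strip_comment_lines(src):
    """Return src without the lines that start with '#'."""
    buffer = StringIO(src)
    # Iterating over a StringIO yields one line at a time, and each line
    # keeps its trailing newline, so the pieces are joined with "" and not "\n".
    return "".join(line for line in buffer if not line.startswith("#"))


src = "hello\n#foo\n#bar\n#baz\nworld\n"
dest = strip_comment_lines(src)
print(repr(dest))
```

For this sample input the result matches what the split/join version in the question produces. Inputs that do not end in a newline may differ in their final character.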

Here's a way to do a general type of split using itertools:

>>> import itertools as it
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (''.join(j) for i,j in it.groupby(src, "\n".__ne__) if i)
>>> '\n'.join(s for s in line_gen if s[0]!="#")
'hello\nworld'

groupby treats each char in src separately, so the performance probably isn't stellar, but it does avoid creating any intermediate huge data structures.

It's probably better to spend a few lines and make a generator:

>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>>
>>> def isplit(s, t): # iterator to split string s at character t
...     i=j=0
...     while True:
...         try:
...             j = s.index(t, i)
...         except ValueError:
...             if i<len(s):
...                 yield s[i:]
...             return  # ends the generator (raise StopIteration is an error on Python 3.7+)
...         yield s[i:j]
...         i = j+1
...
>>> '\n'.join(x for x in isplit(src, '\n') if x[:1]!='#')
'hello\nworld'

re has a function called finditer that could be used for this purpose too:

>>> import re
>>> src="hello\n#foo\n#bar\n#baz\nworld\n"
>>> line_gen = (m.group(1) for m in re.finditer("(.*?)(\n|$)",src))
>>> '\n'.join(s for s in line_gen if not s.startswith("#"))
'hello\nworld\n'

Comparing the performance is an exercise for the OP to try on the real data.

In your existing code you can change the list to a generator expression:

dest = "\n".join(line for line in src.split("\n") if line[:1]!="#")

This very small change avoids the construction of one of the two temporary lists in your code, and requires no effort on your part.

A completely different approach that avoids the temporary construction of both lists is to use a regular expression:

import re
regex = re.compile('^#.*\n?', re.M)
dest = regex.sub('', src)

This will not only avoid creating temporary lists, it will also avoid creating temporary strings for each line in the input. Here are some performance measurements of the proposed solutions:

init = r'''
import re, StringIO
regex = re.compile('^#.*\n?', re.M)
src = ''.join('foo bar baz\n' for _ in range(100000))
'''

method1 = r'"\n".join([line for line in src.split("\n") if line[:1] != "#"])'
method2 = r'"\n".join(line for line in src.split("\n") if line[:1] != "#")'
method3 = 'regex.sub("", src)'
method4 = '''
buffer = StringIO.StringIO(src)
dest = "".join(line for line in buffer if line[:1] != "#")
'''

import timeit

for method in [method1, method2, method3, method4]:
    print timeit.timeit(method, init, number = 100)

Results:

 9.38s   # Split then join with temporary list
 9.92s   # Split then join with generator
 8.60s   # Regular expression
64.56s   # StringIO

As you can see, the regular expression is the fastest method.

From your comments I can see that you are not actually interested in avoiding creating temporary objects. What you really want is to reduce the memory requirements for your program. Temporary objects don't necessarily affect the memory consumption of your program, as Python is good about clearing up memory quickly. The problem comes from having objects that persist in memory longer than they need to, and all these methods have this problem.
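If you want to check the memory side on your own data and not only the timings, something like the following rough sketch may help. It assumes Python 3 and the standard tracemalloc module; the sample input and the helper names (split_join, regex_sub, peak_bytes) are made up for illustration, and the numbers will vary by Python version and input.

```python
import re
import tracemalloc

# Illustrative input: mostly ordinary lines, with a comment on every tenth line.
src = "".join(
    ("# comment\n" if i % 10 == 0 else "foo bar baz\n") for i in range(100000)
)
regex = re.compile(r"^#.*\n?", re.M)


def split_join(s):
    # Split then join with a generator expression, as in the question.
    return "\n".join(line for line in s.split("\n") if line[:1] != "#")


def regex_sub(s):
    # Remove comment lines with a single regular-expression substitution.
    return regex.sub("", s)


def peak_bytes(func, arg):
    """Return the peak size in bytes traced by tracemalloc while func(arg) runs."""
    tracemalloc.start()
    try:
        func(arg)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


peaks = {f.__name__: peak_bytes(f, src) for f in (split_join, regex_sub)}
for name, peak in peaks.items():
    print("%-10s peak: %.1f MiB" % (name, peak / 2.0**20))
```

Note that src itself is created before tracing starts, so the figures only cover the extra memory each method allocates on top of the input.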

If you are still running out of memory then I'd suggest that you shouldn't be doing this operation entirely in memory. Instead store the input and output in files on the disk and read from them in a streaming fashion. This means that you read one line from the input, write a line to the output, read a line, write a line, etc. This will create lots of temporary strings but even so it will require almost no memory because you only need to handle the strings one at a time.
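A minimal sketch of that streaming approach, assuming Python 3. The function name filter_comment_lines and the file names are made up for illustration; the temporary directory is only there so the example can run on its own.

```python
import os
import tempfile


def filter_comment_lines(in_path, out_path):
    """Copy in_path to out_path, skipping lines that start with '#'."""
    with open(in_path, "r") as fin, open(out_path, "w") as fout:
        for line in fin:  # a file object yields one line at a time
            if not line.startswith("#"):
                fout.write(line)


# Small demonstration with temporary files (paths are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "input.txt")
    out_path = os.path.join(tmp, "output.txt")
    with open(in_path, "w") as f:
        f.write("hello\n#foo\n#bar\n#baz\nworld\n")
    filter_comment_lines(in_path, out_path)
    with open(out_path) as f:
        result = f.read()

print(repr(result))
```

Only one line is held by the loop at any time, so the memory needed does not grow with the size of the input file.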

If I understand your question about "more generic calls to split()" correctly, you could use re.finditer, like so:

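import re

output = ""

# src is the multi-line input string (the name input would shadow a builtin).
# Note that this pattern only matches lines that end in a newline.
for m in re.finditer("^.*\n", src, re.M):
    line = m.group(0).rstrip("\n")  # drop only the newline, keep indentation
    if line.startswith("#"):
        continue
    output += line + "\n"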

Here you can replace the regular expression with something more sophisticated.
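One way to make this more generic is to wrap re.finditer in a generator that yields the pieces between separator matches. This is only a sketch: the name iter_split is hypothetical, and it assumes the separator pattern never matches the empty string.

```python
import re


def iter_split(s, sep_pattern):
    """Lazily yield the pieces of s between matches of sep_pattern.

    Assumes sep_pattern never matches the empty string.
    """
    start = 0
    for m in re.finditer(sep_pattern, s):
        yield s[start:m.start()]
        start = m.end()
    yield s[start:]  # the piece after the last separator (may be empty)


src = "hello\n#foo\n#bar\n#baz\nworld\n"
dest = "\n".join(line for line in iter_split(src, r"\n") if line[:1] != "#")
print(repr(dest))
```

For a plain "\n" separator this yields the same pieces as src.split("\n"), one at a time, and a pattern such as r"\r?\n" would handle Windows line endings as well.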

The problem is that strings are immutable in Python, so it is hard to do anything without intermediate storage.
