Fastest way to remove the first and last lines from a Python string

Question

I have a Python script that, for various reasons, has a variable that is a fairly large string, say 10 MB long. This string contains multiple lines.

What is the fastest way to remove the first and last lines of this string? Due to the size of the string, the faster the operation, the better; there is an emphasis on speed. The program returns a slightly smaller string, sans the first and last lines.

'\n'.join(string_variable[-1].split('\n')[1:-1]) is the easiest way to do this, but it's extremely slow because the split() function copies the object in memory, and the join() copies it again.

Example string:

*** START OF DATA ***
data
data
data
*** END OF DATA ***

Extra credit: Have this program not choke if there is no data in between; this is optional, since for my case there shouldn't be a string with no data in between.

Answer 1

First split at '\n' once, then check whether the last item of the result contains '\n'. If it does, str.rsplit at '\n' once and pick the item at index 0; otherwise return an empty string:

>>> def solve(s):
    s = s.split('\n', 1)[-1]
    if s.find('\n') == -1:
        return ''
    return s.rsplit('\n', 1)[0]
... 
>>> s = '''*** START OF DATA ***
data
data
data
*** END OF DATA ***'''
>>> solve(s)
'data\ndata\ndata'
>>> s = '''*** START OF DATA ***
*** END OF DATA ***'''
>>> solve(s)
''
>>> s = '\n'.join(['a'*100]*10**5)
>>> %timeit solve(s)
100 loops, best of 3: 4.49 ms per loop

Or don't split at all: find the index of '\n' from either end and slice the string:

>>> def solve_fast(s):
    ind1 = s.find('\n')
    ind2 = s.rfind('\n')
    return s[ind1+1:ind2]
... 
>>> s = '''*** START OF DATA ***
data
data
data
*** END OF DATA ***'''
>>> solve_fast(s)
'data\ndata\ndata'
>>> s = '''*** START OF DATA ***
*** END OF DATA ***'''
>>> solve_fast(s)
''
>>> s = '\n'.join(['a'*100]*10**5)
>>> %timeit solve_fast(s)
100 loops, best of 3: 2.65 ms per loop
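
One edge case is worth noting for the slicing approach: if the string contains no '\n' at all, both find and rfind return -1, so the slice becomes s[0:-1] and only the last character is dropped. The sketch below adds a guard for that case. The name strip_first_last is hypothetical, and returning '' for a string with fewer than two lines is an assumption about the desired behavior, not something the question specifies.

```python
def strip_first_last(s):
    """Return s without its first and last lines (sketch, see assumptions above)."""
    first = s.find('\n')
    if first == -1:
        # No newline at all: the only line is both first and last.
        return ''
    last = s.rfind('\n')
    # When first == last there are exactly two lines and the slice is empty.
    return s[first + 1:last]


print(repr(strip_first_last('*** START ***\ndata\ndata\n*** END ***')))
print(repr(strip_first_last('*** START ***\n*** END ***')))
print(repr(strip_first_last('no newline here')))
```

The guard costs one extra comparison, so it should not change the timing picture in any meaningful way.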

Answer 2

Consider a string s that looks something like this:

s = "line1\nline2\nline3\nline4\nline5"

The following code...

s[s.find('\n')+1:s.rfind('\n')]

...produces the output:

'line2\nline3\nline4'

Thus, it is the shortest code that removes the first and last lines of a string. I do not think that the .find and .rfind methods do anything but search for a given string. Try out the speed!
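
To try out the speed outside of IPython, a minimal sketch using the standard timeit module could look like the following. The function names split_join and find_slice and the sample data are assumptions chosen for illustration; absolute timings will vary by machine and Python version, so no expected numbers are shown.

```python
import timeit


def split_join(s):
    # Split into a list of lines, drop the first and last, then rejoin.
    return '\n'.join(s.split('\n')[1:-1])


def find_slice(s):
    # Locate the first and last newline and take the slice between them.
    return s[s.find('\n') + 1:s.rfind('\n')]


# Roughly 10 MB of sample data: 100,000 lines of 100 characters each.
text = '\n'.join(['a' * 100] * 10**5)

# Both approaches should agree on input with at least two lines.
assert split_join(text) == find_slice(text)

for func in (split_join, find_slice):
    # Best of 3 runs, 10 calls per run, reported per call.
    best = min(timeit.repeat(lambda: func(text), number=10, repeat=3)) / 10
    print(f'{func.__name__}: {best * 1e3:.2f} ms per call')
```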

Answer 3

Depending on how your use case will consume the string, the fastest way to remove the lines may be to not remove them at all.

If you plan to access the lines in the string sequentially, you can build a generator that skips the first and last lines and yields each line as it is consumed, rather than building a new set of copies of all the lines at once.

An ad hoc way to skip the first and last lines while iterating over the string, without generating unnecessary copies, is to keep track of three consecutive lines and return only the second one. This way the iteration concludes before reaching the last line, without needing to know the position of the last line break.

The following function should give you the desired output:

def split_generator(s):
  # Keep track of start/end positions for three lines
  start_prev = end_prev = 0
  start = end = 0
  start_next = end_next = 0

  nr_lines = 0

  for idx, c in enumerate(s):
    if c == '\n':
      nr_lines += 1

      start_prev = start
      end_prev = end
      start = start_next
      end = end_next
      start_next = end_next
      end_next = idx

      if nr_lines >= 3:
        yield s[(start + 1) : end]

  # Handle the case when the input string does not end with "\n"
  # (endswith also avoids an IndexError on an empty string)
  if not s.endswith('\n') and nr_lines >= 2:
    yield s[(start_next+1):end_next]

You can test it with:

print("1st example")
for filtered_strs in split_generator('first\nsecond\nthird'):
  print(filtered_strs)

print("2nd example")
for filtered_strs in split_generator('first\nsecond\nthird\n'):
  print(filtered_strs)

print("3rd example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth'):
  print(filtered_strs)

print("4th example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth\n'):
  print(filtered_strs)

print("5th example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth\nfifth'):
  print(filtered_strs)

This generates the output:

1st example
second
2nd example
second
3rd example
second
third
4th example
second
third
5th example
second
third
fourth

Note that the biggest advantage of this approach is that it creates only one new line at a time and takes virtually no time to produce the first line of output (rather than waiting for all the lines to be found before proceeding). Again, whether that is useful depends on your use case.
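
Iterating character by character in pure Python can be slow on a 10 MB string. As a possible variation on the same idea, the sketch below lets str.find locate each line break instead. The name iter_inner_lines is hypothetical, and treating a single trailing '\n' as the end of the last line (rather than the start of an empty one) is an assumption made to match the example outputs above.

```python
def iter_inner_lines(s):
    """Yield each line of s except the first and the last, one at a time."""
    start = s.find('\n')  # position of the newline ending the first line
    if start == -1:
        return  # fewer than two lines: nothing to yield
    # Exclude one trailing newline from the search range, if present.
    end_limit = len(s) - 1 if s.endswith('\n') else len(s)
    while True:
        nxt = s.find('\n', start + 1, end_limit)
        if nxt == -1:
            return  # what remains is the last line, which is skipped
        yield s[start + 1:nxt]
        start = nxt


for line in iter_inner_lines('first\nsecond\nthird\nfourth\nfifth'):
    print(line)
```

As with the generator above, only one line is materialized at a time, so the first result is available without scanning the whole string.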

Answer 4

Another method is to split the data at newlines and then rejoin everything but the first and last line:

>>> s = '*** START OF DATA *** \n\
... data\n\
... data\n\
... data\n\
... *** END OF DATA ***'
>>> '\n'.join(s.split('\n')[1:-1])
'data\ndata\ndata'

This works fine with no data:

>>> s = '*** START OF DATA *** \n\
... *** END OF DATA ***'
>>> '\n'.join(s.split('\n')[1:-1])
''
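
Since the question is concerned about the copies made by split() and join(), it may help to measure them. The following is a rough sketch using the standard tracemalloc module to compare peak allocations of the split/join approach with a find/rfind slice. The helper peak_bytes and the other function names are hypothetical, and the exact byte counts depend on the Python build, so none are shown.

```python
import tracemalloc


def peak_bytes(func, arg):
    """Return the peak number of bytes allocated while func(arg) runs."""
    tracemalloc.start()
    try:
        func(arg)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def strip_split_join(s):
    # Creates one new string per line, a list to hold them, then the joined result.
    return '\n'.join(s.split('\n')[1:-1])


def strip_slice(s):
    # Creates a single new string: the slice between the first and last newline.
    return s[s.find('\n') + 1:s.rfind('\n')]


text = '\n'.join(['a' * 100] * 10**4)  # about 1 MB of sample data
print('split/join peak bytes:', peak_bytes(strip_split_join, text))
print('slice peak bytes:     ', peak_bytes(strip_slice, text))
```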

Answer 5

You could just slice off the first and last items after splitting. Simple, Pythonic.

mydata = '''
data
data
data
'''

for data in mydata.split('\n')[1:-1]:
    print(data)

5 answers

Answer 1
9 2015-01-25 07:57:59

Answer 2
7 2016-01-07 10:48:23

Answer 3
0 2015-01-26 09:09:24

Answer 4
0 2016-06-09 21:36:56

Answer 5
0 2023-01-13 07:24:43