Fastest way to remove first and last lines from a Python string

I have a Python script that, for various reasons, has a variable holding a fairly large string, say 10 MB long. This string contains multiple lines.

What is the fastest way to remove the first and last lines of this string? Due to the size of the string, the faster the operation, the better; there is an emphasis on speed. The program should return a slightly smaller string, without the first and last lines.

'\n'.join(string_variable.split('\n')[1:-1]) is the easiest way to do this, but it's extremely slow because split() copies the object in memory, and join() copies it again.

Example string:

*** START OF DATA ***
data
data
data
*** END OF DATA ***

Extra credit: have this program not choke if there is no data in between. This is optional, since in my case there shouldn't be a string with no data in between.

First split at '\n' once, then check whether the remaining string still contains '\n'. If it does, call str.rsplit at '\n' once and take the item at index 0; otherwise return an empty string:

>>> def solve(s):
    s = s.split('\n', 1)[-1]
    if s.find('\n') == -1:
        return ''
    return s.rsplit('\n', 1)[0]
... 
>>> s = '''*** START OF DATA ***
data
data
data
*** END OF DATA ***'''
>>> solve(s)
'data\ndata\ndata'
>>> s = '''*** START OF DATA ***
*** END OF DATA ***'''
>>> solve(s)
''
>>> s = '\n'.join(['a'*100]*10**5)
>>> %timeit solve(s)
100 loops, best of 3: 4.49 ms per loop

Or don't split at all: find the index of '\n' from either end and slice the string:

>>> def solve_fast(s):
    ind1 = s.find('\n')
    ind2 = s.rfind('\n')
    return s[ind1+1:ind2]
... 
>>> s = '''*** START OF DATA ***
data
data
data
*** END OF DATA ***'''
>>> solve_fast(s)
'data\ndata\ndata'
>>> s = '''*** START OF DATA ***
*** END OF DATA ***'''
>>> solve_fast(s)
''
>>> s = '\n'.join(['a'*100]*10**5)
>>> %timeit solve_fast(s)
100 loops, best of 3: 2.65 ms per loop
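
The %timeit lines above only work inside IPython. As a rough sketch, the same comparison can be reproduced in a plain script with the standard timeit module. The name split_join is a hypothetical label for the baseline from the question, and the absolute numbers will differ from machine to machine:

```python
import timeit

def split_join(s):
    # Baseline from the question (hypothetical name): split every line,
    # then join the middle ones back together.
    return '\n'.join(s.split('\n')[1:-1])

def solve(s):
    # Split once from each end.
    s = s.split('\n', 1)[-1]
    if s.find('\n') == -1:
        return ''
    return s.rsplit('\n', 1)[0]

def solve_fast(s):
    # Locate the first and last newline, then take a single slice.
    return s[s.find('\n') + 1:s.rfind('\n')]

# Same test input as above: 100,000 lines of 100 characters each.
s = '\n'.join(['a' * 100] * 10**5)

# All three approaches should agree before their timings are compared.
assert split_join(s) == solve(s) == solve_fast(s)

for func in (split_join, solve, solve_fast):
    # Best of 3 repeats, 5 calls per repeat.
    best = min(timeit.repeat(lambda: func(s), number=5, repeat=3)) / 5
    print(f'{func.__name__}: {best * 1e3:.2f} ms per call')
```

Checking that the results are equal first guards against comparing the speed of functions that do not compute the same thing.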

Consider a string s that looks something like this:

s = "line1\nline2\nline3\nline4\nline5"

The following code...

s[s.find('\n')+1:s.rfind('\n')]

...produces the output:

'line2\nline3\nline4'

This is therefore the shortest code for removing the first and last lines of a string. I do not think the .find and .rfind methods do anything other than search for the given substring. Try out the speed!
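
One edge case is worth noting: when the string contains no newline at all, both find and rfind return -1, so the slice becomes s[0:-1] and silently drops the last character instead of returning an empty string. A minimal sketch of a guarded variant follows; strip_outer_lines is a hypothetical name, not part of the answer above:

```python
def strip_outer_lines(s):
    """Return s without its first and last lines ('' if nothing lies between)."""
    first = s.find('\n')
    last = s.rfind('\n')
    if first == last:
        # Zero newlines (both are -1) or exactly one newline:
        # there is no line between the first and the last one.
        return ''
    # A single slice, so only one copy of the data is made.
    return s[first + 1:last]


print(repr(strip_outer_lines("line1\nline2\nline3\nline4\nline5")))
print(repr(strip_outer_lines("*** START OF DATA ***\n*** END OF DATA ***")))
print(repr(strip_outer_lines("a single line")))
```

The extra comparison costs almost nothing next to the two searches, so the guard should not change the timing in any meaningful way.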

Depending on how your use case consumes the string, the fastest way to remove the lines may be not to remove them at all.

If you plan to access the lines in the string sequentially, you can build a generator that skips the first and last lines and yields each line as it is consumed, rather than building a whole new set of copies of all the lines.

An ad-hoc way to skip the first and last lines while iterating over the string, without generating unnecessary copies, is to keep track of three consecutive lines and return only the second one. This way the iteration concludes before reaching the last line, without needing to know the position of the last line break.

The following function should give you the desired output:

def split_generator(s):
  # Keep track of start/end positions for three lines
  start_prev = end_prev = 0
  start = end = 0
  start_next = end_next = 0

  nr_lines = 0

  for idx, c in enumerate(s):
    if c == '\n':
      nr_lines += 1

      start_prev = start
      end_prev = end
      start = start_next
      end = end_next
      start_next = end_next
      end_next = idx

      if nr_lines >= 3:
        yield s[(start + 1) : end]

  # Handle the case when the input string does not end with "\n"
  # (endswith also avoids an IndexError on an empty string)
  if not s.endswith('\n') and nr_lines >= 2:
    yield s[(start_next+1):end_next]

You can test it with:

print("1st example")
for filtered_strs in split_generator('first\nsecond\nthird'):
  print(filtered_strs)

print("2nd example")
for filtered_strs in split_generator('first\nsecond\nthird\n'):
  print(filtered_strs)

print("3rd example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth'):
  print(filtered_strs)

print("4th example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth\n'):
  print(filtered_strs)

print("5th example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth\nfifth'):
  print(filtered_strs)

This generates the output:

1st example
second
2nd example
second
3rd example
second
third
4th example
second
third
5th example
second
third
fourth

Note that the biggest advantage of this approach is that it creates only one new line at a time and takes virtually no time to produce the first line of output (rather than waiting for all the lines to be found before proceeding). Again, whether that is useful depends on your use case.
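
The generator above inspects the string one character at a time in Python code. As a sketch of the same lazy idea, the line breaks can instead be located with str.find, which does the searching inside the interpreter and is likely (though not measured here) to be faster on a large input. The name iter_inner_lines is hypothetical, and the function assumes the same convention as above: a trailing "\n" terminates the last line rather than starting a new empty one.

```python
def iter_inner_lines(s):
    """Lazily yield every line of s except the first and the last."""
    # Skip the first line: start right after the first newline.
    start = s.find('\n') + 1
    if start == 0:
        return  # no newline at all, so there are no inner lines
    end = s.find('\n', start)
    while end != -1:
        nxt = s.find('\n', end + 1)
        if nxt == -1 and end == len(s) - 1:
            # s ends with "\n", so s[start:end] is the last line: skip it.
            return
        yield s[start:end]
        start = end + 1
        end = nxt
    # When the loop ends normally, s[start:] is the unterminated last
    # line, which is skipped as well.


for text in ('first\nsecond\nthird',
             'first\nsecond\nthird\n',
             'first\nsecond\nthird\nfourth\nfifth'):
    print(list(iter_inner_lines(text)))
```

Each newline is searched for only once, because the position found as nxt becomes end on the next iteration.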

Another method is to split the data at newlines and then rejoin everything but the first and last lines:

>>> s = '*** START OF DATA *** \n\
... data\n\
... data\n\
... data\n\
... *** END OF DATA ***'
>>> '\n'.join(s.split('\n')[1:-1])
'data\ndata\ndata'

This works fine with no data:

>>> s = '*** START OF DATA *** \n\
... *** END OF DATA ***'
>>> '\n'.join(s.split('\n')[1:-1])
''

You could just split and then slice off the first and last items. Simple, Pythonic.

mydata = '''
data
data
data
'''

for data in mydata.split('\n')[1:-1]:
    print(data)
