
Fastest way to remove first and last lines from a Python string

I have a Python script that, for various reasons, has a variable that is a fairly large string, say 10 MB long. This string contains multiple lines.

What is the fastest way to remove the first and last lines of this string? Due to the size of the string, the faster the operation, the better; there is an emphasis on speed. The program returns a slightly smaller string, sans the first and last lines.

'\n'.join(string_variable[-1].split('\n')[1:-1]) is the easiest way to do this, but it's extremely slow because the split() function copies the object in memory, and the join() copies it again.

Example string:

*** START OF DATA ***
data
data
data
*** END OF DATA ***

Extra credit: Have this program not choke if there is no data in between; this is optional, since for my case there shouldn't be a string with no data in between.

First, split once at '\n' and keep the last piece. If what remains contains no '\n', return an empty string; otherwise, use str.rsplit to split once at '\n' from the right and pick the item at index 0:

def solve(s):
    s = s.split('\n', 1)[-1]
    if s.find('\n') == -1:
        return ''
    return s.rsplit('\n', 1)[0]
... 
>>> s = '''*** START OF DATA ***
data
data
data
*** END OF DATA ***'''
>>> solve(s)
'data\ndata\ndata'
>>> s = '''*** START OF DATA ***
*** END OF DATA ***'''
>>> solve(s)
''
>>> s = '\n'.join(['a'*100]*10**5)
>>> %timeit solve(s)
100 loops, best of 3: 4.49 ms per loop
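
To make the two steps inside solve concrete, here is a short trace of the intermediate values. The names rest and inner are only illustrative and do not appear in the answer above.

```python
s = "*** START OF DATA ***\ndata\ndata\n*** END OF DATA ***"

# Step 1: split once from the left; the last piece is everything after the first line.
rest = s.split('\n', 1)[-1]
print(repr(rest))   # 'data\ndata\n*** END OF DATA ***'

# Step 2: split once from the right; the first piece is everything before the last line.
inner = rest.rsplit('\n', 1)[0]
print(repr(inner))  # 'data\ndata'
```

Because each split is limited to one cut, only two pieces are created per step instead of one piece per line.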

Or don't split at all: find the index of '\n' from each end and slice the string:

>>> def solve_fast(s):
    ind1 = s.find('\n')
    ind2 = s.rfind('\n')
    return s[ind1+1:ind2]
... 
>>> s = '''*** START OF DATA ***
data
data
data
*** END OF DATA ***'''
>>> solve_fast(s)
'data\ndata\ndata'
>>> s = '''*** START OF DATA ***
*** END OF DATA ***'''
>>> solve_fast(s)
''
>>> s = '\n'.join(['a'*100]*10**5)
>>> %timeit solve_fast(s)
100 loops, best of 3: 2.65 ms per loop
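
One edge case that solve_fast does not cover is a string with no newline at all: find and rfind both return -1, so the slice becomes s[0:-1] and only the last character is dropped. A minimal sketch of a guarded variant follows; the name strip_first_last_line is hypothetical, and returning an empty string for a single-line input is an assumption about the desired behavior.

```python
def strip_first_last_line(s):
    """Return s without its first and last lines, using find/rfind and one slice."""
    first = s.find('\n')
    if first == -1:
        # No newline: the only line is both first and last (assumed to yield '').
        return ''
    last = s.rfind('\n')
    # When there are exactly two lines, first == last and the slice is empty.
    return s[first + 1:last]


print(repr(strip_first_last_line('a\nb\nc')))  # 'b'
print(repr(strip_first_last_line('a\nb')))     # ''
print(repr(strip_first_last_line('abc')))      # ''
```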

Consider a string s that is something like this:

s = "line1\nline2\nline3\nline4\nline5"

The following code...

s[s.find('\n')+1:s.rfind('\n')]

...produces the output:

'line2\nline3\nline4'

This is, thus, the shortest code to remove the first and last lines of a string. As far as I know, the .find and .rfind methods do nothing but search for the given substring, so the only copy made is the final slice. Try out the speed!
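
For anyone who wants to try the speed outside IPython, here is a rough sketch using the standard timeit module. The helper names split_join, find_slice, and benchmark are made up for this example, and the sample string mirrors the one used in the timings earlier in the thread; actual numbers will depend on your machine and Python version.

```python
import timeit


def split_join(s):
    # Baseline from the question: builds a list of every line, then joins it again.
    return '\n'.join(s.split('\n')[1:-1])


def find_slice(s):
    # Locates the first and last newline, then makes a single slice copy.
    return s[s.find('\n') + 1:s.rfind('\n')]


def benchmark(s, number=20):
    """Return the best observed per-call time in seconds for each approach."""
    results = {}
    for func in (split_join, find_slice):
        timings = timeit.repeat(lambda: func(s), number=number, repeat=3)
        results[func.__name__] = min(timings) / number
    return results


# Roughly 10 MB of text: 100,000 lines of 100 characters each.
sample = '\n'.join(['a' * 100] * 10**5)
assert split_join(sample) == find_slice(sample)
for name, seconds in benchmark(sample).items():
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```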

Depending on how your use case will consume the string, the fastest way to remove the lines may be not to remove them at all.

If you plan to access the lines in the string sequentially, you can build a generator that skips the first and last lines and yields each line as it is consumed, rather than building a new set of copies of all the lines at once.

An ad hoc way to skip the first and last lines while iterating over the string, without generating unnecessary copies, is to keep track of three consecutive lines and return only the second one. This way the iteration finishes before reaching the last line, without needing to know the position of the last line break in advance.

The following function should give you the desired output:

def split_generator(s):
  # Keep track of start/end positions for three lines
  start_prev = end_prev = 0
  start = end = 0
  start_next = end_next = 0

  nr_lines = 0

  for idx, c in enumerate(s):
    if c == '\n':
      nr_lines += 1

      start_prev = start
      end_prev = end
      start = start_next
      end = end_next
      start_next = end_next
      end_next = idx

      if nr_lines >= 3:
        yield s[(start + 1) : end]

  # Handle the case when the input string does not end with "\n".
  # str.endswith also copes with an empty string, where s[-1] would raise IndexError.
  if not s.endswith('\n') and nr_lines >= 2:
    yield s[(start_next+1):end_next]

You can test it with:

print("1st example")
for filtered_strs in split_generator('first\nsecond\nthird'):
  print(filtered_strs)

print("2nd example")
for filtered_strs in split_generator('first\nsecond\nthird\n'):
  print(filtered_strs)

print("3rd example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth'):
  print(filtered_strs)

print("4th example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth\n'):
  print(filtered_strs)

print("5th example")
for filtered_strs in split_generator('first\nsecond\nthird\nfourth\nfifth'):
  print(filtered_strs)

This generates the output:

1st example
second
2nd example
second
3rd example
second
third
4th example
second
third
5th example
second
third
fourth

Note that the biggest advantage of this approach is that it creates only one new line at a time and takes virtually no time to produce the first line of output (rather than waiting for all the lines to be found before proceeding). Again, whether that is useful depends on your use case.
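
Since split_generator inspects every character in a Python-level loop, it is likely to be slow on a 10 MB string. The same lazy idea can be sketched with str.find and its start argument, so the scanning happens inside the string method. The name iter_inner_lines is hypothetical, and this variant deliberately differs from split_generator in one respect: it treats a trailing '\n' as being followed by an empty last line, which matches what split('\n')[1:-1] would give.

```python
def iter_inner_lines(s):
    """Lazily yield every line of s except the first and the last."""
    start = s.find('\n')
    if start == -1:
        return  # a single line: nothing lies between first and last
    start += 1  # position just after the first line break
    while True:
        end = s.find('\n', start)
        if end == -1:
            return  # s[start:] is the last line, so skip it
        yield s[start:end]
        start = end + 1


for line in iter_inner_lines('first\nsecond\nthird\nfourth'):
    print(line)
# second
# third
```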

Another method is to split the data at newlines and then rejoin everything but the first and last line:

>>> s = '*** START OF DATA *** \n\
... data\n\
... data\n\
... data\n\
... *** END OF DATA ***'
>>> '\n'.join(s.split('\n')[1:-1])
'data\ndata\ndata'

This works fine with no data:

>>> s = '*** START OF DATA *** \n\
... *** END OF DATA ***'
>>> '\n'.join(s.split('\n')[1:-1])
''

You could just split and then slice off the first and last items. Simple, Pythonic.

mydata = '''
data
data
data
'''

for data in mydata.split('\n')[1:-1]:
    print(data)
