Python regex: remove short lines

I have a string with multiple newline symbols:

text = 'foo\na\nb\n$\n\nxz\nbar'

I want to remove the lines that are shorter than 3 symbols. The desired output is

'foo\n\nbar'

I tried

re.sub(r'(\n([\s\S]{0,2})\n)+', '\nX\n', text, flags=re.S)

but this matches only some subset of the string and the result is

'foo\nX\nb\nX\nxz\nbar'

I somehow need to do a greedy search and replace the longest string matching the pattern.

re.S makes . match every character, including a newline, which you don't want here. Instead, use re.M (written inline as (?m) below) so that ^ matches at the beginning of the string and after each newline:

>>> import re
>>> text = 'foo\na\nb\n$\n\nxz\nbar'
>>> re.findall('(?m)^.{0,2}\n',text)
['a\n', 'b\n', '$\n', '\n', 'xz\n']
>>> re.sub('(?m)^.{0,2}\n','',text)
'foo\nbar'

That's "from start of a line, match 0-2 non-newline characters, followed by a newline".

I noticed your desired output contains \n\n. If that isn't a mistake and blank lines should be kept, use .{1,2} instead.
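As a quick check of the .{1,2} variant, here is a minimal sketch run against the string from the question. The variable name result is only for illustration.

```python
import re

text = 'foo\na\nb\n$\n\nxz\nbar'

# .{1,2} needs at least one character, so the empty line is left alone
# while the 1- and 2-character lines are removed.
result = re.sub(r'(?m)^.{1,2}\n', '', text)
print(repr(result))  # 'foo\n\nbar'
```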

You might also want to allow the final line of the string to have an optional terminating newline, for example:

>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nbar') # 3 symbols at end, no newline
'foo\nbar'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nbar\n') # 3 symbols at end, with newline
'foo\nbar\n'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nba\n') # < 3 symbols at end, with newline
'foo\n'
>>> re.sub('(?m)^.{0,2}$\n?','','foo\na\nb\n$\n\nxz\nba') # < 3 symbols at end, no newline
'foo\n'
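If the length threshold needs to vary, the same pattern can be wrapped in a small function. This is only a sketch: the name strip_short_lines and the min_len parameter are made up for illustration, and it assumes min_len is at least 1.

```python
import re

def strip_short_lines(text, min_len=3):
    """Remove lines shorter than min_len characters (hypothetical helper)."""
    # Build the {0,N} quantifier from the parameter; $\n? also handles
    # a last line that has no terminating newline.
    pattern = r'(?m)^.{0,%d}$\n?' % (min_len - 1)
    return re.sub(pattern, '', text)

print(repr(strip_short_lines('foo\na\nb\n$\n\nxz\nbar')))
print(repr(strip_short_lines('foo\na\nb\n$\n\nxz\nbar', min_len=2)))
```

With min_len=2 the two-character line xz is kept, while the one-character and empty lines are still removed.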

Perhaps you can use re.findall instead:

text = 'foo\na\nb\n$\n\nxz\nbar'

import re

print (repr("".join(re.findall(r"\n?\w{3,}\n?",text))))

Output
'foo\n\nbar'
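One caveat: \w matches only word characters (letters, digits and underscore), so a long line containing punctuation or spaces is not kept whole. A minimal sketch of the difference, using a made-up input and swapping in [^\n] as one possible alternative:

```python
import re

# Hypothetical input: the middle line has 5 characters but no run of 3 word characters.
text = 'foo\na-b-c\nbar'

# \w{3,} cannot match 'a-b-c', so that line is dropped.
word_only = "".join(re.findall(r"\n?\w{3,}\n?", text))

# [^\n]{3,} accepts any three or more non-newline characters.
any_chars = "".join(re.findall(r"\n?[^\n]{3,}\n?", text))

print(repr(word_only))
print(repr(any_chars))
```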

You can use this regex, which looks for any run of fewer than 3 non-newline characters that follows either the start of the string or a newline and is followed by a newline or the end of the string, and replace it with an empty string:

(^|\n)[^\n]{0,2}(?=\n|$)

In Python:

import re
text = 'foo\na\nb\n$\n\nxz\nbar'
print(re.sub(r'(^|\n)[^\n]{0,2}(?=\n|$)', '', text))

Output

foo
bar
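The same pattern can be written with re.VERBOSE so that each part carries its own comment. This is just a sketch; the name SHORT_LINE is chosen for illustration.

```python
import re

SHORT_LINE = re.compile(r"""
    (^|\n)        # start of the string, or the newline before the line
    [^\n]{0,2}    # at most two non-newline characters
    (?=\n|$)      # then a newline or the end of the string (not consumed)
""", re.VERBOSE)

text = 'foo\na\nb\n$\n\nxz\nbar'
print(repr(SHORT_LINE.sub('', text)))  # 'foo\nbar'
```

Because the pattern consumes the newline before a short line rather than the one after it, a short first line leaves its trailing newline behind: 'a\nfoo' becomes '\nfoo'.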


There's no need to use regex for this.

raw_str = 'foo\na\nb\n$\n\nxz\nbar'

str_res = '\n'.join([curr for curr in raw_str.splitlines() if len(curr) >= 3])

print(str_res)

Output

foo
bar
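Note that '\n'.join(...) drops a trailing newline if the input had one. If that matters, a variant that keeps each line's own ending could look like the sketch below. The function name remove_short_lines is made up for illustration, and it assumes lines end in \n or \r\n.

```python
def remove_short_lines(text, min_len=3):
    """Drop lines with fewer than min_len characters, not counting the line ending."""
    kept = [
        line
        for line in text.splitlines(keepends=True)  # each line keeps its ending
        if len(line.rstrip('\r\n')) >= min_len      # measure without the ending
    ]
    return ''.join(kept)

print(repr(remove_short_lines('foo\na\nb\n$\n\nxz\nbar')))
print(repr(remove_short_lines('foo\na\nb\n$\n\nxz\nbar\n')))
```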
