Python Regex remove comments or numbers in brackets

I am trying to remove line numbers and comments using regex, but it does not work just yet:

import re
string = """(1) At what time.!? [asdf] School-
(2) bus. So late, already.!? [ghjk]"""

#res = re.sub(r"[\(\[].*?[\)\]]", "", string)

res = re.sub("(\d+) ","", string)
res = re.sub("[.*]","", res)
res = re.sub(r"-\s","", res)
res = re.sub(r"[^\w\säüöß]","", res)
res = re.sub("-\n","", res)
print(res.split())

I first tried to remove everything in () and [] with the commented-out line, but that leaves a whitespace at the start of each line. So I split the task up and came up with the five re.sub calls above.

The result should look like this:

['At', 'what', 'time', 'Schoolbus', 'So', 'late', 'already']

The line numbers are not removed, although they are in () and should be gone. This in turn breaks the re.sub() call that is supposed to join words hyphenated across a line break (School- bus to Schoolbus).

You may use this sub + findall solution:

import re

string = """(1) At what time.!? [asdf] School-
(2) bus. So late, already.!? [ghjk]"""

print(re.findall(r'\b\w+(?:-\w+)*', re.sub(r'(\([^)]*\)|\[[^]]*\]|-)\s*', '', string)))

Output:

['At', 'what', 'time', 'Schoolbus', 'So', 'late', 'already']

Details:

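  • re.sub(r'(\([^)]*\)|\[[^]]*\]|-)\s*', '', string) : Removes every (...) group, every [...] group and every -, each together with any whitespace that follows. Since \s* also matches the newline, School- and bus are joined into Schoolbus
  • \b\w+(?:-\w+)* : Matches 1+ word characters starting at a word boundary, optionally followed by hyphen-joined word parts (none remain here, because the sub call has already removed every -)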
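
For comparison, here is a minimal sketch of the same idea split into separate steps. It also shows why the patterns in the question did not match: unescaped, (\d+) is a capturing group around digits rather than literal parentheses, and [.*] is a character class that matches a literal . or * rather than a bracketed comment. The helper name clean_words is made up for illustration, and joining only on a hyphen followed by whitespace is an assumption about what the input looks like.

```python
import re

string = """(1) At what time.!? [asdf] School-
(2) bus. So late, already.!? [ghjk]"""

# Unescaped, "(\d+) " is a capturing group: it looks for digits directly
# followed by a space, so "(1) " never matches and the line numbers survive.
print(re.sub(r"(\d+) ", "", string) == string)  # True: nothing was removed

# Unescaped, "[.*]" is a character class: it deletes literal "." and "*" only.
print(re.sub(r"[.*]", "", "a.b* [asdf]"))  # ab [asdf]


def clean_words(text):
    """Hypothetical helper: drop (...) and [...] groups, join words
    hyphenated across whitespace, and return the remaining words."""
    text = re.sub(r"\([^)]*\)", "", text)   # drop (...) line numbers
    text = re.sub(r"\[[^\]]*\]", "", text)  # drop [...] comments
    text = re.sub(r"-\s+", "", text)        # join "School-\n bus" into "Schoolbus"
    # Keep runs of word characters; an in-line hyphen such as "well-known"
    # is preserved because it is not followed by whitespace.
    return re.findall(r"\w+(?:-\w+)*", text)


print(clean_words(string))
# ['At', 'what', 'time', 'Schoolbus', 'So', 'late', 'already']
```

Unlike the one-liner above, this variant removes a hyphen only when whitespace follows it, so a word such as well-known would stay intact. Whether that is preferable depends on the text being cleaned.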
