I am trying to remove line numbers and comments using regex, but it does not work just yet:
import re
string = """(1) At what time.!? [asdf] School-
(2) bus. So late, already.!? [ghjk]"""
#res = re.sub(r"[\(\[].*?[\)\]]", "", string)
res = re.sub("(\d+) ","", res)
res = re.sub("[.*]","", res)
res = re.sub(r"-\s","", res)
res = re.sub(r"[^\w\säüöß]","", res)
res = re.sub("-\n","", res)
print(res.split())
I first tried to remove anything in brackets, () and [], with the commented-out line, but that left me with whitespace at the start of each line. So I split it up and came up with the five re.sub calls.
The result should look like this:
['At', 'what', 'time', 'Schoolbus', 'So', 'late', 'already']
I am stuck with the line numbers not being removed, although they are in () and should be gone. That in turn keeps my re.sub() for joining words split by "-" (School- bus to Schoolbus) from working as well.
You may use this sub + findall solution:
import re
string = """(1) At what time.!? [asdf] School-
(2) bus. So late, already.!? [ghjk]"""
print (re.findall(r'\b\w+(?:-\w+)*', re.sub(r'(\([^)]*\)|\[[^]]*\]|-)\s*', '', string)))
Output:
['At', 'what', 'time', 'Schoolbus', 'So', 'late', 'already']
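To see what each half of that one-liner contributes, the two calls can be split apart. This is only a sketch of the same approach with the same two patterns; the names `NOISE`, `WORD`, and `extract_words` are made up here for illustration.

```python
import re

# Hypothetical names, for illustration only.
# A "(...)" chunk, a "[...]" chunk, or a hyphen, plus any whitespace after it.
NOISE = re.compile(r'(\([^)]*\)|\[[^]]*\]|-)\s*')
# Runs of word characters; punctuation is simply never matched.
WORD = re.compile(r'\b\w+(?:-\w+)*')

def extract_words(text):
    cleaned = NOISE.sub('', text)   # step 1: drop line numbers, comments, hyphens
    return WORD.findall(cleaned)    # step 2: collect the remaining words

string = """(1) At what time.!? [asdf] School-
(2) bus. So late, already.!? [ghjk]"""

print(repr(NOISE.sub('', string)))  # the intermediate string after step 1
print(extract_words(string))
```

Printing the intermediate string shows that "School-" and "bus" are already joined before findall runs, because the whitespace consumed after the hyphen includes the line break and the following "(2) " is removed by the next match.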
Details:
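re.sub(r'(\([^)]*\)|\[[^]]*\]|-)\s*', '', string): Removes every (...) or [...] substring and every -, together with the 0 or more whitespace characters that follow it (this includes the line break after School-)
\b\w+(?:-\w+)*: Matches 1+ word characters starting at a word boundary, optionally followed by further hyphen-joined parts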
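As for why the attempt in the question leaves the line numbers in place: in a regex, unescaped parentheses form a group and unescaped square brackets form a character class. So `(\d+) ` looks for digits followed by a space, and `[.*]` looks for a single literal `.` or `*`. A small check against the question's sample text, as a sketch:

```python
import re

string = """(1) At what time.!? [asdf] School-
(2) bus. So late, already.!? [ghjk]"""

# Unescaped: "(\d+) " is a group around \d+ followed by a space.
# In the sample every digit is followed by ")", so nothing matches.
unescaped_parens = re.findall(r"(\d+) ", string)

# Escaped: "\(" and "\)" match literal parentheses.
escaped_parens = re.findall(r"\(\d+\) ", string)

# Unescaped: "[.*]" is a character class matching one "." or one "*".
unescaped_brackets = re.findall(r"[.*]", string)

# Escaped: "\[.*?\]" matches a literal "[...]" chunk.
escaped_brackets = re.findall(r"\[.*?\]", string)

print(unescaped_parens)     # []
print(escaped_parens)       # ['(1) ', '(2) ']
print(unescaped_brackets)   # ['.', '.', '.']
print(escaped_brackets)     # ['[asdf]', '[ghjk]']
```

This is also why the solution above writes `\(` and `\[` with backslashes.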