Python Regex remove comments or numbers in brackets
I am trying to remove line numbers and comments using regex, but it does not work yet:
import re
string = """(1) At what time.!? [asdf] School-
(2) bus. So late, already.!? [ghjk]"""
#res = re.sub(r"[\(\[].*?[\)\]]", "", string)
res = re.sub("(\d+) ","", res)
res = re.sub("[.*]","", res)
res = re.sub(r"-\s","", res)
res = re.sub(r"[^\w\säüöß]","", res)
res = re.sub("-\n","", res)
print(res.split())
So I was trying to remove anything in brackets () and [] with my #commented line, but then I am stuck with a whitespace at the start of each line. Then I decided to split it up and came up with the five re.sub calls.
Result should be like this:
['At', 'what', 'time', 'Schoolbus', 'So', 'late', 'already']
I am stuck with the line numbers not being removed, although they are in () and should be gone. This in turn causes my re.sub() call for joining words split by "-" (School- bus to Schoolbus) to fail as well.
You may use this sub + findall solution:
import re
string = """(1) At what time.!? [asdf] School-
(2) bus. So late, already.!? [ghjk]"""
print (re.findall(r'\b\w+(?:-\w+)*', re.sub(r'(\([^)]*\)|\[[^]]*\]|-)\s*', '', string)))
Output:
['At', 'what', 'time', 'Schoolbus', 'So', 'late', 'already']
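To see what each half of the one-liner contributes, the two calls can be separated. This is only a sketch of the same approach; the names cleaned and words are introduced here for illustration and are not part of the original answer.

```python
import re

string = """(1) At what time.!? [asdf] School-
(2) bus. So late, already.!? [ghjk]"""

# Step 1: drop (...) groups, [...] groups and hyphens,
# each together with any whitespace (including newlines) that follows.
cleaned = re.sub(r'(\([^)]*\)|\[[^]]*\]|-)\s*', '', string)
print(repr(cleaned))  # 'At what time.!? Schoolbus. So late, already.!? '

# Step 2: collect the remaining words, skipping punctuation.
words = re.findall(r'\b\w+(?:-\w+)*', cleaned)
print(words)
```

Printing the intermediate string shows that the substitution already joins "School-" and "bus" across the line break, so findall only has to skip the punctuation.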
Details:
re.sub(r'(\([^)]*\)|\[[^]]*\]|-)\s*', '', string): Removes every (...) group, every [...] group, and every -, each together with the 0 or more whitespace characters (including newlines) that follow it
\b\w+: Matches 1 or more word characters starting at a word boundary
(?:-\w+)*: Optionally matches further parts joined by a hyphen, i.e. 0 or more repetitions of - followed by 1 or more word characters
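As for why the step-by-step attempt in the question leaves the line numbers in place: an unescaped (\d+) is a capturing group rather than literal parentheses, and [.*] is a character class that matches a single "." or "*". The sketch below demonstrates both points and then shows one possible corrected version of the step-by-step approach. It is an illustration under those assumptions, not the answer's own method; the names group_hits and class_hits are made up for this example.

```python
import re

string = """(1) At what time.!? [asdf] School-
(2) bus. So late, already.!? [ghjk]"""

# Unescaped parentheses form a group: this looks for digits followed by a
# space, and in "(1) " the digit is followed by ")" instead.
group_hits = re.findall(r"(\d+) ", string)
print(group_hits)   # []

# [.*] is a character class: it matches one literal "." or "*".
class_hits = re.findall(r"[.*]", string)
print(class_hits)   # ['.', '.', '.']

# Corrected steps with the brackets escaped.
res = re.sub(r"\(\d+\)\s*", "", string)  # literal "(", digits, literal ")"
res = re.sub(r"\[.*?\]\s*", "", res)     # literal "[", shortest content, literal "]"
res = re.sub(r"-\s+", "", res)           # re-join a word hyphenated across a line break
res = re.sub(r"[^\w\s]", "", res)        # drop remaining punctuation
print(res.split())
```

In Python 3, \w already matches ä, ü, ö and ß for str patterns, so they do not need to be listed separately in the last character class.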