Split text at a space that is preceded by a non-letter character
Since I couldn't find a solution on the internet, I thought I'd ask my question here.
I want to split a given text at every punctuation mark, so not only after every sentence but also after a comma, for example. So far I have come across the Natural Language Toolkit (nltk) and regular expressions, but I have had no success with them.
This works quite well, but does not completely fulfil my expectations:
import nltk  # sent_tokenize also needs NLTK's Punkt tokenizer data to be installed

sample_text = ("With this example I wanna make the point clear... I hope you get it! "
               "There are many coding languages out there, but which is the best? "
               "I would say there's no best. Change my mind - if you can!")
split_text = nltk.tokenize.sent_tokenize(sample_text)
print(split_text)
#Output: ['With this example I wanna make the point clear...', 'I hope you get it!', 'There are many coding languages out there, but which is the best?', "I would say there's no best.", 'Change my mind - if you can!']
This is already quite okay, but I would prefer an output that also splits the text at commas or hyphens. So the output would look like this:
[
'With this example I wanna make the point clear...',
'I hope you get it!',
'There are many coding languages out there,',
'but which is the best?',
"I would say there's no best.",
'Change my mind -',
'if you can!'
]
It's probably better to use regular expressions, isn't it? But somehow I can't get it working.
Thanks in advance, I appreciate any help!
Regular expressions work well for this. Try using this expression with re.split():
[!"\#$%&'()*+,\-.\/:;<=>?@\[\\\]^_'{|}~]
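One way this suggestion could be applied is sketched below. Splitting directly on punctuation would discard the punctuation marks and also break "there's" at the apostrophe, so this sketch instead splits on whitespace that follows a punctuation character. Building the character class from string.punctuation and the names punct_class and pattern are assumptions made for illustration; they are not part of the answer above.

```python
import re
import string

sample_text = ("With this example I wanna make the point clear... I hope you get it! "
               "There are many coding languages out there, but which is the best? "
               "I would say there's no best. Change my mind - if you can!")

# Character class covering ASCII punctuation; re.escape makes characters such as
# "]", "\" and "-" safe to place inside the brackets.
punct_class = "[" + re.escape(string.punctuation) + "]"

# Split on whitespace only where the preceding character is punctuation.
# The lookbehind keeps the punctuation attached to the chunk before the split.
pattern = re.compile(r"(?<=" + punct_class + r")\s+")

split_text = pattern.split(sample_text)
print(split_text)
```

Because the apostrophe in "there's" is followed by a letter rather than whitespace, that word stays intact, and the hyphen stays at the end of 'Change my mind -'.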
You could split the string on a space which is not preceded by a letter:
import re
# (?<=[^a-z]) is a lookbehind: the space must follow a non-letter; re.I also covers A-Z
split_text = re.split(r'(?<=[^a-z]) ', sample_text, flags=re.I)
print(split_text)
Output:
[
'With this example I wanna make the point clear...',
'I hope you get it!',
'There are many coding languages out there,',
'but which is the best?',
"I would say there's no best.",
'Change my mind -',
'if you can!'
]
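One caveat with the pattern above is that [^a-z] also matches digits, so a space after a number becomes a split point too. The sketch below compares it with a possible variant that only splits after characters that are neither word characters nor whitespace. The names NON_LETTER, NON_WORD and the example sentence are hypothetical and chosen only for illustration.

```python
import re

# Pattern from the answer: a space preceded by anything other than a-z / A-Z.
NON_LETTER = re.compile(r"(?<=[^a-z]) ", flags=re.I)

# Assumed alternative: whitespace preceded by a character that is neither a
# word character (letter, digit, underscore) nor whitespace.
NON_WORD = re.compile(r"(?<=[^\w\s])\s+")

text = "Version 3 is out, version 4 follows - maybe."

by_non_letter = NON_LETTER.split(text)
by_non_word = NON_WORD.split(text)

print(by_non_letter)  # also splits after the digits 3 and 4
print(by_non_word)    # splits only after "," and "-"
```

Which variant fits better depends on the text: the original pattern is enough for input without digits, while the second one avoids splitting after numbers.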