Split text with a space that is preceded with a non-letter char

Since I couldn't find any solution on the internet, I thought I'd ask my question here.

I want to split a given text at every punctuation mark. So not only after every sentence, but also after a comma, for example. So far I have come across the Natural Language Toolkit (nltk) and regular expressions, but I had no success with them.

This works quite well, but does not completely fulfil my expectations:

import nltk  # sent_tokenize needs NLTK's Punkt tokenizer data to be downloaded first

# Adjacent string literals are joined, so the text contains no embedded newline
sample_text = ("With this example I wanna make the point clear... I hope you get it! There are many coding "
               "languages out there, but which is the best? I would say there's no best. Change my mind - if you can!")

split_text = nltk.tokenize.sent_tokenize(sample_text)
print(split_text)

#Output: ['With this example I wanna make the point clear...', 'I hope you get it!', 'There are many coding languages out there, but which is the best?', "I would say there's no best.", 'Change my mind - if you can!']

This is quite okay already, but I would prefer an output that also splits the text at commas and hyphens. So the output would look like this:

[
 'With this example I wanna make the point clear...',
 'I hope you get it!',
 'There are many coding languages out there,',
 'but which is the best?',
 "I would say there's no best.",
 'Change my mind -',
 'if you can!'
]

It's probably better to use regular expressions, isn't it? But somehow I can't get it working. Thanks in advance, I appreciate any help!

Regular expressions work well here. Try using this expression in .split():

[!"\\#$%&'()*+,\\-.\\/:;<=>?@\\[\\\\\\]^_'{|}~]
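The expression above seems to be written with doubled backslashes, as it would appear inside a non-raw string literal. As a minimal sketch of the same idea in Python, the character class can instead be built from `string.punctuation` (an assumption: only ASCII punctuation matters) and wrapped in a lookbehind so that the punctuation stays attached to the preceding chunk. The helper name `split_after_punctuation` is hypothetical.

```python
import re
import string

# Character class built from string.punctuation; re.escape makes every
# character safe inside the brackets (assumption: ASCII punctuation only).
PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"

def split_after_punctuation(text):
    """Split on whitespace that directly follows a punctuation character."""
    # The lookbehind keeps the punctuation attached to the preceding chunk.
    return re.split(r"(?<=" + PUNCT_CLASS + r")\s+", text)

sample = "There are many coding languages out there, but which is the best? Change my mind - if you can!"
parts = split_after_punctuation(sample)
print(parts)
```

Splitting directly on the character class, without the lookbehind, would drop the punctuation marks from the result, which is not what the question asks for.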

You could split the string on a space that is not preceded by a letter:

import re

# Split on a space that is preceded by a non-letter (re.I makes a-z cover A-Z too)
split_text = re.split(r'(?<=[^a-z]) ', sample_text, flags=re.I)
print(split_text)

Output:

[
 'With this example I wanna make the point clear...',
 'I hope you get it!',
 'There are many coding languages out there,',
 'but which is the best?',
 "I would say there's no best.",
 'Change my mind -',
 'if you can!'
]
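One thing to keep in mind with this pattern is that `[^a-z]` also matches digits, so a space after a number becomes a split point too. The sketch below compares the pattern from the answer with a possible variant that only splits after characters that are neither word characters nor whitespace. The function names and the test sentence are illustrative and not part of the original answer.

```python
import re

def split_strict(text):
    # Same pattern as in the answer above: a space preceded by any non-letter.
    return re.split(r"(?<=[^a-z]) ", text, flags=re.I)

def split_after_symbols(text):
    # Hypothetical variant: whitespace preceded by a character that is
    # neither a word character (letter, digit, underscore) nor whitespace.
    return re.split(r"(?<=[^\w\s])\s+", text)

text = "I have 3 apples, 2 pears"
print(split_strict(text))         # ['I have 3', 'apples,', '2', 'pears']
print(split_after_symbols(text))  # ['I have 3 apples,', '2 pears']
```

Which behaviour is preferable depends on the input. For the sample text in the question, which contains no digits, both patterns should give the same result.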
