Determine word boundaries in a Python string
I have file paths in the following format:
THISISSOMEMOVIE.mov
Is there an NLP library that can make well-informed, statistical guesses about the word boundaries in a string? For example, the above should be parsed as:
THIS IS SOME MOVIE mov
I don't know of a library that does exactly that, but you could use PyEnchant, which tells you whether a word is in the dictionary.
So here's the pseudocode of what I'd do:
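s = 0
i = len(title) - 1
repeat until s > len(title) - 1:
    check if the substring title[s..i] (both ends inclusive) is in the dictionary
    if not, i = i - 1 (if i drops below s, no word starts at s, so skip that character)
    if yes, record the word, then s becomes i + 1 and i = len(title) - 1 again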
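The pseudocode above can be turned into a small runnable sketch. This is only one possible reading of it, not a tested solution. The names split_words, is_word and WORDS are hypothetical, and the toy word set stands in for a real dictionary so the example runs without extra packages. With PyEnchant installed, the check could presumably be swapped for enchant.Dict("en_US").check.

```python
import os

# Hypothetical toy vocabulary, used only so the example is self-contained.
# With PyEnchant, enchant.Dict("en_US").check could be passed in instead.
WORDS = {"THIS", "IS", "SOME", "MOVIE"}


def is_word(candidate):
    """Return True if the candidate is in the toy vocabulary."""
    return candidate.upper() in WORDS


def split_words(title, is_word=is_word):
    """Greedy longest-match split: at each start position, take the longest
    prefix that the dictionary accepts, then continue right after it."""
    words = []
    s = 0
    while s < len(title):
        # Try the longest candidate first, then shrink it from the right.
        for i in range(len(title), s, -1):
            if is_word(title[s:i]):
                words.append(title[s:i])
                s = i
                break
        else:
            # No dictionary word starts at s: keep the character and move on.
            words.append(title[s])
            s += 1
    return words


# Split the extension off first so only the file name stem is segmented.
stem, ext = os.path.splitext("THISISSOMEMOVIE.mov")
result = " ".join(split_words(stem) + [ext.lstrip(".")])
print(result)
```

Because the match is greedy, a full dictionary can pick a long word that leaves an unsplittable remainder, so the output should be treated as a guess rather than a guaranteed segmentation.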