
Determine word boundaries in a Python string

I have file paths in the following format:

THISISSOMEMOVIE.mov

Is there an NLP library that can make educated, statistical guesses about the word boundaries in a string? For example, the above should be parsed as:

THIS IS SOME MOVIE mov

I don't know of a library that does exactly that, but you could use PyEnchant, which tells you whether a word belongs to the dictionary.

So here's the pseudocode of what I'd do (try the longest substring first and shrink it until it is a dictionary word):

 s = 0
 i = len(title)
 while s < len(title):
    check if the substring title[s:i] is in the dictionary
    if not (and it is longer than one character), i = i - 1
    if yes (or only one character is left), emit it; s becomes i, and i = len(title) again
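
The pseudocode above can be sketched in Python roughly as follows. This is only a minimal sketch of the greedy longest-match idea: the names `split_words`, `split_filename`, and `SAMPLE_WORDS` are made up for illustration, and the tiny word set is a stand-in for a real dictionary lookup so the example runs without any extra packages.

```python
import os

# Hypothetical stand-in vocabulary; a real dictionary lookup would replace this.
SAMPLE_WORDS = {"this", "is", "some", "movie"}


def in_sample_dictionary(word):
    """Return True if the word is in the stand-in vocabulary (case-insensitive)."""
    return word.lower() in SAMPLE_WORDS


def split_words(title, is_word=in_sample_dictionary):
    """Greedy longest-match segmentation of a string without spaces.

    Starting at position s, try the longest remaining substring first and
    shrink it from the right until is_word accepts it. If nothing matches,
    a single character is emitted so the loop always makes progress.
    """
    words = []
    s = 0
    while s < len(title):
        for i in range(len(title), s, -1):
            if is_word(title[s:i]) or i == s + 1:
                words.append(title[s:i])
                s = i
                break
    return words


def split_filename(path, is_word=in_sample_dictionary):
    """Segment the file name stem and append the extension as the last token."""
    stem, ext = os.path.splitext(path)
    tokens = split_words(stem, is_word)
    if ext:
        tokens.append(ext.lstrip("."))
    return tokens


print(" ".join(split_filename("THISISSOMEMOVIE.mov")))
```

With PyEnchant installed, something like `enchant.Dict("en_US").check` could be passed as `is_word` instead of the stand-in set. Keep in mind that a greedy match is not statistical: it always takes the longest dictionary word it finds, so it can pick the wrong split when a longer word happens to swallow the start of the next one.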
