
Detect english words in text

I have a crawled dataset, but it also contains entries with a lot of junk in them.

Name: sdfsdfsdfsd
Location: asdfdgdfjkgdsfjs
Education: Science & Literature 

Currently it's being stored in MySQL and Solr.
Is there any library that can look for English words in these fields so that I may eliminate the garbage values? I believe it would need a dictionary, and the default Unix dictionary in /usr/share/dict/ seems sufficient for this use-case.

with open('/usr/share/dict/words') as f:
    words = set(word.lower() for word in f.read().split()
                # Really short words aren't much of an indication
                if len(word) > 3)

def is_english(text):
    return bool(words.intersection(text.lower().split()))
    # Equivalent alternative:
    # return any(word in words for word in text.lower().split())

print(is_english('usfdbg dsuyfbg cat'))      # False: 'cat' was filtered out as too short
print(is_english('Science & Literature'))    # True: 'science' and 'literature' are in the dictionary
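A single dictionary hit may be too lenient for junk filtering: an entry like "cat asdkjh qwerty zzxcv" would pass. One possible refinement is to require a minimum fraction of recognized tokens. The sketch below assumes a small illustrative word set (`SAMPLE_WORDS`); in practice you would build the set from /usr/share/dict/words as shown above.

```python
# Illustrative word set; replace with the set loaded from /usr/share/dict/words.
SAMPLE_WORDS = {"science", "literature", "detect", "english", "words", "text"}

def english_ratio(text, words=SAMPLE_WORDS):
    """Return the fraction of whitespace-separated tokens found in `words`."""
    # Strip common punctuation so tokens like '&' or 'words,' don't skew the count.
    tokens = [t.strip('&.,!?').lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in words for t in tokens) / len(tokens)

def is_mostly_english(text, threshold=0.5):
    """Accept an entry only if at least `threshold` of its tokens are real words."""
    return english_ratio(text) >= threshold

print(is_mostly_english('Science & Literature'))  # True: both real tokens match
print(is_mostly_english('sdfsdfsdfsd'))           # False: no token matches
```

The threshold is a tunable assumption: 0.5 keeps fields where at least half the tokens are dictionary words, which rejects the all-junk "Name" and "Location" examples while keeping "Education".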
