Can you use a dictionary (text file) for regex tokenization?
I was wondering whether a text file can be used to drive tokenization. For example, suppose there is a file (a dictionary), and before tokenizing you first check the dictionary to decide what counts as a single token.
E.g.:
Dict_list = [environment test, apple cat, test rest]
Text: The environment test is the best apple in the world apple cat is the test rest.
Assume the text is large and the dictionary is also large. A plain tokenizer would split on spaces, but I need to tokenize the whole text while checking dict_list to see whether a run of words should be kept together as one token.
So the tokens should be:
Tokens: "The", "environment test", "is", "the", "best", "apple", "in", "the", "world", "apple cat", "is", "the", "test rest".
I hope this makes sense.
Thank you in advance.
With the nltk.tokenize package you can do this easily: MWETokenizer merges multi-word expressions into single tokens. For example:
>>> from nltk.tokenize import MWETokenizer
>>> tokenizer = MWETokenizer([('a', 'little'), ('a', 'little', 'bit'), ('a', 'lot')])
>>> tokenizer.tokenize('Testing testing testing one two three'.split())
['Testing', 'testing', 'testing', 'one', 'two', 'three']
>>> tokenizer = MWETokenizer([('hors', "d'oeuvre")], separator='+')
>>> tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
['An', "hors+d'oeuvre", 'tonight,', 'sir?']
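Applied to the data from the question, the same tokenizer might look like the sketch below. It is only an illustration under a few assumptions: the dictionary is taken to hold one phrase per line, load_phrases is a hypothetical helper (not part of NLTK), and an in-memory io.StringIO stands in for the real dictionary file. The separator is set to a space so merged tokens read as in the question, and punctuation is split off first so that the final "rest." still matches the dictionary entry.

```python
import io
import re

from nltk.tokenize import MWETokenizer


def load_phrases(lines):
    """Hypothetical helper: one dictionary phrase per non-empty line."""
    return [line.strip() for line in lines if line.strip()]


# Stand-in for an opened dictionary file; the one-phrase-per-line layout is an assumption.
dictionary_file = io.StringIO("environment test\napple cat\ntest rest\n")
phrases = load_phrases(dictionary_file)

# MWETokenizer expects each multi-word expression as a tuple of words.
tokenizer = MWETokenizer([tuple(p.split()) for p in phrases], separator=" ")

text = "The environment test is the best apple in the world apple cat is the test rest."

# Separate words from punctuation so "rest." does not block the "test rest" match.
words = re.findall(r"\w+|[^\w\s]", text)
tokens = tokenizer.tokenize(words)
print(tokens)
```

Because "best apple" is not in the dictionary, "best" and "apple" stay separate tokens, and the trailing period comes out as its own token.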
Here is another way, though it is a workaround (Python 3 version):
from nltk.tokenize import regexp_tokenize

sent = "I like apple fruit but grape fruit more"
dict_list = ["apple fruit", "grape fruit"]

# Map each dictionary phrase to an underscore-joined placeholder.
newdict = {}
for item in dict_list:
    newdict[item] = item.replace(" ", "_")

# Substitute the placeholders so each phrase survives whitespace splitting.
for key, val in newdict.items():
    if key in sent:
        sent = sent.replace(key, val)

res = regexp_tokenize(sent, pattern=r'\S+')
print(res)
Output:
['I', 'like', 'apple_fruit', 'but', 'grape_fruit', 'more']
You can then replace the underscores with spaces if you wish.
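If the underscore round trip is unwanted, a single regular expression can do the same job directly. The following is a minimal sketch using only the standard re module, not the answer's own method: build_tokenizer is a hypothetical name, and the pattern tries the dictionary phrases first (longest first, on word boundaries) before falling back to any run of non-space characters.

```python
import re


def build_tokenizer(phrases):
    """Hypothetical helper: return a function that tokenizes with phrase priority."""
    if not phrases:
        # No dictionary entries: plain whitespace tokenization.
        return re.compile(r"\S+").findall
    # Longest phrases first, so overlapping entries prefer the longer match.
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    # A dictionary phrase on word boundaries, otherwise any run of non-space characters.
    pattern = re.compile(r"\b(?:" + alternation + r")\b|\S+")
    return pattern.findall


tokenize = build_tokenizer(["apple fruit", "grape fruit"])
tokens = tokenize("I like apple fruit but grape fruit more")
print(tokens)
```

Unlike plain str.replace, this does not merge a phrase that only appears inside a longer word: "pineapple fruit" stays as two tokens, because the scan consumes "pineapple" as a whole before "apple fruit" is ever tried.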