
Can you use a dictionary (text file) for regex tokenization?

I was wondering if we can use a text file as a means for tokenization. For example, say there is a file (a dictionary), and when you tokenize you first check that dictionary.

E.g.:

Dict_list = [environment test, apple cat, test rest]

Text: The environment test is the best apple in the world apple cat is the test rest.

Assume the text is big and the dictionary is also big. A plain tokenizer would split on spaces, but I need to tokenize the whole text while checking dict_list to see whether a multi-word entry should stay as one token.

So the tokens should be:

Token : "The", "environment test", "is", "the", "best apple", "in", "the", "world", "apple cat", "is", "the", "test rest". 令牌:“ The”,“环境测试”,“ is”,“ the”,“ best apple”,“ in”,“ the”,“ world”,“ apple cat”,“ is”,“ the”,“测试休息”。

I hope this makes sense.

Thank you in advance.

With the nltk.tokenize package you can easily do this, using MWETokenizer. For example:

>>> from nltk.tokenize import MWETokenizer
>>> tokenizer = MWETokenizer([('a', 'little'), ('a', 'little', 'bit'), ('a', 'lot')])
>>> tokenizer.tokenize('Testing testing testing one two three'.split())
['Testing', 'testing', 'testing', 'one', 'two', 'three']

>>> tokenizer = MWETokenizer([('hors', "d'oeuvre")], separator='+')
>>> tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
['An', "hors+d'oeuvre", 'tonight,', 'sir?']

This is another way, though more of a workaround:

Python 3 version:

from nltk.tokenize import regexp_tokenize

sent = "I like apple fruit but grape fruit more"
dict_list = ["apple fruit", "grape fruit"]

# Map each multi-word entry to an underscore-joined form,
# e.g. "apple fruit" -> "apple_fruit".
newdict = {}
for item in dict_list:
    dk = item.replace(" ", "_")
    newdict[item] = dk

# Rewrite the sentence so each dictionary phrase becomes a single
# whitespace-free token.
for key, val in newdict.items():
    if key in sent:
        sent = sent.replace(key, val)

# Tokenize on runs of non-whitespace characters (raw string avoids an
# invalid escape-sequence warning).
res = regexp_tokenize(sent, pattern=r'\S+')
print(res)

Output:

['I', 'like', 'apple_fruit', 'but', 'grape_fruit', 'more']

You can then replace the underscores with spaces if you wish.
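
For example, a quick sketch reusing the res list from above:

tokens = [tok.replace("_", " ") for tok in res]
print(tokens)
# ['I', 'like', 'apple fruit', 'but', 'grape fruit', 'more']

Bear in mind this also converts any underscores that were already in the original text, so a rarer separator may be safer for real data.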
