
Can you use a dictionary (text file) for regex tokenization?

I was wondering if we can use a text file as a means for tokenization. For example, say there is a file (a dictionary), and when you tokenize you first check that dictionary.

E.g.:

Dict_list = [environment test, apple cat, test rest]

Text: The environment test is the best apple in the world apple cat is the test rest.

Assume the text is big and the dictionary is also big. A plain tokenizer would split on spaces, but I need to tokenize the whole text while checking dict_list to see whether a multi-word entry should stay as one token.

So the tokens should be:

Token : "The", "environment test", "is", "the", "best apple", "in", "the", "world", "apple cat", "is", "the", "test rest". 令牌:“ The”,“环境测试”,“ is”,“ the”,“ best apple”,“ in”,“ the”,“ world”,“ apple cat”,“ is”,“ the”,“测试休息”。

I hope this makes sense.

Thank you in advance.

With the nltk.tokenize package you can easily do this, using MWETokenizer. For example:

>>> from nltk.tokenize import MWETokenizer
>>> tokenizer = MWETokenizer([('a', 'little'), ('a', 'little', 'bit'), ('a', 'lot')])
>>> tokenizer.tokenize('Testing testing testing one two three'.split())
['Testing', 'testing', 'testing', 'one', 'two', 'three']

>>> tokenizer = MWETokenizer([('hors', "d'oeuvre")], separator='+')
>>> tokenizer.tokenize("An hors d'oeuvre tonight, sir?".split())
['An', "hors+d'oeuvre", 'tonight,', 'sir?']

This is another way, though more of a workaround:

Python 3 version:

from nltk.tokenize import regexp_tokenize

sent = "I like apple fruit but grape fruit more"
dict_list = ["apple fruit", "grape fruit"]

# Map each multi-word entry to an underscore-joined form,
# e.g. "apple fruit" -> "apple_fruit".
newdict = {}
for item in dict_list:
    dk = item.replace(" ", "_")
    newdict[item] = dk

# Rewrite the sentence so each dictionary phrase becomes a single
# whitespace-free token.
for key, val in newdict.items():
    if key in sent:
        sent = sent.replace(key, val)

# Tokenize on runs of non-whitespace characters (raw string avoids an
# invalid escape-sequence warning).
res = regexp_tokenize(sent, pattern=r'\S+')
print(res)

Output:

['I', 'like', 'apple_fruit', 'but', 'grape_fruit', 'more']

You can then replace the underscores with spaces if you wish.
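
For example, a quick sketch reusing the res list from above:

tokens = [tok.replace("_", " ") for tok in res]
print(tokens)
# ['I', 'like', 'apple fruit', 'but', 'grape fruit', 'more']

Bear in mind this also converts any underscores that were already in the original text, so a rarer separator may be safer for real data.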
