Word tokenization takes too long to run

Question

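I use the PyThaiNLP package to tokenize my Thai data for sentiment analysis. First, I built a function that adds a set of new words to the dictionary and tokenizes the text: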

from pythainlp.corpus.common import thai_words
from pythainlp.util import dict_trie
from pythainlp import word_tokenize

def text_tokenize(Mention):
 new_words = {'คนละครึ่ง', 'ยืนยันตัวตน', 'เติมเงิน', 'เราชนะ', 'เป๋าตัง', 'แอปเป๋าตัง'}
 words = new_words.union(thai_words())
 custom_dictionary_trie = dict_trie(words)
 dataa = word_tokenize(Mention, custom_dict=custom_dictionary_trie, keep_whitespace=False)
 return dataa

After that, I call it from my text_process function, which also removes punctuation and stopwords.

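# thai_stopwords is assumed to be a collection of stopwords defined elsewhere,
# for example the result of pythainlp.corpus.thai_stopwords().
puncuations = r'''.?!,;:-_[]()'/<>{}\@#$&%~*ๆฯ'''

def text_process(Mention):
  # Drop punctuation characters ('ๆ' and 'ฯ' are already part of puncuations).
  final = "".join(u for u in Mention if u not in puncuations)
  final = text_tokenize(final)
  final = " ".join(final)
  # Drop stopwords; lower() has to be called, not just referenced.
  final = " ".join(word for word in final.split() if word.lower() not in thai_stopwords)
  return final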

dff['text_tokens'] = dff['Mention'].apply(text_process) 
dff

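The problem is that this function takes a very long time to run: it ran for 17 minutes without finishing. I tried replacing final = text_tokenize(final) with final = word_tokenize(final).

That took only 2 minutes, but I can't use it because I need to add the new custom dictionary. I know something is wrong, but I really don't know how to fix it.

I'm new to Python and NLP, so please help. P.S. Sorry for my poor English.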

Answer 1

I'm not familiar with Thai, but I assume that for tokenization you could also use a language-agnostic tokenization tool.

If you want to perform word tokenization, try the following example:

from nltk.tokenize import word_tokenize
s = '''This is the text I want to tokenize'''
word_tokenize(s)

>>> ['This', 'is', 'the', 'text', 'I', 'want', 'to', 'tokenize']
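
As for the slowdown itself, one likely cause (an assumption, since the question includes no profile) is that text_tokenize rebuilds the whole dictionary trie on every call, which means once per row under apply. The sketch below shows the build-once pattern with standard-library stand-ins only: build_dictionary, get_dictionary, tokenize, and build_count are hypothetical names, and the toy longest-match tokenizer is not PyThaiNLP's algorithm. With PyThaiNLP, the equivalent change would be to call dict_trie(words) once outside the function and pass the result as custom_dict.

```python
from functools import lru_cache

# Custom words from the question.
NEW_WORDS = frozenset({'คนละครึ่ง', 'ยืนยันตัวตน', 'เติมเงิน',
                       'เราชนะ', 'เป๋าตัง', 'แอปเป๋าตัง'})

build_count = 0  # only here to show how often the expensive step runs


def build_dictionary(words):
    """Hypothetical stand-in for an expensive build step such as dict_trie(words)."""
    global build_count
    build_count += 1
    return frozenset(words)


@lru_cache(maxsize=1)
def get_dictionary():
    # The build runs on the first call only; later calls reuse the cached result.
    return build_dictionary(NEW_WORDS)


def tokenize(text, dictionary):
    """Toy longest-match tokenizer, standing in for word_tokenize(..., custom_dict=...)."""
    tokens, i = [], 0
    longest = max(map(len, dictionary))
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(longest, len(text) - i), 0, -1):
            if size == 1 or text[i:i + size] in dictionary:
                tokens.append(text[i:i + size])
                i += size
                break
    return tokens


def text_tokenize(mention):
    # No per-call build: the dictionary is fetched from the cache.
    return tokenize(mention, get_dictionary())


rows = ['เติมเงินเป๋าตัง', 'เราชนะ', 'คนละครึ่ง']
results = [text_tokenize(row) for row in rows]
print(build_count)  # number of times the dictionary was built for all rows
```

Whether this closes the whole gap to the 2-minute run is not certain; timing text_tokenize on a few hundred rows before and after the change would confirm it.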

Answer 1, 2021-10-13 13:47:56
