CountVectorizer fit_transform takes too long

Question

def tokenize(text):
    text = re.sub('[^a-zA-Z0-9]', ' ', text)
    words = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(w.lower().strip()) for w in words if w not in stopwords.words()]
    return words

pipeline = Pipeline([
('vect', CountVectorizer(tokenizer=tokenize)),
#     ('tfidf', TfidfTransformer()),
#     ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

Given the code above, CountVectorizer takes too long to fit: it ran for 60 minutes without finishing. If I remove the condition if w not in stopwords.words(), fitting takes only 5 minutes. What could be the problem, and what is a possible solution? I am using the stop words from nltk.corpus.

Note: the tokenize function works fine when I call it on its own with any text input.

Thank you.

Answer 1

My first guess is that stopwords.words() does some heavy work on each call, and the list comprehension calls it once for every token. You could try caching its result. The same is true for the lemmatizer: calling the constructor only once can speed up the code significantly.

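# Build these once at module level instead of once per token.
# A set also makes each membership test a hash lookup rather than a list scan.
stop_set = set(stopwords.words())
lemmatizer = WordNetLemmatizer()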

def tokenize(text):
    text = re.sub('[^a-zA-Z0-9]', ' ', text)
    words = word_tokenize(text)
    words = [lemmatizer.lemmatize(w.lower().strip()) for w in words if w not in stop_set]
    return words
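
To get a feel for the difference without installing NLTK, here is a rough sketch. load_words is a hypothetical stand-in for stopwords.words(), written on the assumption (the guess above) that the real call rebuilds a list every time; the sizes are made up and the timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for stopwords.words(): rebuilds a large list on every call.
def load_words():
    return [f"word{i}" for i in range(10_000)]

# 200 sample tokens; the first 100 appear in the word list, the rest do not.
tokens = [f"word{i}" for i in range(0, 20_000, 100)]

def filter_slow(tokens):
    # Calls load_words() once per token, then scans the resulting list linearly.
    return [t for t in tokens if t not in load_words()]

# Built a single time; membership tests are hash lookups.
STOP_SET = set(load_words())

def filter_fast(tokens):
    return [t for t in tokens if t not in STOP_SET]

slow = timeit.timeit(lambda: filter_slow(tokens), number=1)
fast = timeit.timeit(lambda: filter_fast(tokens), number=1)
print(f"reload per token: {slow:.4f}s, cached set: {fast:.6f}s")
```

Both functions return the same tokens; only the cost of the lookup changes.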

In my experience, it can help to cache even the lemmatization function, for example:

from functools import lru_cache

stop_set = set(stopwords.words())
lemmatizer = WordNetLemmatizer()

@lru_cache(maxsize=10000)
def lemmatize(word):
    return lemmatizer.lemmatize(word.lower().strip())

def tokenize(text):
    text = re.sub('[^a-zA-Z0-9]+', ' ', text)
    words = [lemmatize(w) for w in word_tokenize(text)]
    return [w for w in words if w not in stop_set]
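
If you want to check whether the cache is actually paying off, lru_cache exposes hit and miss counters through cache_info(). The sketch below uses slow_lemmatize, a toy stand-in I made up so the example runs without NLTK; it is not how WordNetLemmatizer works.

```python
from functools import lru_cache

# Toy stand-in for lemmatizer.lemmatize (hypothetical, for illustration only).
def slow_lemmatize(word):
    return word.lower().strip().rstrip("s")

@lru_cache(maxsize=10000)
def lemmatize(word):
    return slow_lemmatize(word)

words = "cats dogs cats birds dogs cats".split()
lemmas = [lemmatize(w) for w in words]

# hits counts calls answered from the cache; misses counts calls that ran the function.
info = lemmatize.cache_info()
print(lemmas)
print(info)
```

On real text most tokens repeat often, so a high hit count relative to misses suggests the cache is worth keeping; if misses dominate, a larger maxsize may help.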

Solution 1
1 accepted 2020-06-01 22:00:04
