CountVectorizer takes too long to fit_transform
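import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

def tokenize(text):
    text = re.sub('[^a-zA-Z0-9]', ' ', text)
    words = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(w.lower().strip()) for w in words if w not in stopwords.words()]
    return words

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    # ('tfidf', TfidfTransformer()),
    # ('clf', MultiOutputClassifier(RandomForestClassifier()))
])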
Given the code above, CountVectorizer takes too long to fit: it ran for 60 minutes without finishing. If I remove the condition if w not in stopwords.words(), fitting takes just 5 minutes. What could be the problem, and what is a possible solution? I am using the stop words from nltk.corpus.
Note: the tokenize function works fine when I call it separately on any text input.
Thank you
My first guess is that stopwords.words() does some heavy work on each call, and the list comprehension calls it once per token. You could try caching its result. Storing the stop words in a set also makes each membership test faster than scanning a list. The same is true for the lemmatizer: calling the constructor only once can speed up the code significantly.
stop_set = set(stopwords.words())
lemmatizer = WordNetLemmatizer()

def tokenize(text):
    text = re.sub('[^a-zA-Z0-9]', ' ', text)
    words = word_tokenize(text)
    words = [lemmatizer.lemmatize(w.lower().strip()) for w in words if w not in stop_set]
    return words
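To see why hoisting the stop word list out of the comprehension matters, here is a minimal sketch that uses only the standard library. The function load_stopwords and its tiny word list are hypothetical stand-ins for stopwords.words(); they exist only to count how often the loader runs, and say nothing about how NLTK is implemented.

```python
# Counts how many times the (hypothetical) loader is invoked.
CALLS = {"load": 0}

def load_stopwords():
    # Stand-in for an expensive call such as stopwords.words().
    CALLS["load"] += 1
    return ["the", "a", "of", "and", "to"]

# Load once, up front, and keep the result in a set for fast lookups.
STOP_SET = set(load_stopwords())

def filter_fast(tokens):
    # Reuses the precomputed set; the loader is never called here.
    return [t for t in tokens if t not in STOP_SET]

def filter_slow(tokens):
    # Calls the loader again for every token, like the code in the question.
    return [t for t in tokens if t not in load_stopwords()]

tokens = "the quick fox and the lazy dog".split()

fast = filter_fast(tokens)
calls_after_fast = CALLS["load"]   # still only the one-time load

slow = filter_slow(tokens)
calls_after_slow = CALLS["load"]   # one extra load per token

print(fast == slow, calls_after_fast, calls_after_slow)
```

Both filters return the same tokens, but the slow variant invokes the loader once per token, so its cost grows with the size of the corpus rather than staying constant.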
In my experience, it can even help to cache the lemmatization function, like this:
from functools import lru_cache

stop_set = set(stopwords.words())
lemmatizer = WordNetLemmatizer()

# Cache results so each distinct word is lemmatized only once.
@lru_cache(maxsize=10000)
def lemmatize(word):
    return lemmatizer.lemmatize(word.lower().strip())

def tokenize(text):
    text = re.sub('[^a-zA-Z0-9]+', ' ', text)
    words = [lemmatize(w) for w in word_tokenize(text)]
    return [w for w in words if w not in stop_set]
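The effect of lru_cache can be checked with cache_info(). The sketch below is an illustration under stated assumptions: slow_normalize is a hypothetical stand-in for lemmatizer.lemmatize() that only lowercases and strips, so the example runs without NLTK.

```python
from functools import lru_cache

# Counts how often the underlying (uncached) function actually runs.
RAW_CALLS = {"n": 0}

def slow_normalize(word):
    # Hypothetical stand-in for an expensive lemmatizer call.
    RAW_CALLS["n"] += 1
    return word.lower().strip()

@lru_cache(maxsize=10000)
def cached_normalize(word):
    # Repeated inputs are answered from the cache.
    return slow_normalize(word)

words = ["Cats", "dogs", "Cats", "cats", "dogs", "Cats"]
result = [cached_normalize(w) for w in words]

info = cached_normalize.cache_info()
print(result)
print(info.hits, info.misses, RAW_CALLS["n"])
```

Note that the cache is keyed on the raw input string, so "Cats" and "cats" occupy separate entries even though they normalize to the same output. On real text, where a small vocabulary accounts for most tokens, the hit rate is typically high.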