
CountVectorizer takes too long to fit_transform

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def tokenize(text):
    text = re.sub('[^a-zA-Z0-9]', ' ', text)
    words = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(w.lower().strip()) for w in words if w not in stopwords.words()]
    return words

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    # ('tfidf', TfidfTransformer()),
    # ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

Given the code above, CountVectorizer takes too long to fit (it ran for 60 minutes without finishing), but if I remove the clause if w not in stopwords.words(), fitting takes only 5 minutes. What could be the problem, and what is a possible solution? I am using the stop words from nltk.corpus.

Note: the tokenize function works fine when used on its own on any text input.

Thank you

My first guess is that the function stopwords.words() does some heavy work on each call, so you could try caching its result. The same is true for the lemmatizer: constructing it only once can speed up the code significantly.

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Build the stop-word set and the lemmatizer once, instead of on every call.
stop_set = set(stopwords.words())
lemmatizer = WordNetLemmatizer()

def tokenize(text):
    text = re.sub('[^a-zA-Z0-9]', ' ', text)
    words = word_tokenize(text)
    words = [lemmatizer.lemmatize(w.lower().strip()) for w in words if w not in stop_set]
    return words
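
To see where the time goes: each call to stopwords.words() re-reads the stop-word lists from the nltk data files (and, with no language argument, it returns the lists for every language nltk ships), and the original list comprehension calls it once per token. A rough timing sketch of the difference (the sample text is invented; exact numbers depend on your machine and your nltk data):

import time
from nltk.corpus import stopwords

stop_set = set(stopwords.words())
tokens = ("the quick brown foxes were jumping over the lazy dogs " * 200).split()

start = time.perf_counter()
kept = [w for w in tokens if w not in stopwords.words()]  # list rebuilt for every token
print(f"per-call stopwords.words(): {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
kept = [w for w in tokens if w not in stop_set]  # one cached set, O(1) lookups
print(f"cached set: {time.perf_counter() - start:.4f} s")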

In my experience, it can help to cache even the lemmatization function itself, for example:

from functools import lru_cache

stop_set = set(stopwords.words())
lemmatizer = WordNetLemmatizer()

@lru_cache(maxsize=10000)
def lemmatize(word):
    # memoize the lemma computed for each distinct surface form
    return lemmatizer.lemmatize(word.lower().strip())

def tokenize(text):
    text = re.sub('[^a-zA-Z0-9]+', ' ', text)
    words = [lemmatize(w) for w in word_tokenize(text)]
    return [w for w in words if w not in stop_set]
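
For completeness, a quick sanity check of the cached tokenizer inside CountVectorizer on a toy corpus (the documents below are invented; on scikit-learn older than 1.0, use get_feature_names() instead of get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The cats were sitting on the mat.",
    "A dog was barking at the cats all night.",
]

vect = CountVectorizer(tokenizer=tokenize)
X = vect.fit_transform(docs)
print(vect.get_feature_names_out())
print(X.toarray())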
