
Online version of scikit-learn's TfidfVectorizer

I'm looking to use scikit-learn's HashingVectorizer because it's a great fit for online learning problems (new tokens in text are guaranteed to map to a "bucket"). Unfortunately the implementation included in scikit-learn doesn't seem to include support for tf-idf features. Is passing the vectorizer output through a TfidfTransformer the only way to make online updates work with tf-idf features, or is there a more elegant solution out there?

Intrinsically, you cannot use TF-IDF in an online fashion, because the IDF of every past feature changes with each new document. That would mean revisiting and retraining on all the previous documents, which would no longer be online.
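To make that concrete, here is a minimal sketch using the plain textbook formula idf = log(N / df). The helper `idf` and the toy documents are made up for illustration; scikit-learn itself uses a smoothed variant, so the exact numbers would differ there.

```python
import math

def idf(term, docs):
    """Plain IDF: log(N / df). Assumes the term occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

# Two toy documents, each represented as a set of tokens.
docs = [{"online", "learning"}, {"online", "hashing"}]
before = idf("learning", docs)

# A new document arrives that does not mention "learning" at all.
docs.append({"hashing", "trick"})
after = idf("learning", docs)

print(before, after)
```

The first document never changed, yet the weight of "learning" in it did, so any TF-IDF vector stored for that document is now stale.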

There may be some approximations, but you would have to implement them yourself.

You can do "online" TF-IDF, contrary to what was said in the accepted answer.

In fact, every search engine (e.g. Lucene) does.

What does not work is assuming that you have precomputed TF-IDF vectors in memory.

Search engines such as Lucene naturally avoid keeping all data in memory. Instead, they load one column (the inverted list of a single term) at a time, which due to sparsity is not a lot of data. The IDF arises trivially from the length of that inverted list.
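A rough sketch of that idea in plain Python follows. This is not how Lucene is implemented; the class `InvertedIndex` and its methods are hypothetical, and the smoothed IDF formula is just one common choice.

```python
import math
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index: one posting list ("column") per term."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.n_docs = 0

    def add(self, tokens):
        """Index one new document; nothing already stored is rewritten."""
        doc_id = self.n_docs
        self.n_docs += 1
        for tok in tokens:
            self.postings[tok][doc_id] = self.postings[tok].get(doc_id, 0) + 1
        return doc_id

    def idf(self, term):
        """Document frequency is simply the length of the term's posting list."""
        df = len(self.postings.get(term, ()))
        # Smoothed IDF (an assumption; other variants work just as well).
        return math.log((1 + self.n_docs) / (1 + df)) + 1.0

index = InvertedIndex()
index.add("online learning with hashing".split())
index.add("online tf idf".split())

common = index.idf("online")   # appears in both documents
rare = index.idf("hashing")    # appears in only one
print(common, rare)
```

Adding a document only appends to posting lists, so the IDF is always current without any stored value being recomputed.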

The point is, you don't transform your data into TF-IDF, and then do standard cosine similarity.

Instead, you use the current IDF weights when computing similarities, using a weighted cosine similarity (often modified with additional weighting, boosting terms, penalizing terms, etc.).
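As an illustration, the sketch below stores only raw term counts and applies whatever the IDF weights happen to be at the moment the similarity is computed. The class `OnlineTfIdf` is hypothetical, the smoothed IDF is an assumed variant, and none of the boosting or penalizing that real search engines add is included.

```python
import math
from collections import Counter

class OnlineTfIdf:
    """Keeps raw term counts; IDF is applied only at similarity time."""

    def __init__(self):
        self.df = Counter()   # term -> number of documents containing it
        self.n_docs = 0
        self.docs = []        # raw term counts per document, never rewritten

    def add(self, tokens):
        counts = Counter(tokens)
        self.docs.append(counts)
        self.n_docs += 1
        self.df.update(counts.keys())  # each term counted once per document

    def idf(self, term):
        # Smoothed IDF (an assumption; any IDF variant could be plugged in).
        return math.log((1 + self.n_docs) / (1 + self.df[term])) + 1.0

    def similarity(self, a, b):
        """Cosine similarity of two count vectors, weighted by current IDF."""
        w = {t: self.idf(t) ** 2 for t in set(a) | set(b)}
        dot = sum(w[t] * a[t] * b[t] for t in a.keys() & b.keys())
        norm_a = math.sqrt(sum(w[t] * c * c for t, c in a.items()))
        norm_b = math.sqrt(sum(w[t] * c * c for t, c in b.items()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

model = OnlineTfIdf()
model.add("online learning".split())
model.add("online hashing".split())
s_before = model.similarity(model.docs[0], model.docs[1])

# More documents arrive; the two stored count vectors are left untouched.
model.add("online search".split())
model.add("online index".split())
s_after = model.similarity(model.docs[0], model.docs[1])
print(s_before, s_after)
```

The similarity between the first two documents shifts as the corpus grows, even though their stored counts are never modified.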

This approach works with essentially any algorithm that allows attribute weighting at evaluation time. Many algorithms do, but unfortunately very few implementations are flexible enough: most expect you to multiply the weights into your data matrix before training.
