
sklearn feature_extraction fit parallelization

I'm trying to build a vectorizer for a text mining problem. The vocabulary should be fitted from a given set of files. However, the number of files used to build the dictionary vocabulary_ is relatively large (say 10^5). Is there a simple way to parallelize that?

Update: As I found out, there is a "manual" way... Unfortunately, it only works for min_df=1. Let me describe, by example, what I do for two cores: split your input into two chunks. Train one vectorizer per chunk (say vec1 and vec2), each on its own core (I used multiprocessing.Pool). Then,

from sklearn.feature_extraction.text import CountVectorizer

# Use sets to deduplicate tokens across the two fitted vocabularies
vocab = set(vec1.vocabulary_) | set(vec2.vocabulary_)
# Create the final vectorizer with the merged vocabulary
final_vec = CountVectorizer(vocabulary=vocab)
# Build the dictionary final_vec.vocabulary_
final_vec._validate_vocabulary()

will do the job.
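To make the whole procedure concrete, here is a minimal sketch of the two-core workflow described above. The file list, chunking scheme, and the fit_chunk helper are hypothetical; it assumes the default min_df=1 and that the files can be read via CountVectorizer(input='filename').

# Sketch only: hypothetical file names and helper, two cores, default min_df=1
from multiprocessing import Pool
from sklearn.feature_extraction.text import CountVectorizer

def fit_chunk(filenames):
    """Fit a CountVectorizer on one chunk of files and return its vocabulary as a set."""
    vec = CountVectorizer(input='filename')
    vec.fit(filenames)
    return set(vec.vocabulary_)

if __name__ == '__main__':
    files = ['doc_%d.txt' % i for i in range(100000)]  # placeholder corpus
    chunks = [files[::2], files[1::2]]                 # split the input into two chunks
    with Pool(processes=2) as pool:
        vocabs = pool.map(fit_chunk, chunks)           # one vectorizer per core
    vocab = set().union(*vocabs)                       # merge the per-chunk vocabularies
    final_vec = CountVectorizer(input='filename', vocabulary=vocab)
    final_vec._validate_vocabulary()                   # builds final_vec.vocabulary_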

You can use MLlib, the machine learning library included in Apache Spark, which will handle the distribution across nodes.

Here's a tutorial on how to use it for feature extraction.
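For instance, a minimal PySpark sketch of fitting a vocabulary with MLlib's CountVectorizer could look like the following; the "data/*.txt" path and column names are placeholders, not part of the original answer.

# Sketch only: distributed vocabulary fitting with Spark MLlib
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer

spark = SparkSession.builder.appName("vocab-fitting").getOrCreate()

# Read the text files into a DataFrame (one row per line) and rename the column
df = spark.read.text("data/*.txt").withColumnRenamed("value", "text")

# Split raw text into tokens, then fit a CountVectorizer across the cluster
tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
cv_model = CountVectorizer(inputCol="words", outputCol="features", minDF=1.0).fit(tokens)

print(len(cv_model.vocabulary))  # the fitted vocabulary, as a list of terms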

You can also check the section "How to optimize for speed" in the sklearn documentation to get some inspiration.
