
sklearn feature_extraction fit parallelization

I'm trying to build a vectorizer for a text mining problem. The vocabulary should be fitted from a given set of files. However, the number of files used to build the dictionary vocabulary_ is relatively large (say 10^5). Is there a simple way to parallelize that?

Update: As I found out, there is a "manual" way... Unfortunately, it only works for min_df=1. Let me describe, by example, what I do for two cores: split your input into two chunks. Train one vectorizer per chunk (say vec1 and vec2), each on its own core (I used multiprocessing.Pool). Then,

from sklearn.feature_extraction.text import CountVectorizer

# Use sets to deduplicate tokens across the two fitted vocabularies
vocab = set(vec1.vocabulary_) | set(vec2.vocabulary_)
# Create the final vectorizer with the merged vocabulary
final_vec = CountVectorizer(vocabulary=vocab)
# Build the dictionary final_vec.vocabulary_
final_vec._validate_vocabulary()

will do the job.
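To make the whole procedure concrete, here is a minimal sketch of the two-core workflow described above. The file list, chunking scheme, and the fit_chunk helper are hypothetical; it assumes the default min_df=1 and that the files can be read via CountVectorizer(input='filename').

# Sketch only: hypothetical file names and helper, two cores, default min_df=1
from multiprocessing import Pool
from sklearn.feature_extraction.text import CountVectorizer

def fit_chunk(filenames):
    """Fit a CountVectorizer on one chunk of files and return its vocabulary as a set."""
    vec = CountVectorizer(input='filename')
    vec.fit(filenames)
    return set(vec.vocabulary_)

if __name__ == '__main__':
    files = ['doc_%d.txt' % i for i in range(100000)]  # placeholder corpus
    chunks = [files[::2], files[1::2]]                 # split the input into two chunks
    with Pool(processes=2) as pool:
        vocabs = pool.map(fit_chunk, chunks)           # one vectorizer per core
    vocab = set().union(*vocabs)                       # merge the per-chunk vocabularies
    final_vec = CountVectorizer(input='filename', vocabulary=vocab)
    final_vec._validate_vocabulary()                   # builds final_vec.vocabulary_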

You can use MLlib, the machine learning library included in Apache Spark, which will handle the distribution across nodes.

Here's a tutorial on how to use it for feature extraction.
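For instance, a minimal PySpark sketch of fitting a vocabulary with MLlib's CountVectorizer could look like the following; the "data/*.txt" path and column names are placeholders, not part of the original answer.

# Sketch only: distributed vocabulary fitting with Spark MLlib
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer

spark = SparkSession.builder.appName("vocab-fitting").getOrCreate()

# Read the text files into a DataFrame (one row per line) and rename the column
df = spark.read.text("data/*.txt").withColumnRenamed("value", "text")

# Split raw text into tokens, then fit a CountVectorizer across the cluster
tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
cv_model = CountVectorizer(inputCol="words", outputCol="features", minDF=1.0).fit(tokens)

print(len(cv_model.vocabulary))  # the fitted vocabulary, as a list of terms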

You can also check the section "How to optimize for speed" in the sklearn documentation to get some inspiration.
