
Problems using a custom vocabulary with TfidfVectorizer in scikit-learn

I'm trying to use a custom vocabulary in scikit-learn for some clustering tasks and I'm getting very weird results.

The program runs ok when not using a custom vocabulary and I'm satisfied with the cluster creation. However, I have already identified a group of words (around 24,000) that I would like to use as a custom vocabulary.

The words are stored in a SQL Server table. I have tried two approaches so far, and both give the same result in the end: the first builds a list, the second builds a dictionary. The code that creates them looks like this:

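import re

myvocab = {}      # dictionary version: term -> column index
vocabulary = []   # list version

count = 0

for row in results:
    # Replace HTML entities such as &amp; or &#39; with a space
    skillName = re.sub(r'&#?[a-z0-9]+;', ' ', row['SkillName'])
    skillName = unicode(skillName, "utf-8")  # Python 2: decode bytes to unicode
    vocabulary.append(skillName)             # using a list
    myvocab[str(skillName)] = count          # using a dictionary
    count += 1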

I then use the vocabulary (either the list version or the dictionary, both of them give the same result at the end) in the TfidfVectorizer as follows:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.8, stop_words='english',
                             ngram_range=(1, 2), vocabulary=myvocab)
X = vectorizer.fit_transform(dataset2)

The shape of X is (651, 24321) as I have 651 instances to cluster and 24321 words in the vocabulary.
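The link between the vocabulary and the shape of X can be checked on a toy corpus. The following is only a sketch: the documents and terms are made up, not taken from the real data. It passes the same vocabulary once as a list and once as a dict, and both should give one column per vocabulary entry, whether or not the entry occurs in any document. The entries are written in lowercase on purpose, because with the default lowercase=True the documents are lowercased before matching.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data; the real corpus has 651 documents.
docs = [
    "python and sql server experience",
    "java developer with sql server",
    "project management",
]

# The same custom vocabulary as a list and as a term -> column index dict.
vocab_list = ["python", "sql server", "java", "zendesk"]
vocab_dict = {term: idx for idx, term in enumerate(vocab_list)}

X_list = TfidfVectorizer(ngram_range=(1, 2), vocabulary=vocab_list).fit_transform(docs)
X_dict = TfidfVectorizer(ngram_range=(1, 2), vocabulary=vocab_dict).fit_transform(docs)

# One row per document and one column per vocabulary entry, including
# entries ("zendesk") that never occur in the corpus.
print(X_list.shape)

# The list and dict versions produce identical matrices.
print((X_list.toarray() == X_dict.toarray()).all())
```

The third document contains no vocabulary term, so its row is empty. That is the same situation as the instances in the question for which no word is found.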

If I print the contents of X, this is what I get:

(14, 11462)     1.0
(20, 10218)     1.0
(34, 11462)     1.0
(40, 11462)     0.852815313278
(40, 10218)     0.52221264006
(50, 11462)     1.0
(81, 11462)     1.0
(84, 11462)     1.0
(85, 11462)     1.0
(99, 10218)     1.0
(127, 11462)    1.0
(129, 10218)    1.0
(132, 11462)    1.0
(136, 11462)    1.0
(138, 11462)    1.0
(150, 11462)    1.0
(158, 11462)    1.0
(186, 11462)    1.0
(210, 11462)    1.0

:   :

As can be seen, for most of the instances only one word from the vocabulary is present (which is wrong, as there are at least 10), and for many instances not even one word is found. Also, the words found tend to be the same across instances, which doesn't make sense.
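One way to investigate this kind of result is to list which vocabulary entries never match any document. The sketch below uses hypothetical documents and a four-entry vocabulary. The mixed-case entries are included deliberately: with the default settings (lowercase=True and the default token pattern), documents are lowercased and punctuation is dropped before matching. An entry such as 'Zendesk' or '.NET' therefore cannot be produced by the default analyzer. Whether this applies to the real data is an assumption to verify, not a confirmed diagnosis.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents and vocabulary, for illustration only.
docs = ["Zendesk admin with .NET skills", "zendesk support agent"]
vocab = ["zendesk", "Zendesk", ".NET", "net"]

vectorizer = TfidfVectorizer(vocabulary=vocab)
X = vectorizer.fit_transform(docs)

# Number of documents in which each vocabulary column is non-zero.
doc_counts = np.asarray((X != 0).sum(axis=0)).ravel()

# Map column indices back to terms and collect the ones that never matched.
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}
never_matched = [index_to_term[int(i)] for i in np.flatnonzero(doc_counts == 0)]
print(never_matched)
```

Running the same check on the real X would show how many of the 24,321 entries can match at all under the current vectorizer settings.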

If I print the feature names using:

feature_names = np.asarray(vectorizer.get_feature_names())

I get:

['.NET' '10K' '21 CFR Part 11' ..., 'Zend Studio' 'Zendesk' 'Zenworks']

I must say that the program was running perfectly when the vocabulary used was the one determined from the input documents, so I strongly suspect that the problem is related to using a custom vocabulary.

Does anyone have a clue of what's happening?

(I'm not using a pipeline so this problem can't be related to a previous bug which has already been fixed)

One thing that strikes me as unusual is that you create the vectorizer with ngram_range=(1,2). With the standard tokenizer, this means a four-token feature such as '21 CFR Part 11' can never be produced. I suspect the 'missing' features are n-grams with n > 2. How many of your pre-selected vocabulary items are unigrams or bigrams?
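The n-gram limitation can be checked with the analyzer that the vectorizer builds. This is a minimal sketch using the default tokenizer and the feature names printed in the question. The variable names are illustrative, and stop_words is left out to keep the example focused on n-gram length.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

term = "21 CFR Part 11"

# Analyzer with the question's n-gram range (default tokenizer, lowercasing on).
analyzer_12 = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()
# Analyzer that also emits trigrams and 4-grams.
analyzer_14 = TfidfVectorizer(ngram_range=(1, 4)).build_analyzer()

grams_12 = analyzer_12(term)
grams_14 = analyzer_14(term)

print(grams_12)                       # unigrams and bigrams only
print("21 cfr part 11" in grams_12)   # the 4-gram cannot appear here
print("21 cfr part 11" in grams_14)   # it appears once n goes up to 4

# Count how many vocabulary entries need more than two tokens.
sample_vocab = [".NET", "10K", "21 CFR Part 11", "Zend Studio"]
tokenize = TfidfVectorizer().build_tokenizer()
too_long = [t for t in sample_vocab if len(tokenize(t)) > 2]
print(too_long)
```

Note that the analyzer emits the lowercased form '21 cfr part 11', so even with a wider ngram_range the vocabulary entry would need to be lowercase (or lowercase=False set) for it to match.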

I am pretty sure this is caused by the (arguably confusing) default value of min_df=2, which drops any feature from the vocabulary that does not occur at least twice in the dataset. Can you please confirm by explicitly setting min_df=1 in your code?
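Setting min_df explicitly and checking a rare term is straightforward. The sketch below uses a hypothetical three-document corpus in which 'java' and 'tester' each occur in only one document. Recent scikit-learn releases default to min_df=1, and whether min_df has any effect when a fixed vocabulary is supplied may depend on the installed version, so treat this as a check to run rather than a guaranteed fix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; "java" and "tester" each appear in a single document.
docs = ["python developer", "java developer", "python tester"]
vocab = ["python", "java", "tester"]

# min_df=1 keeps terms that occur in at least one document.
vectorizer = TfidfVectorizer(min_df=1, vocabulary=vocab)
X = vectorizer.fit_transform(docs)

print(X.shape)

# Number of documents with a non-zero weight for the rare term "java".
java_col = vectorizer.vocabulary_["java"]
print(X[:, java_col].nnz)
```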

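In a Python for-in loop, count += 1 does increment count on every iteration, so the indices stored in myvocab run from 0 upward and the counter is not the cause of the problem. If you prefer, for i, row in enumerate(results): gives the same index without a separate counter.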
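Whether the counter advances inside the loop can be checked directly. This sketch uses three made-up rows in place of the SQL results and compares a manual count += 1 with enumerate.

```python
# Hypothetical rows standing in for the SQL query results.
rows = [
    {"SkillName": "Zend Studio"},
    {"SkillName": "Zendesk"},
    {"SkillName": "Zenworks"},
]

# Manual counter, as in the question.
manual = {}
count = 0
for row in rows:
    manual[row["SkillName"]] = count
    count += 1

# Equivalent mapping built with enumerate.
with_enumerate = {row["SkillName"]: idx for idx, row in enumerate(rows)}

print(manual)
print(manual == with_enumerate)
```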


 