
Creating a custom CountVectorizer with scikit-learn

I want to create a custom CountVectorizer with Python and the scikit-learn library. I wrote code that extracts phrases from a Pandas DataFrame with the TextBlob library, and I want my vectorizer to count those phrases.

My code:

from textblob import TextBlob
import pandas as pd

my_list = ["I want to buy a nice bike for my girl. She broke her old bike last year.", 
        "I had a great time watching that movie last night. We shouuld do the same next week", 
        "Where can I buy some tasty apples and oranges? I want to head healthy food", 
        "The songs from this bend are boring, lets play some other music from some good bands", 
        "If you buy this now, you will get 3 different products for free in the next 10 days.", 
        "I am living in a small house in France, and my wish is to learn how to ski and snowboad",
        "It is time to invest in some tech stock. The stock market is will become very hot in the next few months",
        "This player won all 4 grand slam tournaments last year. He is the best player in the world!"]

df = pd.DataFrame({"TEXT": my_list})

final_list = []
for text in df.TEXT:
    
    blob = TextBlob(text)
    result_list = blob.noun_phrases
    print(result_list)
    final_list.extend(result_list)

print(final_list)

I know that I can create a CountVectorizer like this when I'm working with scikit-learn:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

features = df.iloc[:, :-1]
results = df.iloc[:, -1]

# vectorizer
transformerVectoriser = ColumnTransformer(
    transformers=[('vector title',
                   CountVectorizer(analyzer='word', ngram_range=(2, 4),
                                   max_features=1000, stop_words='english'),
                   'TEXT')])

clf = RandomForestClassifier(max_depth = 75, n_estimators = 125, random_state = 42)
pipeline = Pipeline([('transformer', transformerVectoriser),
                     ('classifier', clf)])


cv_score_acc = cross_val_score(pipeline, features, results, cv=5, scoring = 'accuracy')

But how can I create a vectorizer from the phrases that I extracted previously? For example, the phrases extracted from the text in my_list are:

['nice bike', 'old bike', 'great time', 'tasty apples', 'healthy food', 'good bands', 'different products', 'small house', 'france', 'tech stock', 'stock market', 'grand slam tournaments']

How can I create a custom CountVectorizer whose features are the phrases listed above?

If you initialize CountVectorizer(vocabulary=noun_phrases, ...), you should get the desired effect:

from sklearn.feature_extraction.text import CountVectorizer

noun_phrases = ['nice bike', 'old bike', 'great time', 'tasty apples', 'healthy food', 'good bands', 'different products', 'small house', 'france', 'tech stock', 'stock market', 'grand slam tournaments']

# note: 'france' is a single word, so with ngram_range=(2, 4) its column
# always stays 0 (only 2- to 4-word n-grams are produced and matched)
cv = CountVectorizer(analyzer='word', vocabulary=noun_phrases, ngram_range=(2, 4))
res = cv.transform(my_list)
res.todense()

>>>
matrix([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
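
If you want to slot this fixed-vocabulary vectorizer into the cross-validated pipeline from the question, a minimal sketch could look like the following (it reuses the question's ColumnTransformer/Pipeline structure and assumes features and results are defined as in the question):

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Same pipeline shape as in the question; the only change is that the
# vectorizer now counts the extracted noun phrases (vocabulary=noun_phrases)
# instead of the 1000 most frequent n-grams
transformerVectoriser = ColumnTransformer(transformers=[('vector title', cv, 'TEXT')])

pipeline = Pipeline([('transformer', transformerVectoriser),
                     ('classifier', RandomForestClassifier(max_depth=75, n_estimators=125, random_state=42))])

cv_score_acc = cross_val_score(pipeline, features, results, cv=5, scoring='accuracy')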

You can customize the tokenizer function of sklearn's CountVectorizer:

from textblob import TextBlob
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

def noun_phrases_tokenizer(text):
    # noun_phrases returns a WordList of lowercased noun phrases
    return TextBlob(text).noun_phrases

count_vectorizer = CountVectorizer(tokenizer=noun_phrases_tokenizer)
transformerVectoriser = ColumnTransformer(transformers=[('count', count_vectorizer, 'TEXT')])
transformerVectoriser.fit_transform(df)

print(transformerVectoriser.transformers_[0][1].get_feature_names_out())
# ['different products', 'good bands', 'grand slam tournaments', ...]
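
As a side note, the fitted vectorizer can also be looked up by the name it was registered under ('count') via named_transformers_, which is a bit more robust than indexing into transformers_ positionally:

# Equivalent, name-based lookup of the fitted CountVectorizer
print(transformerVectoriser.named_transformers_['count'].get_feature_names_out())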

Update: Add lemmatization

import textblob

def lemmatize_noun_phrase(phrase):
    # phrase.lemmatize() does not work correctly on multi-word phrases,
    # so lemmatize each word individually
    return " ".join([textblob.Word(w).lemmatize() for w in phrase.split(" ")])

def custom_tokenizer(text):
    phrases = textblob.TextBlob(text).noun_phrases
    return [lemmatize_noun_phrase(p) for p in phrases]

print(custom_tokenizer("I love green apples"))  # ["green apple"]
count_vectorizer = CountVectorizer(tokenizer=custom_tokenizer)
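
A quick usage sketch, fitting the lemmatizing vectorizer on the raw texts from the question (the exact features depend on TextBlob's phrase extraction, so the comment below is indicative, not guaranteed):

count_vectorizer.fit(my_list)
print(count_vectorizer.get_feature_names_out())
# plural nouns should now come out lemmatized, e.g. 'grand slam tournament'
# and 'tasty apple' rather than 'grand slam tournaments' / 'tasty apples'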
