I want to create a custom CountVectorizer with Python and the scikit-learn library. I wrote code that extracts phrases from a Pandas DataFrame with the TextBlob library, and I want my vectorizer to count those phrases.
My code:
from textblob import TextBlob
import pandas as pd
my_list = ["I want to buy a nice bike for my girl. She broke her old bike last year.",
"I had a great time watching that movie last night. We shouuld do the same next week",
"Where can I buy some tasty apples and oranges? I want to head healthy food",
"The songs from this bend are boring, lets play some other music from some good bands",
"If you buy this now, you will get 3 different products for free in the next 10 days.",
"I am living in a small house in France, and my wish is to learn how to ski and snowboad",
"It is time to invest in some tech stock. The stock market is will become very hot in the next few months",
"This player won all 4 grand slam tournaments last year. He is the best player in the world!"]
df = pd.DataFrame({"TEXT": my_list})
final_list = []
for text in df.TEXT:
    blob = TextBlob(text)
    result_list = blob.noun_phrases
    print(result_list)
    final_list.extend(result_list)
print(final_list)
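One caveat if you later feed final_list into CountVectorizer(vocabulary=...): the vocabulary must not contain duplicate terms, and extend can collect the same phrase from several documents. A minimal dedupe sketch in plain Python (the sample phrases here are illustrative):

```python
# Deduplicate phrases while preserving first-seen order;
# CountVectorizer raises ValueError on duplicate vocabulary entries.
final_list = ["nice bike", "old bike", "nice bike", "great time"]  # example with a repeat
unique_phrases = list(dict.fromkeys(final_list))
print(unique_phrases)  # ['nice bike', 'old bike', 'great time']
```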
I know that I can create a CountVectorizer like this when I'm working with scikit-learn:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

features = df.iloc[:, :-1]
results = df.iloc[:, -1]
# vectorizer
transformerVectoriser = ColumnTransformer(
    transformers=[('vector title',
                   CountVectorizer(analyzer='word', ngram_range=(2, 4),
                                   max_features=1000, stop_words='english'),
                   'TEXT')])
clf = RandomForestClassifier(max_depth=75, n_estimators=125, random_state=42)
pipeline = Pipeline([('transformer', transformerVectoriser),
                     ('classifier', clf)])
cv_score_acc = cross_val_score(pipeline, features, results, cv=5, scoring='accuracy')
But how can I create a vectorizer from the phrases that I extracted previously? For example, the phrases extracted from the text in my_list are:
['nice bike', 'old bike', 'great time', 'tasty apples', 'healthy food', 'good bands', 'different products', 'small house', 'france', 'tech stock', 'stock market', 'grand slam tournaments']
How can I create a custom CountVectorizer where the features are the phrases listed above?
If you initialize CountVectorizer(vocabulary=noun_phrases, ...)
you should get the desired effect:
noun_phrases = ['nice bike', 'old bike', 'great time', 'tasty apples', 'healthy food', 'good bands', 'different products', 'small house', 'france', 'tech stock', 'stock market', 'grand slam tournaments']
cv = CountVectorizer(analyzer='word', vocabulary=noun_phrases, ngram_range=(2, 4))
res = cv.transform(my_list)
res.todense()
>>>
matrix([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
Note that with ngram_range=(2, 4) only 2- to 4-word n-grams are generated, so the single-word phrase 'france' is never matched and its column stays zero. Use ngram_range=(1, 4) if one-word phrases should be counted too.
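To wire this fixed vocabulary back into the ColumnTransformer pipeline from the question, something like the following sketch should work (the two-document DataFrame and two-phrase vocabulary are toy stand-ins for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the question's df and extracted phrase list
df = pd.DataFrame({"TEXT": ["I want to buy a nice bike for my girl.",
                            "I had a great time watching that movie."]})
noun_phrases = ["nice bike", "great time"]

vectorizer = CountVectorizer(analyzer="word", vocabulary=noun_phrases,
                             ngram_range=(2, 4))
transformer = ColumnTransformer(transformers=[("vector title", vectorizer, "TEXT")])

X = transformer.fit_transform(df)
X = X.toarray() if hasattr(X, "toarray") else X  # may come back sparse or dense
print(X)  # one row per document, one column per vocabulary phrase
```

Because the vocabulary is fixed, fit learns nothing new from the data; the feature columns are always the phrases you passed in, in that order.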
You can customize the tokenizer function of sklearn's CountVectorizer:
def noun_phrases_tokenizer(text):
    return TextBlob(text).noun_phrases
count_vectorizer = CountVectorizer(tokenizer=noun_phrases_tokenizer)
transformerVectoriser = ColumnTransformer(transformers=[('count', count_vectorizer, 'TEXT')])
transformerVectoriser.fit_transform(df)
print(transformerVectoriser.transformers_[0][1].get_feature_names_out())
# ['different products', 'good bands', 'grand slam tournaments', ...]
Update: Add lemmatization
import textblob
def lemmatize_noun_phrase(phrase):
    # phrase.lemmatize() does not handle multi-word phrases correctly,
    # so lemmatize each word separately and rejoin
    return " ".join([textblob.Word(w).lemmatize() for w in phrase.split(" ")])
def custom_tokenizer(text):
    phrases = textblob.TextBlob(text).noun_phrases
    return [lemmatize_noun_phrase(p) for p in phrases]
print(custom_tokenizer("I love green apples"))  # ["green apple"]
count_vectorizer = CountVectorizer(tokenizer=custom_tokenizer)
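The per-word lemmatization pattern can be checked in isolation with a stub lemmatizer (textblob.Word(w).lemmatize() itself needs the WordNet corpus downloaded); the dictionary below is purely illustrative:

```python
# Stub standing in for textblob.Word(w).lemmatize(); illustrative entries only
STUB_LEMMAS = {"apples": "apple", "bikes": "bike"}

def lemmatize_word(w):
    return STUB_LEMMAS.get(w, w)

def lemmatize_phrase(phrase):
    # Lemmatize each word separately, then rejoin into one phrase token
    return " ".join(lemmatize_word(w) for w in phrase.split(" "))

print(lemmatize_phrase("green apples"))  # -> 'green apple'
```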