Creating a custom CountVectorizer with Scikit-Learn
I want to create a custom CountVectorizer with Python and the Scikit-Learn library. I wrote code that extracts noun phrases from a Pandas DataFrame with the TextBlob library, and I want my vectorizer to count those phrases.
My code:
from textblob import TextBlob
import pandas as pd
my_list = ["I want to buy a nice bike for my girl. She broke her old bike last year.",
"I had a great time watching that movie last night. We shouuld do the same next week",
"Where can I buy some tasty apples and oranges? I want to head healthy food",
"The songs from this bend are boring, lets play some other music from some good bands",
"If you buy this now, you will get 3 different products for free in the next 10 days.",
"I am living in a small house in France, and my wish is to learn how to ski and snowboad",
"It is time to invest in some tech stock. The stock market is will become very hot in the next few months",
"This player won all 4 grand slam tournaments last year. He is the best player in the world!"]
df = pd.DataFrame({"TEXT": my_list})
final_list = []
for text in df.TEXT:
    blob = TextBlob(text)
    result_list = blob.noun_phrases
    print(result_list)
    final_list.extend(result_list)
print(final_list)
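One detail worth noting before reusing these phrases: noun_phrases can return the same phrase for several documents, and CountVectorizer rejects a vocabulary that contains duplicates, so deduplicating final_list first helps. A minimal sketch (the hardcoded list stands in for the extraction result above):

```python
# Stand-in for the final_list produced by TextBlob's noun_phrases above.
final_list = ["nice bike", "old bike", "nice bike", "great time"]

# Deduplicate while preserving first-seen order, so the phrases can later be
# passed to CountVectorizer(vocabulary=...).
vocabulary = list(dict.fromkeys(final_list))
print(vocabulary)  # ['nice bike', 'old bike', 'great time']
```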
I know that I can create a CountVectorizer like this when I'm working with Scikit-Learn:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

features = df.iloc[:, :-1]
results = df.iloc[:, -1]

# vectorizer
transformerVectoriser = ColumnTransformer(transformers=[('vector title', CountVectorizer(analyzer='word', ngram_range=(2, 4), max_features=1000, stop_words='english'), 'TEXT')])

clf = RandomForestClassifier(max_depth=75, n_estimators=125, random_state=42)
pipeline = Pipeline([('transformer', transformerVectoriser),
                     ('classifier', clf)])

cv_score_acc = cross_val_score(pipeline, features, results, cv=5, scoring='accuracy')
But how can I create a vectorizer from the phrases that I extracted previously? For example, the phrases extracted from the text in my_list are:
['nice bike', 'old bike', 'great time', 'tasty apples', 'healthy food', 'good bands', 'different products', 'small house', 'france', 'tech stock', 'stock market', 'grand slam tournaments']
How can I create a custom count vectorizer where the features are the phrases that I listed above?
If you initialize CountVectorizer(vocabulary=noun_phrases, ...) you should get the desired effect:
noun_phrases = ['nice bike', 'old bike', 'great time', 'tasty apples', 'healthy food', 'good bands', 'different products', 'small house', 'france', 'tech stock', 'stock market', 'grand slam tournaments']
cv = CountVectorizer(analyzer='word', vocabulary=noun_phrases, ngram_range=(2, 4))
res = cv.transform(my_list)
res.todense()
>>>
matrix([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
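If the goal is to keep the cross-validated pipeline from the question, the fixed-vocabulary vectorizer can be dropped into the same ColumnTransformer. A sketch, under the assumption that max_features and stop_words are no longer needed once the vocabulary is fixed (the shortened phrase list is for illustration):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

noun_phrases = ['nice bike', 'old bike', 'great time']  # shortened list for illustration

# Same pipeline shape as in the question, but the vectorizer now counts only
# the pre-extracted phrases instead of learning n-grams from the data.
transformerVectoriser = ColumnTransformer(transformers=[
    ('vector title',
     CountVectorizer(analyzer='word', vocabulary=noun_phrases, ngram_range=(2, 4)),
     'TEXT')])

pipeline = Pipeline([('transformer', transformerVectoriser),
                     ('classifier', RandomForestClassifier(max_depth=75, n_estimators=125, random_state=42))])
```

With a fixed vocabulary, fitting the vectorizer learns nothing new, so the feature columns stay stable across cross-validation folds.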
You can customize the tokenizer function of sklearn's CountVectorizer:
def noun_phrases_tokenizer(text):
    return TextBlob(text).noun_phrases
count_vectorizer = CountVectorizer(tokenizer=noun_phrases_tokenizer)
transformerVectoriser = ColumnTransformer(transformers=[('count', count_vectorizer, 'TEXT')])
transformerVectoriser.fit_transform(df)
print(transformerVectoriser.transformers_[0][1].get_feature_names_out())
# ['different products', 'good bands', 'grand slam tournaments', ...]
Update: add lemmatization
import textblob

def lemmatize_noun_phrase(phrase):
    # phrase.lemmatize() is not working correctly, so lemmatize word by word
    return " ".join([textblob.Word(w).lemmatize() for w in phrase.split(" ")])

def custom_tokenizer(text):
    phrases = textblob.TextBlob(text).noun_phrases
    return [lemmatize_noun_phrase(p) for p in phrases]

print(custom_tokenizer("I love green apples"))  # ["green apple"]

count_vectorizer = CountVectorizer(tokenizer=custom_tokenizer)