Creating a custom CountVectorizer with Scikit-Learn
I want to create a custom CountVectorizer with Python and the Scikit-Learn library. I wrote code that extracts noun phrases from a Pandas DataFrame with the TextBlob library, and I want my vectorizer to count those phrases.
My code:
from textblob import TextBlob
import pandas as pd
my_list = ["I want to buy a nice bike for my girl. She broke her old bike last year.",
"I had a great time watching that movie last night. We shouuld do the same next week",
"Where can I buy some tasty apples and oranges? I want to head healthy food",
"The songs from this bend are boring, lets play some other music from some good bands",
"If you buy this now, you will get 3 different products for free in the next 10 days.",
"I am living in a small house in France, and my wish is to learn how to ski and snowboad",
"It is time to invest in some tech stock. The stock market is will become very hot in the next few months",
"This player won all 4 grand slam tournaments last year. He is the best player in the world!"]
df = pd.DataFrame({"TEXT": my_list})
final_list = []
for text in df.TEXT:
    blob = TextBlob(text)
    result_list = blob.noun_phrases
    print(result_list)
    final_list.extend(result_list)
print(final_list)
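One detail worth noting before reusing these phrases: noun_phrases can return the same phrase for several documents, and CountVectorizer rejects a vocabulary that contains duplicates, so deduplicating final_list first helps. A minimal sketch (the hardcoded list stands in for the extraction result above):

```python
# Stand-in for the final_list produced by TextBlob's noun_phrases above.
final_list = ["nice bike", "old bike", "nice bike", "great time"]

# Deduplicate while preserving first-seen order, so the phrases can later be
# passed to CountVectorizer(vocabulary=...).
vocabulary = list(dict.fromkeys(final_list))
print(vocabulary)  # ['nice bike', 'old bike', 'great time']
```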
I know that I can create a CountVectorizer like this when I'm working with Scikit-Learn:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

features = df.iloc[:, :-1]
results = df.iloc[:, -1]

# vectorizer
transformerVectoriser = ColumnTransformer(transformers=[('vector title', CountVectorizer(analyzer='word', ngram_range=(2, 4), max_features=1000, stop_words='english'), 'TEXT')])

clf = RandomForestClassifier(max_depth=75, n_estimators=125, random_state=42)
pipeline = Pipeline([('transformer', transformerVectoriser),
                     ('classifier', clf)])

cv_score_acc = cross_val_score(pipeline, features, results, cv=5, scoring='accuracy')
But how can I create a vectorizer from the phrases that I extracted previously? For example, the phrases extracted from the text in my_list are:
['nice bike', 'old bike', 'great time', 'tasty apples', 'healthy food', 'good bands', 'different products', 'small house', 'france', 'tech stock', 'stock market', 'grand slam tournaments']
How can I create a custom count vectorizer where the features are the phrases that I listed above?
If you initialize CountVectorizer(vocabulary=noun_phrases, ...) you should get the desired effect:
noun_phrases = ['nice bike', 'old bike', 'great time', 'tasty apples', 'healthy food', 'good bands', 'different products', 'small house', 'france', 'tech stock', 'stock market', 'grand slam tournaments']
cv = CountVectorizer(analyzer='word', vocabulary=noun_phrases, ngram_range=(2, 4))
res = cv.transform(my_list)
res.todense()
>>>
matrix([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
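If the goal is to keep the cross-validated pipeline from the question, the fixed-vocabulary vectorizer can be dropped into the same ColumnTransformer. A sketch, under the assumption that max_features and stop_words are no longer needed once the vocabulary is fixed (the shortened phrase list is for illustration):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

noun_phrases = ['nice bike', 'old bike', 'great time']  # shortened list for illustration

# Same pipeline shape as in the question, but the vectorizer now counts only
# the pre-extracted phrases instead of learning n-grams from the data.
transformerVectoriser = ColumnTransformer(transformers=[
    ('vector title',
     CountVectorizer(analyzer='word', vocabulary=noun_phrases, ngram_range=(2, 4)),
     'TEXT')])

pipeline = Pipeline([('transformer', transformerVectoriser),
                     ('classifier', RandomForestClassifier(max_depth=75, n_estimators=125, random_state=42))])
```

With a fixed vocabulary, fitting the vectorizer learns nothing new, so the feature columns stay stable across cross-validation folds.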
You can customize the tokenizer function of sklearn's CountVectorizer:
def noun_phrases_tokenizer(text):
    return TextBlob(text).noun_phrases
count_vectorizer = CountVectorizer(tokenizer=noun_phrases_tokenizer)
transformerVectoriser = ColumnTransformer(transformers=[('count', count_vectorizer, 'TEXT')])
transformerVectoriser.fit_transform(df)
print(transformerVectoriser.transformers_[0][1].get_feature_names_out())
# ['different products', 'good bands', 'grand slam tournaments', ...]
Update: add lemmatization
import textblob

def lemmatize_noun_phrase(phrase):
    # phrase.lemmatize() is not working correctly, so lemmatize word by word
    return " ".join([textblob.Word(w).lemmatize() for w in phrase.split(" ")])

def custom_tokenizer(text):
    phrases = textblob.TextBlob(text).noun_phrases
    return [lemmatize_noun_phrase(p) for p in phrases]

print(custom_tokenizer("I love green apples"))  # ["green apple"]

count_vectorizer = CountVectorizer(tokenizer=custom_tokenizer)