
Creating custom Count Vectorizer with Scikit-Learn

I want to create a custom CountVectorizer with Python and the scikit-learn library. I wrote code that extracts phrases from a Pandas dataframe with the TextBlob library, and I want my vectorizer to count those phrases.

My code:

from textblob import TextBlob
import pandas as pd

my_list = ["I want to buy a nice bike for my girl. She broke her old bike last year.", 
        "I had a great time watching that movie last night. We shouuld do the same next week", 
        "Where can I buy some tasty apples and oranges? I want to head healthy food", 
        "The songs from this bend are boring, lets play some other music from some good bands", 
        "If you buy this now, you will get 3 different products for free in the next 10 days.", 
        "I am living in a small house in France, and my wish is to learn how to ski and snowboad",
        "It is time to invest in some tech stock. The stock market is will become very hot in the next few months",
        "This player won all 4 grand slam tournaments last year. He is the best player in the world!"]

df = pd.DataFrame({"TEXT": my_list})

final_list = []
for text in df.TEXT:
    # extract noun phrases from each document with TextBlob
    blob = TextBlob(text)
    result_list = blob.noun_phrases
    print(result_list)
    final_list.extend(result_list)

print(final_list)

I know that I can create a CountVectorizer like this when I'm working with scikit-learn:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

features = df.iloc[:, :-1]
results = df.iloc[:, -1]

# vectorizer
transformerVectoriser = ColumnTransformer(
    transformers=[('vector title',
                   CountVectorizer(analyzer='word', ngram_range=(2, 4),
                                   max_features=1000, stop_words='english'),
                   'TEXT')])

clf = RandomForestClassifier(max_depth=75, n_estimators=125, random_state=42)
pipeline = Pipeline([('transformer', transformerVectoriser),
                     ('classifier', clf)])

cv_score_acc = cross_val_score(pipeline, features, results, cv=5, scoring='accuracy')

But how can I create a vectorizer from the phrases that I extracted previously? For example, the phrases extracted from the text in my_list are:

['nice bike', 'old bike', 'great time', 'tasty apples', 'healthy food', 'good bands', 'different products', 'small house', 'france', 'tech stock', 'stock market', 'grand slam tournaments']

How can I create a custom count vectorizer where the features are the phrases that I listed above?

If you initialize CountVectorizer(vocabulary=noun_phrases, ...) you should get the desired effect:

from sklearn.feature_extraction.text import CountVectorizer

noun_phrases = ['nice bike', 'old bike', 'great time', 'tasty apples', 'healthy food', 'good bands', 'different products', 'small house', 'france', 'tech stock', 'stock market', 'grand slam tournaments']

cv = CountVectorizer(analyzer='word', vocabulary=noun_phrases, ngram_range=(2, 4))
res = cv.transform(my_list)  # no fit needed: the vocabulary is fixed
res.todense()

>>>
matrix([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
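
If you don't want to hard-code the list, a minimal sketch (assuming you reuse the final_list built by the TextBlob loop in the question) is to deduplicate the extracted phrases and derive the n-gram range from the longest phrase. Note that with ngram_range=(2, 4) the single-word phrase 'france' can never be matched, which is why its column stays all-zero in the output above:

from sklearn.feature_extraction.text import CountVectorizer

# Deduplicate the phrases gathered by the TextBlob loop;
# CountVectorizer rejects vocabularies with repeated entries
vocab = sorted(set(final_list))

# Start the n-gram range at 1 so single-word phrases such as
# 'france' are still matched, and stretch it to the longest phrase
longest = max(len(phrase.split()) for phrase in vocab)
cv = CountVectorizer(vocabulary=vocab, ngram_range=(1, longest))

res = cv.transform(df.TEXT)  # fitting is unnecessary with a fixed vocabulary
print(cv.get_feature_names_out())
print(res.toarray())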

You can customize the tokenizer function of the sklearn CountVectorizer:

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob

def noun_phrases_tokenizer(text):
    # use TextBlob's noun phrase extraction as the tokenizer
    return TextBlob(text).noun_phrases

count_vectorizer = CountVectorizer(tokenizer=noun_phrases_tokenizer)
transformerVectoriser = ColumnTransformer(transformers=[('count', count_vectorizer, 'TEXT')])
transformerVectoriser.fit_transform(df)

print(transformerVectoriser.transformers_[0][1].get_feature_names_out())
# ['different products', 'good bands', 'grand slam tournaments', ...]
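
To plug this into the cross-validation setup from the question, a sketch along these lines should work (the 'LABEL' column here is hypothetical, standing in for whatever target column your real dataframe has):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# 'LABEL' is a hypothetical target column; replace it with your own
pipeline = Pipeline([
    ('transformer', transformerVectoriser),
    ('classifier', RandomForestClassifier(max_depth=75, n_estimators=125,
                                          random_state=42)),
])
scores = cross_val_score(pipeline, df[['TEXT']], df['LABEL'], cv=5,
                         scoring='accuracy')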

Update: add lemmatization

import textblob

def lemmatize_noun_phrase(phrase):
    # phrase.lemmatize() does not work correctly here, so lemmatize word by word
    return " ".join([textblob.Word(w).lemmatize() for w in phrase.split(" ")])

def custom_tokenizer(text):
    phrases = textblob.TextBlob(text).noun_phrases
    return [lemmatize_noun_phrase(p) for p in phrases]

print(custom_tokenizer("I love green apples"))  # ["green apple"]
count_vectorizer = CountVectorizer(tokenizer=custom_tokenizer)
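
As a quick sanity check (a sketch; TextBlob needs its NLTK corpora downloaded), fitting the updated vectorizer on the sentences from the question should collapse plural noun phrases to their singular forms:

# refit on the original sentences; a plural phrase such as
# 'tasty apples' should now surface as 'tasty apple'
count_vectorizer.fit(my_list)
print(count_vectorizer.get_feature_names_out())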
