How to preserve punctuation marks in Scikit-Learn text CountVectorizer or TfidfVectorizer?
(Related: How can I make Scikit-learn TfidfVectorizer not preprocess the text?)
I read some text data from a CSV file and am trying to build TF-IDF feature vectors from it.
The data looks like this:
the content column holds strings in a special format (synset names).
When I build the TF-IDF vectors from it, I want to preserve that format, but when I run
```python
tfidf = TfidfVectorizer()
data['content'] = data['content'].fillna('')
tfidf_matrix = tfidf.fit_transform(data['content'])
```
and look at `tfidf.vocabulary_`, the text has been preprocessed into:
```python
{'square': 3754,
 '01': 0,
 '02': 1,
 'public_square': 3137,
 '04': 3,
 '05': 4,
 '06': 5,
 '07': 6,
 '08': 7,
 '03': 2,
 'feather': 1666,
 'straight': 3821, ...
```
I want it to treat `square.n.01` as a single token instead of splitting it.
I could do this if I built the TF-IDF computation from scratch, but that feels unnecessary. Any help?
You need to write your own tokenization function and pass it in the `tokenizer` parameter of `TfidfVectorizer`:
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame(data=[[['square.n.01', 'square.n.02', 'public_square.n.01']],
                        [['two.n.01', 'deuce.n.04', 'two.s.01']]],
                  columns=['content'])

# Turn each list into its string representation and strip the brackets,
# leaving a comma-separated string of quoted synset names.
df['content'] = df['content'].astype(str)
df['content'] = df['content'].apply(lambda x: x.replace('[', '').replace(']', ''))

# Split only on commas, so each synset name stays a single token.
def my_tokenizer(doc):
    return doc.split(',')

tfidf = TfidfVectorizer(tokenizer=my_tokenizer)
tfidf_matrix = tfidf.fit_transform(df['content'])
print(tfidf.vocabulary_)
```
```python
# Output:
{"'square.n.01'": 4,
 " 'square.n.02'": 2,
 " 'public_square.n.01'": 1,
 "'two.n.01'": 5,
 " 'deuce.n.04'": 0,
 " 'two.s.01'": 3}
```
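Note that with this approach the stray quotes and leading spaces from the list's string representation end up inside the vocabulary keys. An alternative, if the content can be joined into plain whitespace-separated text, is to override `token_pattern` so the default regex tokenizer keeps dots and underscores inside a token. This is a sketch assuming that column layout; the `token_pattern` and `lowercase` parameters are standard `TfidfVectorizer` options:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed layout: each row's content is a whitespace-separated
# string of synset names, as in the question.
df = pd.DataFrame(
    data=[["square.n.01 square.n.02 public_square.n.01"],
          ["two.n.01 deuce.n.04 two.s.01"]],
    columns=["content"],
)

# The pattern matches runs of word characters and dots, so
# 'square.n.01' is kept as one token instead of being split.
tfidf = TfidfVectorizer(token_pattern=r"[\w.]+", lowercase=False)
tfidf_matrix = tfidf.fit_transform(df["content"])
print(sorted(tfidf.vocabulary_))
```

This yields clean vocabulary keys such as `'square.n.01'` without the quoting artifacts, at the cost of assuming the synset names never contain whitespace or commas.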