How to preserve punctuation marks in Scikit-Learn text CountVectorizer or TfidfVectorizer?
(Related: How can I make Scikit-learn TfidfVectorizer not preprocess the text?)
I read some text data from a CSV file and am trying to build TF-IDF feature vectors from it.
The data looks like this:
the content column holds strings in a special format (synset names).
When I build the TF-IDF vectors from it, I want to preserve that format, but when I run
```python
tfidf = TfidfVectorizer()
data['content'] = data['content'].fillna('')
tfidf_matrix = tfidf.fit_transform(data['content'])
```
and look at `tfidf.vocabulary_`, the text has been preprocessed into:
```python
{'square': 3754,
 '01': 0,
 '02': 1,
 'public_square': 3137,
 '04': 3,
 '05': 4,
 '06': 5,
 '07': 6,
 '08': 7,
 '03': 2,
 'feather': 1666,
 'straight': 3821, ...
```
I want it to treat `square.n.01` as a single token instead of splitting it.
I could do this if I built the TF-IDF computation from scratch, but that feels unnecessary. Any help?
You need to write your own tokenization function and pass it in the `tokenizer` parameter of `TfidfVectorizer`:
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame(data=[[['square.n.01', 'square.n.02', 'public_square.n.01']],
                        [['two.n.01', 'deuce.n.04', 'two.s.01']]],
                  columns=['content'])

# Turn each list into its string representation and strip the brackets,
# leaving a comma-separated string of quoted synset names.
df['content'] = df['content'].astype(str)
df['content'] = df['content'].apply(lambda x: x.replace('[', '').replace(']', ''))

# Split only on commas, so each synset name stays a single token.
def my_tokenizer(doc):
    return doc.split(',')

tfidf = TfidfVectorizer(tokenizer=my_tokenizer)
tfidf_matrix = tfidf.fit_transform(df['content'])
print(tfidf.vocabulary_)
```
```python
# Output:
{"'square.n.01'": 4,
 " 'square.n.02'": 2,
 " 'public_square.n.01'": 1,
 "'two.n.01'": 5,
 " 'deuce.n.04'": 0,
 " 'two.s.01'": 3}
```
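Note that with this approach the stray quotes and leading spaces from the list's string representation end up inside the vocabulary keys. An alternative, if the content can be joined into plain whitespace-separated text, is to override `token_pattern` so the default regex tokenizer keeps dots and underscores inside a token. This is a sketch assuming that column layout; the `token_pattern` and `lowercase` parameters are standard `TfidfVectorizer` options:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed layout: each row's content is a whitespace-separated
# string of synset names, as in the question.
df = pd.DataFrame(
    data=[["square.n.01 square.n.02 public_square.n.01"],
          ["two.n.01 deuce.n.04 two.s.01"]],
    columns=["content"],
)

# The pattern matches runs of word characters and dots, so
# 'square.n.01' is kept as one token instead of being split.
tfidf = TfidfVectorizer(token_pattern=r"[\w.]+", lowercase=False)
tfidf_matrix = tfidf.fit_transform(df["content"])
print(sorted(tfidf.vocabulary_))
```

This yields clean vocabulary keys such as `'square.n.01'` without the quoting artifacts, at the cost of assuming the synset names never contain whitespace or commas.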