
How can I make scikit-learn's TfidfVectorizer not preprocess the text?

I am reading some text data from a CSV file and trying to build a TF-IDF feature vector from it.

The data looks something like:

[screenshot of the sample data: a DataFrame with a content column of synset strings]

where the content column contains specially formatted strings (WordNet synset names such as square.n.01).

When I build a TF-IDF vector from it, I expect that format to be preserved, but when I do

tfidf = TfidfVectorizer()
data['content'] = data['content'].fillna('')
tfidf_matrix = tfidf.fit_transform(data['content'])

and look at tfidf.vocabulary_, the text has been preprocessed as:

{'square': 3754,
 '01': 0,
 '02': 1,
 'public_square': 3137,
 '04': 3,
 '05': 4,
 '06': 5,
 '07': 6,
 '08': 7,
 '03': 2,
 'feather': 1666,
 'straight': 3821,...

I want it to count square.n.01 as a single token instead of splitting it up.

I could do this by building the TF-IDF features from scratch, but that feels unnecessary. Any help?

You need to write your own tokenization function and pass it via the tokenizer parameter of TfidfVectorizer. By default, TfidfVectorizer extracts tokens with the regex token_pattern=r"(?u)\b\w\w+\b", which treats '.' as a separator and drops single-character pieces, so square.n.01 becomes square and 01 while the n disappears.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame(data=[[['square.n.01', 'square.n.02', 'public_square.n.01']],
                        [['two.n.01', 'deuce.n.04', 'two.s.01']]],
                  columns=['content'])

# Flatten each list of synsets into one comma-separated string.
# (Using str(list) and stripping brackets would leave stray quotes
# and spaces inside the tokens.)
df['content'] = df['content'].apply(lambda synsets: ','.join(synsets))

def my_tokenizer(doc):
    # Split on commas only, so each synset name stays one token.
    return doc.split(',')

tfidf = TfidfVectorizer(tokenizer=my_tokenizer)
tfidf_matrix = tfidf.fit_transform(df['content'])

print(tfidf.vocabulary_)
# output:
{'square.n.01': 2, 'square.n.02': 3, 'public_square.n.01': 1, 'two.n.01': 4, 'deuce.n.04': 0, 'two.s.01': 5}

Each synset now comes out as one clean token.
