I am reading some text data from a CSV file and trying to build a TF-IDF feature vector from it.
The data looks something like this, where the content column contains specially formatted strings (synset names such as square.n.01).
When I build a TF-IDF vector from that, I expect the synset format to be preserved, but when I run
tfidf = TfidfVectorizer()
data['content'] = data['content'].fillna('')
tfidf_matrix = tfidf.fit_transform(data['content'])
and then look at tfidf.vocabulary_, the synsets have been split apart by the default tokenizer:
{'square': 3754,
'01': 0,
'02': 1,
'public_square': 3137,
'04': 3,
'05': 4,
'06': 5,
'07': 6,
'08': 7,
'03': 2,
'feather': 1666,
'straight': 3821,...
I want square.n.01 to be counted as a single token instead of being split up. I could do this by building TF-IDF from scratch, but that seems unnecessary. Any help?
You need to write your own tokenization function and pass it via the tokenizer parameter of TfidfVectorizer:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame(data=[[['square.n.01', 'square.n.02', 'public_square.n.01']],
                        [['two.n.01', 'deuce.n.04', 'two.s.01']]],
                  columns=['content'])

# a CSV round-trip stores each list as a string like "['square.n.01', ...]",
# so simulate that and strip the brackets
df['content'] = df['content'].astype(str)
df['content'] = df['content'].apply(lambda x: x.replace('[', '').replace(']', ''))

def my_tokenizer(doc):
    # split on commas only, then strip quotes and whitespace
    # so each synset name survives as a single token
    return [tok.strip().strip("'") for tok in doc.split(',')]

tfidf = TfidfVectorizer(tokenizer=my_tokenizer)
tfidf_matrix = tfidf.fit_transform(df['content'])
print(tfidf.vocabulary_)
# o/p
# {'square.n.01': 2,
#  'square.n.02': 3,
#  'public_square.n.01': 1,
#  'two.n.01': 4,
#  'deuce.n.04': 0,
#  'two.s.01': 5}
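Alternatively, if the content column is just whitespace-separated synset names (an assumption about how your CSV is laid out), you can skip the custom tokenizer entirely and pass a token_pattern that treats any run of non-whitespace as one token, so the dots and underscores stay inside it:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical documents: whitespace-separated synset names
docs = ['square.n.01 square.n.02 public_square.n.01',
        'two.n.01 deuce.n.04 two.s.01']

# \S+ matches any run of non-whitespace characters, so each
# synset name is kept intact as a single vocabulary entry
tfidf = TfidfVectorizer(token_pattern=r'\S+')
tfidf_matrix = tfidf.fit_transform(docs)
print(tfidf.vocabulary_)
```

This avoids the string-cleanup step, but only works if your delimiter really is whitespace; with comma-separated data you still need the tokenizer approach above.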