I have a column in a pandas DataFrame that captures each visitor's journey. I want to apply TF-IDF to this text column. Here is the sample data -
import pandas as pd
df = pd.DataFrame({'id': [10, 11, 12],
                   'pagename': ['home:cart:checkout:buy:home', 'home:cart:cart:home', 'home:account:home']})
I now want to apply the tf-idf technique, where my words are separated by a delimiter like ":". When I try the code below, it does not work -
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_model = TfidfVectorizer(lowercase=False, use_idf=True,token_pattern=':')
tf_idf_df = tf_idf_model.fit_transform(df['pagename'])
tf_idf_model.get_feature_names() prints out [':']
How can I achieve tf-idf on this pagename column so that in my output I get columns such as home, cart, account, checkout, buy with their corresponding weights?
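The attempt above returns [':'] because token_pattern is a regular expression describing what a token looks like, not what separates tokens, so ':' extracts the colons themselves. As a hedged sketch (the pattern r'[^:]+' and the variable names are my own assumptions, not part of the original post), matching runs of non-colon characters picks out the page names instead:

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

journeys = ['home:cart:checkout:buy:home', 'home:cart:cart:home', 'home:account:home']

# token_pattern is applied with findall, so ':' extracts the colons themselves.
colon_matches = re.findall(':', journeys[0])

# Matching runs of non-colon characters extracts the page names instead.
page_matches = re.findall(r'[^:]+', journeys[0])

# The same pattern passed to TfidfVectorizer (assumed alternative to a custom tokenizer).
pattern_vect = TfidfVectorizer(lowercase=False, token_pattern=r'[^:]+')
pattern_matrix = pattern_vect.fit_transform(journeys)
pattern_vocab = sorted(pattern_vect.vocabulary_)
```

This keeps everything inside the vectorizer's own options; a custom tokenizer function is the more flexible route if the splitting rule ever gets more complicated than one delimiter.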
I think I figured it out - I had to use a custom tokenizer to solve this problem.
Here is my code that works now -
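from sklearn.feature_extraction.text import TfidfVectorizer

def tokens(x):
    # split a journey such as 'home:cart' into its page names
    return x.split(':')

tfidf_vect = TfidfVectorizer(
    tokenizer=tokens,
    use_idf=True,
    smooth_idf=True,
    min_df=1,  # raise this (e.g. 100) on a large dataset; it cannot exceed the number of documents
    stop_words='english',
    max_features=10,
    sublinear_tf=False,
)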
And then I can compute the transformed data like this -
tf_idf_matrix = pd.DataFrame(
    tfidf_vect.fit_transform(df['pagename']).toarray(),
    columns=tfidf_vect.get_feature_names_out()  # get_feature_names() on scikit-learn older than 1.0
)
Here is the final output -