
How to apply TF-IDF on pandas column based on colon delimiter in text data

I have a column in a pandas DataFrame that captures each visitor's journey. I want to apply TF-IDF to this text column. Here is some sample data:

import pandas as pd

df = pd.DataFrame({'id': [10, 11, 12]
                   , 'pagename': ['home:cart:checkout:buy:home', 'home:cart:cart:home', 'home:account:home']})

This is what df looks like: [image: enter image description here]

I now want to apply TF-IDF, treating : as the delimiter between words. The code below does not work:

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_model =  TfidfVectorizer(lowercase=False, use_idf=True,token_pattern=':')
tf_idf_df = tf_idf_model.fit_transform(df['pagename'])

tf_idf_model.get_feature_names() prints out [':']
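The [':'] result is consistent with token_pattern being a regular expression that describes the tokens themselves rather than the separator between them (as far as I understand, scikit-learn applies it with findall). A minimal sketch using only the standard re module, on one of the sample journeys:

```python
import re

journey = 'home:cart:checkout:buy:home'

# The pattern ':' matches only the colons, so the colons become the "tokens"
colon_matches = re.findall(r':', journey)
print(colon_matches)

# "One or more characters that are not a colon" matches the page names instead
page_matches = re.findall(r'[^:]+', journey)
print(page_matches)
```

The first call returns only colons, which is why ':' ends up as the single feature; the second returns the page names.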

How can I apply TF-IDF to this pagename column so that the output has columns such as home, cart, account, checkout, buy with their corresponding weights?

I think I figured it out: I had to use a custom tokenizer to solve this problem.

Here is the code that works now:

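from sklearn.feature_extraction.text import TfidfVectorizer

# Split each journey on ':' so that every page name becomes one token
def tokens(x):
    return x.split(':')

# Note: min_df = 100 only makes sense on a large dataset. On the 3-row sample
# above it raises a ValueError (min_df exceeds the number of documents), so use 1 there.
tfidf_vect = TfidfVectorizer(tokenizer=tokens
                             , use_idf=True
                             , smooth_idf=True
                             , min_df=1
                             , stop_words='english'
                             , max_features=10
                             , sublinear_tf=False)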
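As an alternative to a custom tokenizer (not what the code above does, just a sketch), token_pattern can be given a regular expression that matches the page names rather than the delimiter. The names regex_vect, regex_matrix and features are made up for this example, and the sample DataFrame is rebuilt so the block runs on its own:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'id': [10, 11, 12],
                   'pagename': ['home:cart:checkout:buy:home',
                                'home:cart:cart:home',
                                'home:account:home']})

# token_pattern describes the tokens, so "anything except a colon" yields page names
regex_vect = TfidfVectorizer(lowercase=False, token_pattern=r'[^:]+')
regex_matrix = regex_vect.fit_transform(df['pagename'])

# vocabulary_ maps each token to its column index; sorting the keys gives column order
features = sorted(regex_vect.vocabulary_)
print(features)
print(regex_matrix.shape)
```

This should give one column per distinct page name. A custom tokenizer is still the more flexible choice if the splitting rule is more involved than a single delimiter.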

I can then compute the transformed data like this:

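# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (1.0+)
tf_idf_matrix = pd.DataFrame(
    tfidf_vect.fit_transform(df['pagename']).toarray(),
    columns=tfidf_vect.get_feature_names_out()
)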

Here is the final output: [image: enter image description here]
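To make the weights easier to interpret, here is a rough sketch that recomputes them by hand for the three sample journeys and compares them with scikit-learn. It assumes the documented defaults (smooth_idf=True, norm='l2'), i.e. idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row. The helper split_on_colon and the other variable names are made up for this example:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

journeys = ['home:cart:checkout:buy:home',
            'home:cart:cart:home',
            'home:account:home']

def split_on_colon(text):
    # Hypothetical helper: one token per page name
    return text.split(':')

vect = TfidfVectorizer(tokenizer=split_on_colon, lowercase=False)
sk_weights = vect.fit_transform(journeys).toarray()

# Terms ordered by their column index in the scikit-learn matrix
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

# Raw term counts per journey
counts = np.array([[split_on_colon(j).count(t) for t in terms] for j in journeys],
                  dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # journeys containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed idf
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)  # L2-normalise each row

print(pd.DataFrame(manual, columns=terms).round(3))
print(np.allclose(manual, sk_weights))
```

For the third journey (home:account:home) this gives roughly 0.763 for home and 0.646 for account: home appears twice but occurs in every journey, so its idf is the minimum value of 1.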

