
How to apply tf-idf to rows of text

I have rows of blurbs (in text format) and I want to use tf-idf to define the weight of each word. Below is the code:

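import string

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def remove_punctuations(text):
    # Strip every ASCII punctuation character from the text
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text
df["punc_blurb"] = df["blurb"].apply(remove_punctuations)

df = pd.DataFrame(df["punc_blurb"])

vectoriser = TfidfVectorizer()
# Sparse matrix with one row per blurb and one column per word
x = vectoriser.fit_transform(df["punc_blurb"])
df["blurb_Vect"] = list(x.toarray())

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df_vectoriser = pd.DataFrame(x.toarray(),
                             columns=vectoriser.get_feature_names_out())
print(df_vectoriser)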

All I get is a massive list of numbers, and I am no longer sure whether it is giving me TF or TF-IDF, because the frequent words (the, and, etc.) all have a score greater than 0.

The goal is to see the weights in the tf-idf column shown below and I am unsure if I am doing this in the most efficient way:

[Image: goal output table]

You don't need a punctuation remover if you use TfidfVectorizer. It takes care of punctuation automatically, by virtue of its default token_pattern parameter:

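import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({"blurb":["this is a sentence", "this is, well, another one"]})
# token_pattern is spelled out for clarity; this is already the default value
vectorizer = TfidfVectorizer(token_pattern='(?u)\\b\\w\\w+\\b')
# One dense tf-idf vector per row
df["tf_idf"] = list(vectorizer.fit_transform(df["blurb"].values.astype("U")).toarray())
# Matrix columns follow the sorted vocabulary
vocab = sorted(vectorizer.vocabulary_.keys())
# Keep only the non-zero weights, keyed by word
df["tf_idf_dic"] = df["tf_idf"].apply(lambda x: {k:v for k,v in dict(zip(vocab,x)).items() if v!=0})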


 