I have rows of blurbs (in text format) and I want to use tf-idf to define the weight of each word. Below is the code:
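import string

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def remove_punctuations(text):
    # Strip every ASCII punctuation character from the text
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# df is my existing DataFrame with a "blurb" column
df["punc_blurb"] = df["blurb"].apply(remove_punctuations)
df = pd.DataFrame(df["punc_blurb"])
vectoriser = TfidfVectorizer()
x = vectoriser.fit_transform(df["punc_blurb"])  # sparse document-term matrix
df["blurb_Vect"] = list(x.toarray())
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
df_vectoriser = pd.DataFrame(x.toarray(),
                             columns=vectoriser.get_feature_names_out())
print(df_vectoriser)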
All I get is a massive list of numbers, and I am no longer sure whether it is giving me TF or TF-IDF, since the frequent words (the, and, etc.) all have a score greater than 0.
The goal is to see the weights in a tf-idf column, and I am unsure whether I am doing this in the most efficient way.
You don't need a punctuation remover if you use TfidfVectorizer. It takes care of punctuation automatically, by virtue of its default token_pattern parameter:
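import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({"blurb": ["this is a sentence", "this is, well, another one"]})
# This token_pattern is the default; it is spelled out to show what drops the punctuation
vectorizer = TfidfVectorizer(token_pattern='(?u)\\b\\w\\w+\\b')
df["tf_idf"] = list(vectorizer.fit_transform(df["blurb"].values.astype("U")).toarray())
# The feature columns follow the sorted vocabulary order
vocab = sorted(vectorizer.vocabulary_.keys())
# Keep only the terms with a nonzero weight in each row
df["tf_idf_dic"] = df["tf_idf"].apply(lambda x: {k: v for k, v in dict(zip(vocab, x)).items() if v != 0})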
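On the worry that frequent words still score above 0: that is expected for TF-IDF as scikit-learn computes it, and it does not mean you are getting plain TF. A minimal sketch, assuming the default settings (smooth_idf=True, norm="l2"), where the smoothed IDF is ln((1 + n) / (1 + df)) + 1 and therefore never drops below 1. The names docs, manual_idf and idf_by_term are only illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["this is a sentence", "this is, well, another one"]

vectorizer = TfidfVectorizer()  # defaults: smooth_idf=True, norm="l2"
weights = vectorizer.fit_transform(docs).toarray()

# Sort the vocabulary by column index to get the terms in column order
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Recompute the smoothed IDF by hand: ln((1 + n_docs) / (1 + doc_freq)) + 1
n_docs = len(docs)
doc_freq = (weights > 0).sum(axis=0)  # number of documents containing each term
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Compare against the IDF values the fitted vectorizer stores
idf_by_term = dict(zip(terms, vectorizer.idf_))
for term, idf in idf_by_term.items():
    print(f"{term:10s} idf={idf:.3f}")
```

A term that appears in every document ends up with the minimum IDF of 1, so its weight is the smallest in the row but not zero. If you want such words removed entirely, the stop_words or max_df parameters of TfidfVectorizer are the usual route.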