Calculate IDF (Inverse Document Frequency) on a pandas dataframe
I have a data frame df with three columns as shown below:
DocumentID Words Region
1 ['A','B','C'] ['Canada']
2 ['A','X','D'] ['India', 'USA', 'Canada']
3 ['B','C','X'] ['Canada']
I want to calculate the IDF for each word in the "Words" column, i.e. I want to generate an output that lists each word ('A', 'B', 'C', etc.) with its corresponding IDF value.
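For anyone who wants to reproduce the question, the frame shown above can be rebuilt roughly like this. This is a minimal sketch that assumes Words and Region hold Python lists in each cell, as the printed output suggests:

```python
import pandas as pd

# Rebuild the example frame; each cell of Words and Region is a Python list.
df = pd.DataFrame({
    "DocumentID": [1, 2, 3],
    "Words": [["A", "B", "C"], ["A", "X", "D"], ["B", "C", "X"]],
    "Region": [["Canada"], ["India", "USA", "Canada"], ["Canada"]],
})

print(df)
```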
Here's a slightly less specific version. Assuming you want the simple 1/df definition of IDF, where df is the number of documents containing the word (the more common textbook form is log(N/df) for N documents), you can iterate through each "document" in the Words column and count:
from collections import defaultdict

# Assuming the Words column is represented as you presented it:
words = [['A','B','C'],
         ['A','X','D'],
         ['B','C','X']]

# To store intermediate document counts:
idf = defaultdict(float)
for doc in words:
    for w in set(doc):  # set() so a word repeated in one document counts once
        idf[w] += 1

# Compute IDF as 1/df:
idf = {k: (1/v) for (k, v) in idf.items()}  # <- {'A': 0.5, 'B': 0.5, 'C': 0.5, 'D': 1.0, 'X': 0.5}
vocab = idf.keys()  # Note that the vocab is also accessible now.
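If the logarithmic form is what you are after, the same counting can be done directly on the data frame. The following is only a sketch under a few assumptions: it uses idf = log(N/df) with the natural log and no smoothing, it rebuilds the example frame itself, and the names n_docs, pairs, and doc_freq are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "DocumentID": [1, 2, 3],
    "Words": [["A", "B", "C"], ["A", "X", "D"], ["B", "C", "X"]],
})

n_docs = len(df)

# One row per (document, word) pair; drop_duplicates makes a word that
# is repeated inside a single document count only once for that document.
pairs = df[["DocumentID", "Words"]].explode("Words").drop_duplicates()

# Document frequency: number of documents that contain each word.
doc_freq = pairs["Words"].value_counts()

# Assumed definition: idf = ln(N / df), without smoothing.
idf = np.log(n_docs / doc_freq).sort_index()

print(idf)
```

With this definition, a word that appears in every document would get an IDF of 0, which is why some libraries add smoothing terms. Whether you need that depends on your use case.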
import pandas as pd

# Flatten the list-valued columns into plain lists of items.
list_words = []
list_regions = []
for doc_words in df['Words']:
    for word in doc_words:
        list_words.append(word)
for regions in df['Region']:
    for region in regions:
        list_regions.append(region)

IDF_words = pd.DataFrame([], columns=['words','IDF'])
IDF_regions = pd.DataFrame([], columns=['regions','IDF'])
IDF_words['words'] = sorted(set(list_words))
IDF_regions['regions'] = sorted(set(list_regions))

# Note: the value stored here is each item's share of all occurrences
# (count / total), i.e. a relative frequency rather than an inverse one.
IDF_words['IDF'] = IDF_words['words'].map(lambda x: list_words.count(x)/float(len(list_words)))
IDF_regions['IDF'] = IDF_regions['regions'].map(lambda x: list_regions.count(x)/float(len(list_regions)))
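The same count-over-total numbers can probably be obtained more compactly with pandas itself. This is a sketch that assumes a pandas version with Series.explode (0.25 or later) and rebuilds the example frame. The names word_share and region_share are illustrative, and the result is a relative frequency, not a true IDF:

```python
import pandas as pd

df = pd.DataFrame({
    "DocumentID": [1, 2, 3],
    "Words": [["A", "B", "C"], ["A", "X", "D"], ["B", "C", "X"]],
    "Region": [["Canada"], ["India", "USA", "Canada"], ["Canada"]],
})

# explode() turns each list element into its own row;
# value_counts(normalize=True) then gives count / total occurrences.
word_share = df["Words"].explode().value_counts(normalize=True).sort_index()
region_share = df["Region"].explode().value_counts(normalize=True).sort_index()

print(word_share)
print(region_share)
```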
Hope it helps!
If it does, please upvote or mark the question as answered :)
Peace