Calculate IDF (Inverse Document Frequency) on a pandas dataframe
I have a dataframe df with three columns, as shown below:
DocumentID  Words            Region
1           ['A','B','C']    ['Canada']
2           ['A','X','D']    ['India', 'USA', 'Canada']
3           ['B','C','X']    ['Canada']
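I want to compute the IDF for every word in the Words column, i.e. I want to produce an output that lists each word (such as 'A', 'B', 'C', etc.) together with its corresponding IDF value.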
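Here is a more generic version that does not rely on pandas. Assuming you want the simple 1/df definition of IDF, where df is the number of documents containing the word (the more common textbook definition is log(N/df), with N the total number of documents), you can iterate over each "document" in the Words column and count: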
from collections import defaultdict

# Assuming the Words column is represented as you presented it:
words = [['A', 'B', 'C'],
         ['A', 'X', 'D'],
         ['B', 'C', 'X']]

# Document frequency: the number of documents each word appears in.
doc_freq = defaultdict(int)
for doc in words:
    # set(doc) ensures a word repeated within one document is counted once.
    for w in set(doc):
        doc_freq[w] += 1

# Compute IDF as 1/df:
idf = {k: 1.0 / v for k, v in doc_freq.items()}  # <- {'A': 0.5, 'B': 0.5, 'C': 0.5, 'D': 1.0, 'X': 0.5}
vocab = idf.keys()  # Note that the vocab is also accessible now.
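If the logarithmic form log(N/df) is wanted instead of 1/df, the same counting loop can be adapted. The following is only a minimal sketch under that assumption; the names `n_docs`, `doc_freq` and `idf_log` are illustrative and not from the original answer.

```python
import math
from collections import Counter

# Same toy "documents" as in the answer above.
words = [['A', 'B', 'C'],
         ['A', 'X', 'D'],
         ['B', 'C', 'X']]

n_docs = len(words)  # N: total number of documents

# Document frequency: count each word at most once per document.
doc_freq = Counter()
for doc in words:
    doc_freq.update(set(doc))

# IDF as log(N / df), using the natural logarithm.
idf_log = {w: math.log(n_docs / count) for w, count in doc_freq.items()}
```

With this definition a word that occurs in every document gets an IDF of 0, and rarer words get larger values.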
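import pandas as pd

# Works directly on the dataframe df from the question.
# Flatten the list-valued columns into plain Python lists.
list_words = []
list_regions = []
for words in df['Words']:
    for word in words:
        list_words.append(word)
for regions in df['Region']:
    for region in regions:
        list_regions.append(region)

# One row per distinct word / region.
IDF_words = pd.DataFrame([], columns=['words', 'IDF'])
IDF_regions = pd.DataFrame([], columns=['regions', 'IDF'])
IDF_words['words'] = sorted(set(list_words))
IDF_regions['regions'] = sorted(set(list_regions))

# Note: the value stored in the 'IDF' column is the relative frequency
# (occurrences / total number of items), not an inverse document frequency.
IDF_words['IDF'] = IDF_words['words'].map(lambda x: list_words.count(x) / float(len(list_words)))
IDF_regions['IDF'] = IDF_regions['regions'].map(lambda x: list_regions.count(x) / float(len(list_regions)))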
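For comparison, here is a rough pandas-only sketch that computes log(N/df) per item of a list-valued column. It assumes a pandas version that provides `Series.explode` (0.25 or later); the helper name `idf_table` is hypothetical and not part of pandas or of the answers above, and the dataframe is rebuilt from the question's sample data.

```python
import numpy as np
import pandas as pd

def idf_table(frame, column):
    """Hypothetical helper: log(N / df) for each distinct item of a list-valued column."""
    n_docs = len(frame)
    # explode() yields one row per (document, item) and repeats the row index,
    # so dropping duplicate (index, item) pairs counts each item once per document.
    per_doc = frame[column].explode().reset_index().drop_duplicates()
    doc_freq = per_doc[column].value_counts()
    idf = np.log(n_docs / doc_freq)
    return idf.rename("IDF").rename_axis(column).sort_index().reset_index()

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "DocumentID": [1, 2, 3],
    "Words": [["A", "B", "C"], ["A", "X", "D"], ["B", "C", "X"]],
    "Region": [["Canada"], ["India", "USA", "Canada"], ["Canada"]],
})

idf_words = idf_table(df, "Words")
idf_regions = idf_table(df, "Region")
```

Each result is a two-column dataframe (the item and its IDF). Since 'Canada' appears in all three rows, its value under this definition is 0.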
Hope this helps, bro!
If it does, please upvote / accept the answer :)
Peace