How to store a calculated value in a new column by iterating through each row in a dataframe in Python?
The dataframe I am working with looks like this:
vid2 FStart FEnd cap2 VDuration COS cap1
0 -_aaMGK6GGw_57_61 0 3 A man grabbed a boy from his collar and threw ... 4 2 A man and woman are yelling at a young boy and...
1 -_aaMGK6GGw_57_61 3 4 A lady is waking up a man lying on a chair and... 4 2 A man and woman are yelling at a young boy and...
2 -_hbPLsZvvo_5_8 0 1 A white dog is barking and a caption is writte... 3 2 a dog barking and cooking with her master in t...
... ... ... ... ... ... ...
I am trying to calculate a similarity score between the two columns cap1 and cap2. I want to create a new column, fsim, that stores this similarity score for each row.
The code I have implemented so far is:
# The function that calculates the similarity score
def get_cosine_similarity(feature_vec_1, feature_vec_2):
    return cosine_similarity(feature_vec_1.reshape(1, -1), feature_vec_2.reshape(1, -1))[0][0]

for i, row in merged.iterrows():
    captions = []
    captions.append(row['cap1'])
    captions.append(row['cap2'])
    for c in range(len(captions)):
        captions[c] = pre_process(captions[c])
        captions[c] = lemmatize_sentence(captions[c])
    feature_vectors = tfidf_vectorizer.transform(captions)
    fsims = get_cosine_similarity(feature_vectors[0], feature_vectors[1])
    merged['fsim'] = fsims
But I am getting the same similarity score stored for every row, like this:
fsim
0 0.054464
1 0.054464
2 0.054464
3 0.054464
4 0.054464
The value is the same for all the rows.
How do I store the correct score for each row?
How about this? (I'm assuming the DataFrame you showed first is named merged.)
def preproc_and_lemmatize(x):
    v1 = pre_process(x)
    return lemmatize_sentence(v1)

def calc_sim(x, y):
    x2 = preproc_and_lemmatize(x)
    y2 = preproc_and_lemmatize(y)
    # tfidf_vectorizer is the already-fitted vectorizer from the question
    feature_vectors = tfidf_vectorizer.transform([x2, y2])
    return get_cosine_similarity(feature_vectors[0], feature_vectors[1])

# Build one score per row, then assign the whole list as the new column
merged['fsim'] = [
    calc_sim(x, y) for x, y in zip(merged['cap1'], merged['cap2'])
]
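To try the pattern above without the original helpers, here is a minimal sketch. Note that toy_similarity is a hypothetical stand-in (Jaccard word overlap) for pre_process, lemmatize_sentence and the TF-IDF cosine similarity, and the small DataFrame is made up for illustration; only the zip-and-assign pattern is taken from the answer.

```python
import pandas as pd


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for the real score: Jaccard overlap of words."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Made-up data with the same two caption columns as in the question
merged = pd.DataFrame({
    "cap1": ["a dog is barking", "a man is yelling", "a cat sleeps"],
    "cap2": ["a dog is barking", "a woman is yelling", "two birds fly"],
})

# One score per row, computed pairwise and assigned as a full-length list
merged["fsim"] = [
    toy_similarity(x, y) for x, y in zip(merged["cap1"], merged["cap2"])
]
print(merged)
```

Because the list has exactly one entry per row, each row gets its own score instead of a single value repeated down the column.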
If you prefer to change less of your code, this will work.
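# Create the column first (as float, since the scores are floats)
merged["fsim"] = 0.0
for i, row in merged.iterrows():
    captions = []
    captions.append(row['cap1'])
    captions.append(row['cap2'])
    for c in range(len(captions)):
        captions[c] = pre_process(captions[c])
        captions[c] = lemmatize_sentence(captions[c])
    feature_vectors = tfidf_vectorizer.transform(captions)
    fsims = get_cosine_similarity(feature_vectors[0], feature_vectors[1])
    # i is the row's index label, so write the single cell with .at;
    # chained assignment (merged['fsim'].iloc[i] = ...) may not update merged
    merged.at[i, 'fsim'] = fsims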
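The reason the original loop stored one value everywhere is that assigning a scalar to a column sets every row of that column, so each pass overwrites the previous one and only the last row's score survives. The following sketch shows the difference on made-up data; the DataFrame and the column names whole_column and per_row are hypothetical and chosen only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])

# Scalar assignment inside the loop: every pass rewrites the whole column,
# so after the loop all rows hold the value computed for the last row.
for label, row in df.iterrows():
    df["whole_column"] = row["x"] * 10

# Cell-wise assignment: .at[label, column] writes only the current row.
df["per_row"] = 0.0
for label, row in df.iterrows():
    df.at[label, "per_row"] = row["x"] * 10

print(df)
```

Here whole_column ends up with the same number in every row, which matches the symptom in the question, while per_row holds a distinct value for each row.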