How do I store a computed value in a new column by iterating over each row of a dataframe in Python?

Question

The dataframe I am working with looks like this:

  vid2               FStart FEnd cap2                                               VDuration  COS  cap1
0 -_aaMGK6GGw_57_61  0      3    A man grabbed a boy from his collar and threw ...  4          2    A man and woman are yelling at a young boy and...
1 -_aaMGK6GGw_57_61  3      4    A lady is waking up a man lying on a chair and...  4          2    A man and woman are yelling at a young boy and...
2 -_hbPLsZvvo_5_8    0      1    A white dog is barking and a caption is writte...  3          2    a dog barking and cooking with her master in t...
  ...                ...    ...  ...                                                ...        ...  ...

I am trying to calculate a similarity score between the two columns cap1 and cap2, and I want to create a new column FSim that stores this score for each row.

The code I have implemented so far is:

#The function that calculates the similarity score
def get_cosine_similarity(feature_vec_1, feature_vec_2):    
    return cosine_similarity(feature_vec_1.reshape(1, -1), feature_vec_2.reshape(1, -1))[0][0]


for i, row in merged.iterrows():
    captions = []
    captions.append(row['cap1'])
    captions.append(row['cap2'])

    for c in range(len(captions)):
        captions[c] = pre_process(captions[c])
        captions[c] = lemmatize_sentence(captions[c])

    feature_vectors = tfidf_vectorizer.transform(captions)

    fsim = get_cosine_similarity(feature_vectors[0], feature_vectors[1])
    merged['fsim'] = fsim

But I am getting the same similarity score stored for every row, like this:

       fsim  
0  0.054464  
1  0.054464  
2  0.054464  
3  0.054464  
4  0.054464

The value is the same for all rows.

How can I store the correct score for each row?

Answer 1

How about this? (I'm assuming the DataFrame you showed first is named merged.)

# Clean and lemmatize a single caption
def preproc_and_lemmatize(x):
  v1 = pre_process(x)
  return lemmatize_sentence(v1)

# Similarity score for one pair of captions
def calc_sim(x, y):
  x2 = preproc_and_lemmatize(x)
  y2 = preproc_and_lemmatize(y)
  feature_vectors = tfidf_vectorizer.transform([x2, y2])
  return get_cosine_similarity(feature_vectors[0], feature_vectors[1])

# One score per row, in the same order as the DataFrame
merged['fsim'] = [
  calc_sim(x, y) for x, y in zip(merged['cap1'], merged['cap2'])
]
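
To try the pattern above without the original helpers, here is a minimal self-contained sketch. The captions are made up, and toy_similarity is a hypothetical stand-in (cosine similarity over raw word counts) for the pre_process, lemmatize_sentence and TF-IDF pipeline in the question; it is not the asker's scoring method.

```python
import math
from collections import Counter

import pandas as pd


def toy_similarity(text_a: str, text_b: str) -> float:
    """Hypothetical stand-in: cosine similarity of raw word counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up captions, only for illustration.
merged = pd.DataFrame({
    "cap1": ["a dog is barking", "a man is yelling", "a cat sleeps"],
    "cap2": ["a dog is barking", "a lady wakes a man", "two birds fly"],
})

# One score per row: the list has the same length and order as the frame.
merged["fsim"] = [
    toy_similarity(a, b) for a, b in zip(merged["cap1"], merged["cap2"])
]

# Equivalent row-wise form using DataFrame.apply with axis=1.
fsim_apply = merged.apply(
    lambda row: toy_similarity(row["cap1"], row["cap2"]), axis=1
)
```

Both forms build the whole column in one assignment, so no row can overwrite another. The zip version is usually the lighter of the two because it does not construct a Series for every row.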

If you would rather change your original code as little as possible, this will also work.

merged["fsim"] = 0.0  # create the column up front as float
for i, row in merged.iterrows():
    captions = []
    captions.append(row['cap1'])
    captions.append(row['cap2'])

    for c in range(len(captions)):
        captions[c] = pre_process(captions[c])
        captions[c] = lemmatize_sentence(captions[c])

    feature_vectors = tfidf_vectorizer.transform(captions)

    fsims = get_cosine_similarity(feature_vectors[0], feature_vectors[1])
    # i is the index label from iterrows(), so write the single cell with .loc
    merged.loc[i, 'fsim'] = fsims
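
As a rough illustration of why the loop in the question ends up with one value in every row, the sketch below uses a made-up frame df with a non-default index. Assigning a scalar to a column broadcasts it to all rows, so each iteration overwrites the whole column and only the last result survives. Writing a single cell with .loc and the label from iterrows() avoids that.

```python
import pandas as pd

# Made-up frame with a non-default index, as often results from a merge or filter.
df = pd.DataFrame({"x": [1, 2, 3]}, index=[10, 20, 30])

# Assigning a scalar to a column broadcasts it to every row,
# so after the loop only the last computed value remains.
for label, row in df.iterrows():
    df["broadcast"] = row["x"] * 10

# Writing to one cell per iteration keeps a separate value for each row.
df["per_row"] = 0.0
for label, row in df.iterrows():
    df.loc[label, "per_row"] = row["x"] * 10.0
```

Note that the first value yielded by iterrows() is the index label, not the row position, so it pairs with .loc rather than .iloc. The two only coincide when the frame has the default 0..n-1 index.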

Solution 1
0 accepted 2020-05-09 06:07:46
