
Iterate through dataframe rows to match word in list

My goal is to measure similarities between the rows of a dataframe and a list of words. My code looks like this:

import pandas as pd
import distance
import numpy as np
df = pd.DataFrame({'col': ['apps','orange juice','citrs']})
li = ['apple','orange','citrus']
df['SIM'] = np.nan
df['SIM_COL'] = np.nan
for row in df.iterrows():
    row_data = row[1].tolist()
    for l in li:
        if distance.jaccard(row_data[0],l) < 0.5:
            df.loc[[df[df['col']==row_data[0]].index.values[0]],'SIM']= distance.jaccard(row_data[0],l)
            df.loc[[df[df['col']==row_data[0]].index.values[0]],'SIM_COL']= l
            break

And this is my output:

            col       SIM SIM_COL
0          apps       NaN     NaN
1  orange juice  0.454545  orange
2         citrs  0.166667  citrus

This is fine when I set the distance condition to < 0.5. If I change it to < 1, my output becomes:

            col       SIM SIM_COL
0          apps  0.600000   apple
1  orange juice  0.846154   apple
2         citrs  0.900000  orange

Now it gives me the wrong result for orange and citrus. How can I make it so that only the lowest distances are considered?

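The result is right. With the condition set to < 1, the loop breaks at the first word whose distance is below 1, and for 'orange juice' that word is 'apple'. See: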

print(distance.jaccard('orange juice', 'apple'))

# 0.846154
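To see where that number comes from, here is a small sketch that works it out by hand. It assumes distance.jaccard compares the sets of characters in the two strings, which is consistent with every value in the outputs above; the variable names are only for illustration.

```python
# Assumption: the Jaccard distance here is computed on sets of characters.
a, b = set('orange juice'), set('apple')

shared = a & b      # characters that appear in both strings
combined = a | b    # characters that appear in either string

# Jaccard distance = 1 - |intersection| / |union|
dist = 1 - len(shared) / len(combined)

print(sorted(shared), len(combined), round(dist, 6))
# ['a', 'e'] 13 0.846154
```

Only 2 of the 13 distinct characters are shared, so the distance is 1 - 2/13, which is below 1 and therefore passes the relaxed condition.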

"How can I make it so that only the lowest distances are considered?"

I would use an extra variable min_dist to record the lowest distance found so far for the current row. Update df['SIM'] and df['SIM_COL'] only if the new distance is smaller than the current lowest distance, and do not break out of the loop, so that every word in the list is checked.

for idx, row in df.iterrows():
    min_dist = 999  # Init with a big value

    for l in li:
        dist = distance.jaccard(row['col'], l)
        # Keep this word only if it beats the best distance seen so far
        if dist < 1 and dist < min_dist:
            min_dist = dist
            # Write by index label, so duplicate values in 'col' are handled correctly
            df.loc[idx, 'SIM'] = dist
            df.loc[idx, 'SIM_COL'] = l
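The same "keep only the lowest distance" idea can also be written with min() instead of a running min_dist variable. This is a minimal sketch, not the code from the answer: jaccard_distance and closest_word are hypothetical helpers, and jaccard_distance is a plain-Python stand-in that assumes distance.jaccard works on sets of characters. It also assumes the candidate list is not empty.

```python
def jaccard_distance(a, b):
    """Stand-in for distance.jaccard (assumed to compare character sets)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty strings are treated as identical
    return 1 - len(set_a & set_b) / len(union)


def closest_word(text, candidates, max_dist=1.0):
    """Return (distance, word) for the nearest candidate.

    Returns (None, None) when no candidate is strictly closer than max_dist.
    Ties on distance are broken by the alphabetically first word.
    """
    dist, word = min((jaccard_distance(text, c), c) for c in candidates)
    return (dist, word) if dist < max_dist else (None, None)


rows = ['apps', 'orange juice', 'citrs']
li = ['apple', 'orange', 'citrus']

matches = [closest_word(r, li) for r in rows]
for r, (dist, word) in zip(rows, matches):
    print(r, dist, word)
```

With the data from the question this picks apple, orange and citrus for the three rows. If the values are needed back in the dataframe, the two parts of matches could be assigned as new columns, for example df['SIM'] = [d for d, _ in matches] and df['SIM_COL'] = [w for _, w in matches].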
