Iterate through dataframe rows to match word in list
My goal is to measure the similarity between the rows of a dataframe and a list of words. My code looks like this:
import pandas as pd
import distance
import numpy as np
df = pd.DataFrame({'col': ['apps','orange juice','citrs']})
li = ['apple','orange','citrus']
df['SIM'] = np.nan
df['SIM_COL'] = np.nan
for row in df.iterrows():
    row_data = row[1].tolist()
    for l in li:
        if distance.jaccard(row_data[0],l) < 0.5:
            df.loc[[df[df['col']==row_data[0]].index.values[0]],'SIM']= distance.jaccard(row_data[0],l)
            df.loc[[df[df['col']==row_data[0]].index.values[0]],'SIM_COL']= l
            break
And this is my output:
            col       SIM SIM_COL
0          apps       NaN     NaN
1  orange juice  0.454545  orange
2         citrs  0.166667  citrus
This is fine when I set the distance condition to < 0.5. If I change it to 1, my output becomes:
            col       SIM SIM_COL
0          apps  0.600000   apple
1  orange juice  0.846154   apple
2         citrs  0.900000  orange
Now it gives me the wrong result for orange and citrus. How can I make it so only the lowest distances are considered?
The result is correct. See:
print(distance.jaccard('orange juice', 'apple'))
# 0.846154
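To see where that value comes from, here is a minimal pure-Python sketch of a Jaccard distance over character sets. The helper name char_jaccard is hypothetical, and treating each string as a set of characters is an assumption rather than the distance library's own code, although it does reproduce the numbers shown in the question.

```python
def char_jaccard(a, b):
    """Jaccard distance between the character sets of two strings.

    Hypothetical helper for illustration; assumes each string is
    compared as a set of its characters.
    """
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty strings: treat them as identical.
        return 0.0
    return 1 - len(set_a & set_b) / len(union)


# 'orange juice' has 11 distinct characters, 'apple' has 4, and they share
# only 'a' and 'e', so the distance is 1 - 2/13.
print(round(char_jaccard('orange juice', 'apple'), 6))
# 'orange' shares all 6 of its characters with 'orange juice': 1 - 6/11.
print(round(char_jaccard('orange juice', 'orange'), 6))
```

So 'apple' really is within a distance of 1 of 'orange juice'. Because the loop stops at the first word under the threshold, it never gets to compare the closer word 'orange'.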
How can I make it so only the lowest distances are considered?
I would use an extra variable min_dist to record the lowest distance seen so far for each row. Update df['SIM'] and df['SIM_COL'] only if the new distance is smaller than the current lowest distance.
for row in df.iterrows():
    row_data = row[1].tolist()
    min_dist = 999 # Init with a big value
    for l in li:
        dist = distance.jaccard(row_data[0], l)
        if dist < 1 and dist < min_dist:
            min_dist = dist
            df.loc[[df[df['col']==row_data[0]].index.values[0]],'SIM'] = dist
            df.loc[[df[df['col']==row_data[0]].index.values[0]],'SIM_COL']= l
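The same "keep the smallest distance" idea can also be written with Python's built-in min. The sketch below is an alternative to the loop above, not a replacement for it. char_jaccard and best_match are hypothetical helpers, and the Jaccard distance is reimplemented over character sets (an assumption) so the snippet runs without the distance package or pandas.

```python
def char_jaccard(a, b):
    """Jaccard distance between the character sets of two strings (hypothetical helper)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1 - len(set_a & set_b) / len(union)


def best_match(text, candidates, max_dist=1.0):
    """Return (candidate, distance) for the closest candidate.

    Returns (None, None) when there are no candidates or when even the
    closest one is not strictly below max_dist.
    """
    best = min(candidates, key=lambda c: char_jaccard(text, c), default=None)
    if best is None:
        return None, None
    dist = char_jaccard(text, best)
    if dist < max_dist:
        return best, dist
    return None, None


words = ['apps', 'orange juice', 'citrs']
li = ['apple', 'orange', 'citrus']

for w in words:
    match, dist = best_match(w, li)
    print(w, match, dist)
```

In the dataframe setting, the returned pair would be assigned to SIM_COL and SIM for each row. Because min looks at every word before choosing one, the order of li no longer affects the result, except when two words tie.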