Extracting features from a list of strings

I have a list of strings and some individual strings that look like this:

    mylist = ["the yam is sweet", "what is the best time to come", "who ate my food", "no empty food on the table", "what can I do to make you happy"]  # about 20k data
    myString1 = "Is yam a food"  # String can be longer than this
    myString2 = "should I give you a food"
    myString3 = "I am not happy"

I want to compare each myString with every string in mylist and collect the percentage similarities in three separate lists. So the end result will look like this:

    similar_string1 = [70, 0.5, 50, 55, 2]
    similar_string2 = [50, 0.5, 70, 85, 2]
    similar_string3 = [20, 15, 0, 5, 80]

So myString1 will be compared with each string in mylist to calculate the percentage similarity, and the same goes for myString2 and myString3. Each set of percentages is then collected in a list, as seen above.

I read that one can use TF-IDF to vectorize mylist and the myString values and then use cosine similarity to compare them, but I have never worked on something like this before, and I would love it if anyone has an idea, process, or code that will help me get started.

Thanks

A Python implementation of cosine similarity has already been discussed in "Calculate cosine similarity given 2 sentence strings".

You can check the link above and use the code snippet below:

```python
# text_to_vector and get_cosine are the helper functions from the linked answer;
# they must be defined before this snippet will run.
vector1 = text_to_vector(myString1)
vector2 = text_to_vector(myString2)
vector3 = text_to_vector(myString3)
similar_string1 = []
similar_string2 = []
similar_string3 = []

for ele in mylist:
    # Vectorize each list entry once and compare it with all three strings.
    vector = text_to_vector(ele)
    similar_string1.append(get_cosine(vector1, vector))
    similar_string2.append(get_cosine(vector2, vector))
    similar_string3.append(get_cosine(vector3, vector))

print(similar_string1)
print(similar_string2)
print(similar_string3)
```
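
The snippet above calls `text_to_vector` and `get_cosine`, which come from the linked answer and are not reproduced here. As a rough, self-contained sketch, they could look something like the following. This is a plain word-count (bag-of-words) version written for illustration, not necessarily identical to the linked code; the lower-casing step, the rounding, and the `similarity_percentages` wrapper are my own assumptions.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def text_to_vector(text):
    """Count how often each word occurs (lower-casing is an assumption)."""
    return Counter(WORD.findall(text.lower()))


def get_cosine(vec1, vec2):
    """Cosine similarity between two word-count vectors, from 0.0 to 1.0."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty string has no direction to compare
    return numerator / (norm1 * norm2)


def similarity_percentages(query, corpus):
    """Hypothetical helper: compare one query with every string in corpus."""
    query_vec = text_to_vector(query)
    return [round(100 * get_cosine(query_vec, text_to_vector(s)), 1)
            for s in corpus]


mylist = ["the yam is sweet", "what is the best time to come",
          "who ate my food", "no empty food on the table",
          "what can I do to make you happy"]
similar_string1 = similarity_percentages("Is yam a food", mylist)
print(similar_string1)
```

Multiplying by 100 turns the 0 to 1 cosine score into the percentage scale used in the question.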

The variable names are the same as the ones in the question. Note that the cosine similarity of word-count vectors lies between 0 and 1, so multiply by 100 if you need percentages. This code can be optimized to suit your requirements.
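
One possible optimization, which also follows the TF-IDF idea mentioned in the question, is to vectorize `mylist` only once and reuse those vectors for every query string. The sketch below does this with the standard library only. The helper names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`) are hypothetical, and the smoothed IDF formula is just one common choice; for around 20k strings a library such as scikit-learn would likely be faster.

```python
import math
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lower-case and split on whitespace."""
    return text.lower().split()


def build_idf(corpus_tokens):
    """Smoothed inverse document frequency for every word in the corpus."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter(w for tokens in corpus_tokens for w in set(tokens))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1
            for w, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Unit-length TF-IDF vector; words unseen in the corpus are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {w: c * idf[w] for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()} if norm else {}


def cosine(vec1, vec2):
    """For unit-length sparse vectors the dot product is the cosine."""
    if len(vec1) > len(vec2):
        vec1, vec2 = vec2, vec1  # iterate over the shorter vector
    return sum(v * vec2.get(w, 0.0) for w, v in vec1.items())


mylist = ["the yam is sweet", "what is the best time to come",
          "who ate my food", "no empty food on the table",
          "what can I do to make you happy"]
queries = {"myString1": "Is yam a food",
           "myString2": "should I give you a food",
           "myString3": "I am not happy"}

corpus_tokens = [tokenize(s) for s in mylist]
idf = build_idf(corpus_tokens)
corpus_vectors = [tfidf_vector(t, idf) for t in corpus_tokens]  # built once

results = {}
for name, query in queries.items():
    query_vec = tfidf_vector(tokenize(query), idf)
    results[name] = [round(100 * cosine(query_vec, doc), 1)
                     for doc in corpus_vectors]

for name, scores in results.items():
    print(name, scores)
```

Because TF-IDF down-weights words that appear in many strings, the scores will generally differ from the plain word-count cosine, so treat the numbers as a relative ranking more than an exact percentage.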

Let me know if anything is unclear.
