
Find similar words in strings in a for loop with Python

I'm working with tweets, and after text processing the code returns something like:

  • Lorem ipsum dolor sit amaet vi
  • Lorem ipsum dolor sit amaet
  • Lorem ipsum dolor sit amaet via

So the SQLite database treats these records as unique. My question is: how can I detect that two strings share 5 similar words and then skip the duplicate? Should I change my regex code or add an if statement?

My code:

        clean1 = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", tweet.text)
        clean2 = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t:])|(\w+:\/\/\S+)", " ", clean1)
        final = re.sub(r'^RT[\s]+', '', clean2)

Thanks!

I don't think regex will help in this situation.

You could do this to tell whether two lines have 5 words in common:

str1 = "Lorem ipsum dolor sit amaet vi"
str2 = "Lorem ipsum dolor sit amaet"

# Count the words of str2 that also occur somewhere in str1.
count = 0
str1_split = str1.split(" ")
for word in str2.split(" "):
    if word in str1_split:
        count += 1

print(count)  # 5
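
The counting loop above can be wrapped in a helper and used to skip near-duplicate tweets before they reach the database. This is only a minimal sketch, not the asker's actual pipeline: the names `shared_word_count` and `filter_near_duplicates` and the `min_shared` threshold are hypothetical, and it compares sets of distinct words, so a repeated word counts once.

```python
def shared_word_count(a, b):
    """Return how many distinct words appear in both strings."""
    return len(set(a.split()) & set(b.split()))


def filter_near_duplicates(texts, min_shared=5):
    """Keep a text only if it shares fewer than min_shared words
    with every text already kept (hypothetical helper)."""
    kept = []
    for text in texts:
        if any(shared_word_count(text, seen) >= min_shared for seen in kept):
            continue  # near-duplicate of something already kept: skip it
        kept.append(text)
    return kept


tweets = [
    "Lorem ipsum dolor sit amaet vi",
    "Lorem ipsum dolor sit amaet",
    "Lorem ipsum dolor sit amaet via",
    "Something else entirely",
]
print(filter_near_duplicates(tweets))
# ['Lorem ipsum dolor sit amaet vi', 'Something else entirely']
```

Each new text is compared against every text already kept, which is fine for small batches but grows quadratically with the number of tweets.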

Here is a method that counts the words two strings have in common at the same positions:

a = "Lorem ipsum dolor sit amaet vi"
b = "Lorem ipsum dolor sit amaet"
count = 0
# zip pairs the words up by position and stops at the shorter string.
for i, j in zip(a.split(), b.split()):
    if i == j:
        count += 1
print(count)

Output:

5
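
The two answers agree on this input, but they may diverge when the words are shifted, for example by a leading "RT". The following sketch, using a made-up second string, contrasts a position-by-position comparison with a membership test:

```python
a = "Lorem ipsum dolor sit amaet vi"
b = "RT Lorem ipsum dolor sit amaet"  # hypothetical input with a shifted word order

# Positional comparison: words must match at the same index.
positional = sum(1 for x, y in zip(a.split(), b.split()) if x == y)

# Membership comparison: a word in b counts if it occurs anywhere in a.
a_words = set(a.split())
membership = sum(1 for word in b.split() if word in a_words)

print(positional)  # 0
print(membership)  # 5
```

For tweets, where a prefix or an extra word can shift every position, the membership test is likely the safer choice.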
