Find similar words in strings inside a for loop using Python

Question

I'm working with tweets, and after text processing the code returns something like:

Lorem ipsum dolor sit amaet vi
Lorem ipsum dolor sit amaet
Lorem ipsum dolor sit amaet via

So the SQLite database identifies these records as unique. My question is: how can I check whether two strings contain 5 similar words and, if they do, skip the record? Should I change my regex code or add an if statement?

My code:

        clean1 = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", tweet.text)
        clean2 = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t:])|(\w+:\/\/\S+)", " ", clean1)
        final = re.sub(r'^RT[\s]+', '', clean2)

Thanks!

Answer 1

I don't think regex will help in this situation.

You could do this to tell whether two lines have 5 words in common:

str1 = "Lorem ipsum dolor sit amaet vi"
str2 = "Lorem ipsum dolor sit amaet"

# Count how many words of str2 also occur somewhere in str1
count = 0
str1_split = str1.split(" ")
for word in str2.split(" "):
    if word in str1_split:
        count += 1

print(count)
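
To go from counting to actually skipping records, the same idea can be wrapped in a check that runs before each insert. The following is only a minimal sketch: the helper names `shared_word_count` and `drop_near_duplicates`, the threshold of 5, and the sample list are assumptions made for illustration, and it uses sets, so repeated words are counted once.

```python
def shared_word_count(a, b):
    """Count the distinct words that appear in both strings."""
    return len(set(a.split()) & set(b.split()))


def drop_near_duplicates(texts, threshold=5):
    """Keep a text only if it shares fewer than `threshold` words
    with every text that has already been kept."""
    kept = []
    for text in texts:
        if any(shared_word_count(text, seen) >= threshold for seen in kept):
            continue  # near-duplicate of an earlier text: skip it
        kept.append(text)
    return kept


tweets = [
    "Lorem ipsum dolor sit amaet vi",
    "Lorem ipsum dolor sit amaet",
    "Lorem ipsum dolor sit amaet via",
    "A completely different tweet",
]
print(drop_near_duplicates(tweets))
# ['Lorem ipsum dolor sit amaet vi', 'A completely different tweet']
```

Each new text is compared against every kept text, so this is quadratic in the number of tweets; that is probably fine for small batches but may need rethinking for large tables.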

Answer 2

Here is a way to count the matching words in two strings, comparing them position by position:

a = "Lorem ipsum dolor sit amaet vi"
b = "Lorem ipsum dolor sit amaet"
count = 0
# zip pairs the words by position and stops at the shorter string
for i, j in zip(a.split(), b.split()):
    if i == j:
        count += 1
print(count)

Output:

5
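
One caveat with the positional comparison: if one string has an extra word at the front, every later pair is shifted and nothing matches. As a hedged alternative, the standard library's `difflib.SequenceMatcher` can compare the two word lists without depending on exact positions. The function names and the shifted sample strings below are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def positional_matches(a, b):
    """Count words that are equal at the same position (the zip approach)."""
    return sum(x == y for x, y in zip(a.split(), b.split()))


def word_similarity(a, b):
    """Similarity ratio in [0, 1] computed over the word lists."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


a = "via Lorem ipsum dolor sit amaet"
b = "Lorem ipsum dolor sit amaet"

print(positional_matches(a, b))         # 0, every pair is shifted by one word
print(round(word_similarity(a, b), 2))  # 0.91
```

A record could then be skipped when the ratio exceeds some cutoff chosen for the data, for example 0.8; that cutoff is a guess and would need tuning.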
