Quick search for similar texts

Question

I maintain a public blog where users can publish their posts. Some users have more than a thousand different texts, and they might not remember that they have already published a given text. I would like to help users avoid publishing duplicates.

Comparing texts for exact equality is not enough: the user might have changed the text slightly, altered the formatting, or copied it from a different program. So I need a quick way to estimate whether a similar text already exists in the database.

My technology stack includes PHP, MySQL and Redis. How can I solve this problem using those or other tools?

Answer 1

PHP has a function called similar_text, which you can use to calculate the number of matching characters between two strings, or their similarity as a percentage.

http://php.net/manual/en/function.similar-text.php

You could then check whether the given text is within a certain similarity margin of older blog posts.
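A minimal sketch of that check, assuming a user's existing posts are available as an array keyed by post ID. The helper names normalizeText and findSimilarPosts and the 80% threshold are illustrative assumptions, not part of PHP; only similar_text itself is built in. Texts are normalized first so that formatting differences alone do not lower the score.

```php
<?php
// Hypothetical helper: reduce a post to a comparable form so that
// markup, letter case, and whitespace changes do not affect the score.
function normalizeText(string $text): string
{
    $text = strip_tags($text);                  // drop HTML formatting
    $text = preg_replace('/\s+/', ' ', $text);  // collapse runs of whitespace
    return strtolower(trim($text));
}

// Hypothetical helper: return [postId => similarity percent] for every
// existing post at least $minPercent similar to the new text,
// most similar first. The 80.0 default is an arbitrary assumption.
function findSimilarPosts(string $newText, array $existingPosts, float $minPercent = 80.0): array
{
    $needle = normalizeText($newText);
    $matches = [];
    foreach ($existingPosts as $postId => $oldText) {
        // similar_text() writes the similarity percentage into $percent.
        similar_text($needle, normalizeText($oldText), $percent);
        if ($percent >= $minPercent) {
            $matches[$postId] = $percent;
        }
    }
    arsort($matches); // highest similarity first, keys preserved
    return $matches;
}
```

Calling findSimilarPosts($draft, $postsOfThisUser) before saving would give a list of likely duplicates to show the user. The PHP manual notes that similar_text can be slow on long strings, so with over a thousand long posts it may be worth narrowing down the candidates first.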

If you don't want to compare the texts themselves, you could tag the posts based on the tags or subject of the original blog, and then show users the posts they have made with similar tags.
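One way to make "similar tags" concrete is to score the overlap between two tag lists. The sketch below uses the share of tags the two lists have in common (intersection divided by union); this measure, the function names tagOverlap and postsWithSimilarTags, and the 0.5 threshold are all assumptions made for illustration, not something the answer prescribes.

```php
<?php
// Hypothetical helper: overlap between two tag lists, from 0.0 (no
// shared tags) to 1.0 (identical tag sets). Tags are compared
// case-insensitively.
function tagOverlap(array $tagsA, array $tagsB): float
{
    $a = array_unique(array_map('strtolower', $tagsA));
    $b = array_unique(array_map('strtolower', $tagsB));
    $union = count(array_unique(array_merge($a, $b)));
    if ($union === 0) {
        return 0.0; // both lists empty: treat as no overlap
    }
    return count(array_intersect($a, $b)) / $union;
}

// Hypothetical helper: given the tags of a new post and a map of
// [postId => tags] for existing posts, return [postId => score] for
// posts whose overlap is at least $minOverlap, best match first.
function postsWithSimilarTags(array $newTags, array $postTags, float $minOverlap = 0.5): array
{
    $result = [];
    foreach ($postTags as $postId => $tags) {
        $score = tagOverlap($newTags, $tags);
        if ($score >= $minOverlap) {
            $result[$postId] = $score;
        }
    }
    arsort($result);
    return $result;
}
```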

Answer 2

You can use MySQL's MATCH ... AGAINST on a column that has a full-text index.

As an example:

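-- `table` is a placeholder name; it is backtick-quoted because TABLE is a reserved word in MySQL.
SELECT `table`.*,
       MATCH(userText) AGAINST ('this is user input') AS relevancy
FROM `table`
ORDER BY relevancy DESC;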

This returns the results ordered by relevancy.

Don't forget to add a full-text index on the userText column.
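A sketch of how the index and the query might be wired up from PHP with PDO. The table name posts, the columns id and user_id, the index name ft_userText, and both function names are assumptions for illustration; only userText and the MATCH ... AGAINST expression come from the answer above. The WHERE clause, which restricts the search to one user's matching posts, is also an addition.

```php
<?php
// One-time setup, run once against MySQL (assumed table and index names):
//   ALTER TABLE posts ADD FULLTEXT INDEX ft_userText (userText);

// Hypothetical helper: build the candidate query. The search text is
// bound twice under different names, since a named placeholder cannot
// always be reused when PDO's emulated prepares are turned off.
function candidateQuerySql(int $limit = 10): string
{
    return 'SELECT id, userText, '
         . 'MATCH(userText) AGAINST (:q1) AS relevancy '
         . 'FROM posts '
         . 'WHERE user_id = :uid AND MATCH(userText) AGAINST (:q2) '
         . 'ORDER BY relevancy DESC '
         . 'LIMIT ' . max(1, $limit);
}

// Hypothetical helper: fetch the most relevant existing posts of one
// user for the text they are about to publish.
function findCandidates(PDO $pdo, int $userId, string $input, int $limit = 10): array
{
    $stmt = $pdo->prepare(candidateQuerySql($limit));
    $stmt->execute([':q1' => $input, ':q2' => $input, ':uid' => $userId]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

The relevancy value is a ranking score rather than a percentage, so it is better suited to picking a handful of candidates than to deciding on its own whether a post is a duplicate.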

Answer 1: 1 2014-12-22 11:57:11
Answer 2: 1 2014-12-22 11:57:15
