
Find an exact word from column A within text in column B (MySQL)

I have two tables and am trying to eliminate all entries in table 1 (multiple words per row) which contain one of the entries in table 2. The words from table 2 can be anywhere in the strings of table 1.

It should find things like: 'house' in 'big house here' or in 'big house'.

It should not find things like: 'house' in 'houses'.

I tried to use the LOCATE function like this:

CREATE TABLE `test`
AS (
  SELECT
    `table1`.`term1`,
    `table2`.`term2`
  FROM `table1`,`table2`
  WHERE
    locate(concat(' ',`table2`.`term2`,' '), concat(' ',`table1`.`term1`,' '))
);

The problem is: it finds some matches, but not all, and I cannot see why it does not work for everything.

If there is any punctuation surrounding the word you're looking for, your matching won't work.

You could replace all punctuation in the field with spaces.
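To see why the space-padded LOCATE check misses some rows, it can help to run it on literal strings. This is a minimal sketch; the sample strings are invented for illustration and are not taken from the asker's data.

```sql
-- Sample strings are made up for illustration only.
SELECT
  -- 5: ' house ' starts at position 5 of ' big house here '
  LOCATE(' house ', CONCAT(' ', 'big house here', ' '))  AS plain_match,
  -- 0: 'house' is followed by a comma, not a space, so nothing is found
  LOCATE(' house ', CONCAT(' ', 'big house, here', ' ')) AS comma_miss,
  -- 5: once the comma is replaced by a space, the word is found again
  LOCATE(' house ', CONCAT(' ', REPLACE('big house, here', ',', ' '), ' ')) AS after_replace;
```

In a WHERE clause, a result of 0 is treated as false, which is why rows with punctuation next to the word silently drop out.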

However, I think a much cleaner solution would be a regular expression:

CREATE TABLE test
AS
SELECT table1.term1, table2.term2
FROM table1, table2
WHERE table1.term1 REGEXP CONCAT('(^|[^A-Za-z])',table2.term2,'([^A-Za-z]|$)');

(^|[^A-Za-z]) means either the start of the field or a character that is not A-Z or a-z.
([^A-Za-z]|$) means either a character that is not A-Z or a-z, or the end of the field.
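The two boundary groups can be checked on literal values before running the query against real tables. In this sketch the literals stand in for table1.term1 and table2.term2 and are purely illustrative. Note that the term is spliced into the pattern as-is, so a term containing regex metacharacters (for example . or +) would be interpreted as part of the pattern.

```sql
-- Literal values stand in for table1.term1 and table2.term2 (illustrative only).
SELECT
  -- 1: 'house' has a space on both sides
  'big house here' REGEXP CONCAT('(^|[^A-Za-z])', 'house', '([^A-Za-z]|$)') AS mid_field,
  -- 1: start of field on the left, comma on the right
  'house, big'     REGEXP CONCAT('(^|[^A-Za-z])', 'house', '([^A-Za-z]|$)') AS at_start,
  -- 0: 'house' is followed by the letter 's'
  'houses'         REGEXP CONCAT('(^|[^A-Za-z])', 'house', '([^A-Za-z]|$)') AS plural;
```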

SQLFiddle.

EDIT:

While the above is pretty and all, it's not particularly efficient (140 ms in a small test).

More efficient (80 ms; could be much better on proper data):

SELECT term1, term2
FROM table1, table2
WHERE term1 LIKE CONCAT('%',term2,'%')
  AND term1 REGEXP CONCAT('(^|[^A-Za-z])',term2,'([^A-Za-z]|$)');

Way more efficient (8 ms); for some weird reason, MySQL seemingly can't do regex very well:

SELECT COUNT(*)
FROM table1, table2
WHERE term1 LIKE CONCAT(term2,' %')
   OR term1 LIKE CONCAT(term2,',%')
   OR term1 LIKE CONCAT(term2,'.%')
   OR term1 LIKE CONCAT(term2,';%')
   OR term1 LIKE CONCAT('% ',term2,' %')
   OR term1 LIKE CONCAT('% ',term2,',%')
   OR term1 LIKE CONCAT('% ',term2,'.%')
   OR term1 LIKE CONCAT('% ',term2,';%')
   OR term1 LIKE CONCAT('% ',term2)
   OR term1 = term2;  -- exact match, not covered by the patterns above

Slightly more efficient (4 ms):

SELECT COUNT(*)
FROM table1, table2
WHERE CONCAT(' ', REPLACE(REPLACE(REPLACE(term1, ',', ' '), '.', ' '), ';', ' '), ' ')
        LIKE CONCAT('% ',term2,' %')

You may want to include a few more punctuation characters above.
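As a sketch of what adding more characters could look like, the same query can be extended with further nested REPLACE calls. The choice of '!' and '?' here is only an example; which characters matter depends on the data, and the timing of this variant has not been measured.

```sql
-- '!' and '?' are example additions; pick the characters that occur in your data.
SELECT COUNT(*)
FROM table1, table2
WHERE CONCAT(' ',
             REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(term1,
               ',', ' '), '.', ' '), ';', ' '), '!', ' '), '?', ' '),
             ' ')
        LIKE CONCAT('% ', term2, ' %');
```

Keep in mind that % and _ inside term2 would act as LIKE wildcards rather than literal characters.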

SQLFiddle.

Note that much of the above depends on the data; some approaches may be more efficient in some cases and much worse in others (but regex will probably trail behind).

Even more efficient?

Fulltext indices + searching.
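A rough sketch of the fulltext route, under some assumptions: table1 uses an engine that supports FULLTEXT indexes (InnoDB in MySQL 5.6 or later, or MyISAM), and the index name ft_term1 is made up for this example. MATCH ... AGAINST expects a constant search string, so the sketch looks up a single literal term rather than joining directly against table2.term2.

```sql
-- Assumes an engine with FULLTEXT support; the index name ft_term1 is hypothetical.
ALTER TABLE table1 ADD FULLTEXT INDEX ft_term1 (term1);

-- Matches rows containing the whole word 'house'; 'houses' is a different
-- token and is not matched, because no trailing * wildcard is used.
SELECT term1
FROM table1
WHERE MATCH(term1) AGAINST('+house' IN BOOLEAN MODE);
```

Fulltext search skips stopwords and words shorter than the configured minimum token length, so very short or very common terms from table2 may need one of the LIKE-based queries instead.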
