MySQL regexp pattern to match repeated words within a single string

Can someone please help me solve a problem that I'm having with MySQL and REGEXP?

I am working on cleaning a MySQL table containing vehicle inventory. The table has several million rows. I am trying to come up with a regex pattern that will find repeated words in each cell and replace one of them with a SPACE character, keeping the other. Here is an example of my table. There are many more columns in that table, but I only included a few for demonstration purposes.

[image: example rows from the table]

If you notice, the two columns MAKE and MODEL contain repeated words (e.g. "FORD FORD", "TOYOTA TOYOTA"). This table was loaded from an old Excel file that used to be maintained manually. As you can see, the data is extremely dirty. I'm trying to do as much cleaning as possible to standardize the data. I want to keep only one copy of each repeated word and remove the duplicates (e.g. "FORD", "TOYOTA", "NISSAN").

I was able to solve this problem partially (see code below):

update t_inventory
set make = trim(regexp_replace(make, '(\\([A-Za-z]+\\))', ' '))
where make regexp '^([A-Za-z]+)([^a-zA-Z0-9]+)(\\([A-Za-z]+\\))'
    and mid(make, 1, instr(make, '(') - 2) = 
        mid(make, instr(make, '(') + 1, instr(make, ')') - instr(make, '(') - 1);

The above code solves the problem for values like "FORD (FORD)" or "TOYOTA (TOYOTA)", where the first word is bare, the second word is inside parentheses, and there are no other leading or trailing characters. But when I have a string like "MAKE NISSAN (NISSAN)", the above code won't work. It will replace the word NISSAN with a SPACE, leaving only the word MAKE.

Is there any way to write a single REGEXP pattern that removes all repeated words, keeping only one copy? I don't even care if the parentheses are kept. I can easily clean them up later.

You'll probably ask why I don't find all possible garbage, create a dictionary, and then write a procedure to filter it out. Yes, that would be ideal if the table had a few hundred to a few thousand rows. But my table has millions of rows. As I mentioned above, this data was migrated from an Excel file that was maintained manually for over 20 years. It's hard to imagine how dirty the data is. What you see in the image above is as simple as it gets. I wouldn't have asked for help if it weren't so complex.

I really appreciate your help. Thank you so much in advance!

Dirty data is often too chaotic to fix in a single UPDATE.

Answer: use more than one UPDATE!

UPDATE t_inventory
SET make = TRIM(TRIM(LEADING 'MAKE' FROM make));  -- outer TRIM drops the space left behind after 'MAKE'

UPDATE t_inventory
SET make = REPLACE(make, 'FORD (FORD)', 'FORD');

UPDATE t_inventory
SET make = REPLACE(make, 'NISSAN (NISSAN)', 'NISSAN');

UPDATE t_inventory
SET make = REPLACE(make, 'HONDA (HONDA)', 'HONDA');

...and so on...

Every such edit is very simple to write.
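One way to decide which edits to write is to list the distinct values first. The query below is only a sketch using the table and column names from the question; the LIMIT of 50 is an arbitrary choice.

```sql
-- List the most common values of make, with how many rows carry each one.
SELECT make, COUNT(*) AS row_count
FROM t_inventory
GROUP BY make
ORDER BY row_count DESC
LIMIT 50;
```

Dirty values that appear near the top affect the most rows, so they are a reasonable place to start.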

You will probably now ask if you can also change "NISSAN (NISSAN" (with the closing parenthesis missing) in the same UPDATE.

You're still thinking about combining the edits. Stop that. Just do multiple edits.

UPDATE t_inventory
SET make = REPLACE(make, 'NISSAN (NISSAN', 'NISSAN');

It does take longer to execute multiple edits. I understand you said your table has millions of rows. But if you compare that to the time it takes you to develop a clever way of combining the edits, it's probably a wash. Besides, computers are good at executing a change over millions of rows. You just have to wait for it to finish.
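If you want to see what an edit will touch before running it, you can count the matching rows first and reuse the same condition in the UPDATE. This is a sketch; the LIKE filter is an assumption added here and is not required for the edit to work.

```sql
-- How many rows contain the dirty value?
SELECT COUNT(*)
FROM t_inventory
WHERE make LIKE '%NISSAN (NISSAN)%';

-- Apply the edit only to rows that match the same condition.
UPDATE t_inventory
SET make = REPLACE(make, 'NISSAN (NISSAN)', 'NISSAN')
WHERE make LIKE '%NISSAN (NISSAN)%';
```

Note that REPLACE() is case-sensitive, while LIKE follows the column's collation, so the count may include rows that the REPLACE leaves unchanged.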

mysql> SELECT REGEXP_REPLACE("FORD (FORD)", '\\b(\\w+)\\b(.*)\\b(\\1)\\b(.*)$', '$1$2$4');
+-----------------------------------------------------------------------------+
| REGEXP_REPLACE("FORD (FORD)", '\\b(\\w+)\\b(.*)\\b(\\1)\\b(.*)$', '$1$2$4') |
+-----------------------------------------------------------------------------+
| FORD ()                                                                     |
+-----------------------------------------------------------------------------+

That used version 8.0.31; another version may have different syntax.

Note that the replacement rebuilds the string without the second copy of 'FORD' (that is, $3).
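As a sketch of how this could be applied to the table from the question, assuming MySQL 8.0 with the same regex syntax as above: the extra REPLACE and TRIM that remove the leftover '()' are an illustrative addition, and it is worth trying this on a copy of the data first.

```sql
-- Preview the result for rows that contain a repeated word.
SELECT make,
       TRIM(REPLACE(
           REGEXP_REPLACE(make, '\\b(\\w+)\\b(.*)\\b(\\1)\\b(.*)$', '$1$2$4'),
           '()', '')) AS cleaned
FROM t_inventory
WHERE make REGEXP '\\b(\\w+)\\b.*\\b\\1\\b'
LIMIT 20;

-- Apply it once the preview looks right.
UPDATE t_inventory
SET make = TRIM(REPLACE(
        REGEXP_REPLACE(make, '\\b(\\w+)\\b(.*)\\b(\\1)\\b(.*)$', '$1$2$4'),
        '()', ''))
WHERE make REGEXP '\\b(\\w+)\\b.*\\b\\1\\b';
```

Each pass removes only one repeated copy, because the greedy (.*) makes \1 match the last occurrence. A value such as 'FORD FORD FORD' therefore needs the UPDATE run again until no rows match.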
