Best approach to correct a column with different spellings in MySQL
I have a table with a column whose data contains spelling errors, like:
apple, appl, aple
bana, banana, banna
cat, cot, cta
I would like to correct every misspelling to the single correct spelling. There are thousands of rows. What would be the best approach to fix this without having to update each spelling error manually? I have added a status flag, iscorrect = 'Y', for the correct ones.
Here's a thought, using SOUNDEX. SOUNDEX is really a lousy function, and certainly no panacea, but it might reduce a data set comprising thousands of errors to a data set comprising hundreds of errors.
For the rest, we can look at things like Levenshtein distance, but ultimately, you're going to need a manual approach to some extent...
DROP TABLE IF EXISTS bad_data;
CREATE TABLE bad_data
(id SERIAL PRIMARY KEY
,string VARCHAR(12) NOT NULL
);
INSERT INTO bad_data (string) VALUES
('apple'),
('appl'),
('aple'),
('bana'),
('banana'),
('banna'),
('cat'),
('cot'),
('cta');
DROP TABLE IF EXISTS good_data;
CREATE TABLE good_data
(id SERIAL PRIMARY KEY
,string VARCHAR(12) NOT NULL UNIQUE
);
INSERT INTO good_data(string) VALUES
('apple'),
('banana'),
('cat');
SELECT *
FROM bad_data x
JOIN good_data y ON SOUNDEX(x.string) = SOUNDEX(y.string);
+----+--------+----+--------+
| id | string | id | string |
+----+--------+----+--------+
|  1 | apple  |  1 | apple  |
|  2 | appl   |  1 | apple  |
|  3 | aple   |  1 | apple  |
|  4 | bana   |  2 | banana |
|  5 | banana |  2 | banana |
|  6 | banna  |  2 | banana |
|  7 | cat    |  3 | cat    |
|  8 | cot    |  3 | cat    |
|  9 | cta    |  3 | cat    |
+----+--------+----+--------+
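Once the join output looks right, the same join can drive the actual correction. What follows is only a sketch, assuming the bad_data and good_data tables from the example above; on a real table you would want to run it inside a transaction or against a copy first. The ambiguity check and the leftover query are my own additions, there to show where the manual work remains.

```sql
-- Sketch only: assumes the bad_data / good_data tables defined above.

-- 1. Look for misspellings whose SOUNDEX code matches more than one
--    correct word. Those are ambiguous and are better fixed by hand.
SELECT x.id, x.string, COUNT(*) AS candidates
FROM bad_data x
JOIN good_data y ON SOUNDEX(x.string) = SOUNDEX(y.string)
GROUP BY x.id, x.string
HAVING COUNT(*) > 1;

-- 2. Overwrite each misspelling with the correct word that shares
--    its SOUNDEX code. Rows that are already correct are skipped.
UPDATE bad_data x
JOIN good_data y ON SOUNDEX(x.string) = SOUNDEX(y.string)
SET x.string = y.string
WHERE x.string <> y.string;

-- 3. List rows that matched no correct word at all. These are the
--    candidates for Levenshtein distance or manual review.
SELECT x.*
FROM bad_data x
LEFT JOIN good_data y ON SOUNDEX(x.string) = SOUNDEX(y.string)
WHERE y.id IS NULL;
```

If the correct spellings live in the same table, flagged with iscorrect = 'Y' as in the question, good_data could presumably be populated from those rows (or replaced by a self-join on that flag) instead of being typed in by hand.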