
Merge ALMOST identical data rows

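I have a large amount of data (UK and US postal addresses), more than 100,000 rows of 5 columns each, that contains duplicate or ALMOST identical rows. In the almost identical rows, four of the five columns match exactly, for example:

AAAA BBBB CCCCCC CCCCCCCC CCCCCCCC 11.111 22.222
AAAA BBBB CCCCCC CCCCCCCC 11.111 22.222
DDDD EEEE FF FFFFF FFFFF FFFFFFFFF 33.33 44.444
DDDD EEEE FF FFFFF FFFFF 33.33 44.444
GGGG HHHH IIII IIIII IIIIIIII 55.555 66.666
GGGG HHHH IIII IIIII 55.555 66.666

I can't manage these duplicate (or nearly duplicate) rows. What I want to end up with is:

AAAA BBBB CCCCCC CCCCCCCC CCCCCCCC 11.111 22.222
DDDD EEEE FF FFFFF FFFFF FFFFFFFFF 33.33 44.444
GGGG HHHH IIII IIIII IIIIIIII 55.555 66.666

That is, in each pair, discard the row whose differing column has the "shorter" data length.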

You can achieve this with the following steps:
1. Sort on column 1.
2. Sort on column 2.
3. Sort on column 4.
4. Sort on column 5.
5. Reorder rows permanently (the option at the top).

All the rows are now sorted permanently. Next, do a blank down on column 1.
The result would be:
===============================================================

AAAA BBBB CCCCCC CCCCCCCC CCCCCCCC 11.111 22.222
BBBB CCCCCC CCCCCCCC 11.111 22.222
DDDD EEEE FF FFFFF FFFFF FFFFFFFFF 33.33 44.444
EEEE FF FFFFF FFFFF 33.33 44.444
GGGG HHHH IIII IIIII IIIIIIII 55.555 66.666
HHHH IIII IIIII 55.555 66.666

===================================================================  

Now select all the rows with a blank first column and delete them.
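
If the data can be loaded into a script instead, the same merge can be sketched in a few lines of Python. This is only a minimal sketch under assumptions not stated in the original post: each row has already been split into exactly 5 fields, the column that varies is the third one, and "shorter" means fewer characters. The function name `merge_near_duplicates` is made up for illustration.

```python
def merge_near_duplicates(rows):
    """For each group of rows that agree on columns 1, 2, 4 and 5,
    keep only the row whose column 3 is longest."""
    best = {}  # key -> kept row; dicts keep first-insertion order (Python 3.7+)
    for row in rows:
        # Assumption: columns 1, 2, 4 and 5 are the ones that match exactly.
        key = (row[0], row[1], row[3], row[4])
        kept = best.get(key)
        # Replace the kept row only if this one has a strictly longer column 3.
        if kept is None or len(row[2]) > len(kept[2]):
            best[key] = row
    return list(best.values())


# Sample data taken from the question, already split into 5 fields.
rows = [
    ("AAAA", "BBBB", "CCCCCC CCCCCCCC CCCCCCCC", "11.111", "22.222"),
    ("AAAA", "BBBB", "CCCCCC CCCCCCCC", "11.111", "22.222"),
    ("DDDD", "EEEE", "FF FFFFF FFFFF FFFFFFFFF", "33.33", "44.444"),
    ("DDDD", "EEEE", "FF FFFFF FFFFF", "33.33", "44.444"),
    ("GGGG", "HHHH", "IIII IIIII IIIIIIII", "55.555", "66.666"),
    ("GGGG", "HHHH", "IIII IIIII", "55.555", "66.666"),
]

result = merge_near_duplicates(rows)
for row in result:
    print(" ".join(row))
```

Unlike the sort-and-blank-down approach, this does not depend on the longer row happening to come first within each pair, because the length of column 3 is compared explicitly.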
