
How can I merge two datasets with similar words in Python?

For instance, I have a row in dataset_1 with the value "Entity" = Apple

and a row in dataset_2 with the value "Entity" = iCloud Apple

(Entity is the column.) I need to merge one dataset into the other on the Entity column, but a merge requires the values to be exactly the same, and Apple ≠ iCloud Apple.

Both datasets are huge, so I can't fix the values manually, one by one.


Code:

`
# preparing data
dataset_1 = {"Entity": 'Prudential Insurance Company of America - Unisys', 'Bank': 'America'}
dataset_2 = {"Entity": 'Unisys', 'Bank': 'Africkan', 'code': '70000-000'}
ds_array = [dataset_1, dataset_2]
# end of preparing data

# compare every dataset except the last (d1) with every dataset except the first (d2)
for d1 in ds_array[:-1]:
    n1 = {x for x in d1['Entity'].split() if len(x) >= 5}  # keep only words with 5 or more letters
    for d2 in ds_array[1:]:
        n2 = {x for x in d2['Entity'].split() if len(x) >= 5}
        merge = n1 & n2  # words present in both sets n1 and n2
        if merge:  # true when at least one word is shared
            d1['Entity'] = ' '.join(merge)  # overwrite both values with the shared word(s)
            d2['Entity'] = d1['Entity']
print(ds_array)
`

Output: [{'Entity': 'Unisys', 'Bank': 'America'}, {'Entity': 'Unisys', 'Bank': 'Africkan', 'code': '70000-000'}]
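
The code above only rewrites the Entity values so that they match; it does not join the rows. As a minimal sketch of the join itself, assuming each dataset is a list of dict rows, the same shared-word test can be used to build combined rows. The helper names `key_words` and `merge_on_shared_words`, the `left_`/`right_` key prefixes, and the lowercasing of words are illustrative assumptions, not part of the original answer.

```python
def key_words(name, min_len=5):
    """Return the lowercase words of `name` that have at least `min_len` letters."""
    return {w.lower() for w in name.split() if len(w) >= min_len}


def merge_on_shared_words(left_rows, right_rows, column="Entity", min_len=5):
    """Join two lists of dict rows whenever their `column` values share a word."""
    merged = []
    for left in left_rows:
        left_words = key_words(left[column], min_len)
        for right in right_rows:
            shared = left_words & key_words(right[column], min_len)
            if shared:
                # Prefix the keys so that columns present on both sides are kept.
                row = {f"left_{k}": v for k, v in left.items()}
                row.update({f"right_{k}": v for k, v in right.items()})
                # The join key is the shared word(s), sorted for a stable result.
                row[column] = " ".join(sorted(shared))
                merged.append(row)
    return merged


left_rows = [{"Entity": "Prudential Insurance Company of America - Unisys", "Bank": "America"}]
right_rows = [{"Entity": "Unisys", "Bank": "Africkan", "code": "70000-000"}]
result = merge_on_shared_words(left_rows, right_rows)
print(result)
```

This compares every row of one dataset with every row of the other, so for huge datasets it may be too slow; indexing the rows of one side by word first would likely help. A single common word such as "Company" can also produce false matches, so the minimum word length is worth tuning on real data.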
