
Python - How to find and merge duplicates in list of addresses (which are lists themselves)

I have a list of addresses that I scraped from a site and I would like to compare that to a list of addresses from a previous scrape to merge and remove the duplicates. The list I scraped has zip codes and sometimes cities while the previous scraped list only has cities, no zip. My end goal is to merge all the duplicates and leave entries without duplicates.

Here is an example of 2 entries that I would like to merge:

['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '84106', '(801) 849-0292']

['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '', '(801) 849-0292']

Here's another snippet:

['1 STOP HOBBY & CRAFT SHOP', 'BATH', 'NY', '', '(607) 776-9293']

['1/32 SLOTCAR RACEWAY', 'UNIVERSITY PLACE', 'WA', '', '(253) 255-1807']

['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '84106', '(801) 849-0292']

['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '', '(801) 849-0292']

['2 JACKS HOBBIES AND MORE', 'LAFAYETTE', 'LA', '70507', '(337) 212-2916']

['2 JACKS HOBBIES AND MORE', 'LAFAYETTE', 'LA', '', '(337) 212-2916']

['3D HOBBIES', 'SOCIAL CIRCLE', 'GA', '', '(678) 283-9662']

['3DXHOBBIES', 'GREEN BROOK', 'NJ', '', '(732) 424-6400']

['5TH GEAR POWERSPORTS', 'ELKO', 'NV', '89801', '(775) 777-3373']

['5TH GEAR POWERSPORTS', 'ELKO', 'NV', '', '(775) 777-3373']

Entries 1 and 2 should stay, while the 3rd and 4th should be merged.

EDIT:

I apologize if I wasn't clear, this is my first time posting a question and I still have a lot to learn. I'll try to explain my question better.

I have two CSV files that each contain a list of stores. Both files have the same fields: NAME, CITY, STATE, ZIP, PHONE. One file has data in the ZIP column while the other does not. The goal is to end up with a CSV file that has the stores that are unique to the file with zip codes.

import csv

# Load the previous scrape (rows have an empty ZIP column)
with open('without_zips.csv', newline='') as f:
    without_zips = list(csv.reader(f))

# Load the new scrape (rows include ZIP codes)
with open('with_zips.csv', newline='') as f:
    with_zips = list(csv.reader(f))

The method @kzimmerman suggested worked, but I swapped each instance of without_zips with with_zips. The following code worked:

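for without_zip_entry in without_zips:
    this_telephone = without_zip_entry[-1]
    for i, zip_entry in enumerate(with_zips):
        that_telephone = zip_entry[-1]
        if this_telephone == that_telephone:
            # Remove the entry with a zip code that also appears in without_zips
            del with_zips[i]
            # Stop scanning: continuing after a delete would skip the next entry.
            # This assumes each phone number appears at most once in with_zips.
            break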

This answer is not especially Pythonic, but it should at least give you an idea of where to go next.

a = ['1 STOP HOBBY & CRAFT SHOP', 'BATH', 'NY', '', '(607) 776-9293']
a1 = ['1/32 SLOTCAR RACEWAY', 'UNIVERSITY PLACE', 'WA', '', '(253) 255-1807']
a2 = ['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '84106', '(801) 849-0292']
a3 = ['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '', '(801) 849-0292']
a4 = ['2 JACKS HOBBIES AND MORE', 'LAFAYETTE', 'LA', '70507', '(337) 212-2916']
a5 = ['2 JACKS HOBBIES AND MORE', 'LAFAYETTE', 'LA', '', '(337) 212-2916']
a6 = ['3D HOBBIES', 'SOCIAL CIRCLE', 'GA', '', '(678) 283-9662']
a7 = ['3DXHOBBIES', 'GREEN BROOK', 'NJ', '', '(732) 424-6400']
a8 = ['5TH GEAR POWERSPORTS', 'ELKO', 'NV', '89801', '(775) 777-3373']
a9 = ['5TH GEAR POWERSPORTS', 'ELKO', 'NV', '', '(775) 777-3373']

data = [a, a1, a2, a3, a4, a5, a6, a7, a8, a9]

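result = {}

for item in data:
    key = item[0]  # the store name is used as the unique key
    if key in result:
        # Merge: fill any field that is still empty (e.g. the zip code)
        for i, value in enumerate(item):
            if value and not result[key][i]:
                result[key][i] = value
        continue

    result[key] = list(item)  # copy so the original row is not modified

for item in result.values():
    print(item)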

Here is what you have to do:
- Define a unique key for each list. You need it in order to match the lists. Using several fields as the key is still possible, but a little harder; read the Better options part if you want to do that.
- Define how you are going to merge them, that is, which data is considered valid.
- Delete the invalid data, or store the valid, merged data in another structure.
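The steps above can be sketched with a composite key. This is only an illustration: the function name `merge_rows`, the `key_fields` parameter, and the choice of name plus phone as the key are assumptions, not something from the question.

```python
def merge_rows(rows, key_fields=(0, 4)):
    """Merge rows that share a key, filling blank fields from duplicates.

    key_fields holds the column indexes that form the key
    (assumed here: 0 = name, 4 = phone).
    """
    merged = {}
    for row in rows:
        key = tuple(row[i] for i in key_fields)
        if key not in merged:
            merged[key] = list(row)  # copy so the input is not mutated
            continue
        kept = merged[key]
        for i, value in enumerate(row):
            # Keep the first non-empty value seen for each field
            if value and not kept[i]:
                kept[i] = value
    return list(merged.values())


rows = [
    ['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '', '(801) 849-0292'],
    ['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '84106', '(801) 849-0292'],
    ['3D HOBBIES', 'SOCIAL CIRCLE', 'GA', '', '(678) 283-9662'],
]

for row in merge_rows(rows):
    print(row)
```

Because the merge fills blanks field by field, it does not matter whether the entry with the zip code comes before or after the one without it.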

Better options

Option 1.

You probably do not want to do it the way I described above: if you have millions of rows, you would be better off saving them to a database. SQLite would be a good choice. The algorithm is mostly the same and could take longer, but the data will be persistent and you will not lose it if something goes wrong in the code or in memory.
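A minimal sketch of the SQLite route, using the standard-library sqlite3 module. The table name `stores`, the column names, and grouping on name, city, state and phone are assumptions made for the example; an in-memory database is used here, so pass a file path to `connect` to actually persist the data.

```python
import sqlite3

rows = [
    ['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '', '(801) 849-0292'],
    ['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '84106', '(801) 849-0292'],
    ['3D HOBBIES', 'SOCIAL CIRCLE', 'GA', '', '(678) 283-9662'],
]

conn = sqlite3.connect(':memory:')  # e.g. 'stores.db' for a persistent file
conn.execute(
    'CREATE TABLE stores (name TEXT, city TEXT, state TEXT, zip TEXT, phone TEXT)'
)
conn.executemany('INSERT INTO stores VALUES (?, ?, ?, ?, ?)', rows)

# MAX(zip) picks the non-empty zip, because '' sorts before any other text
merged = conn.execute(
    'SELECT name, city, state, MAX(zip), phone FROM stores '
    'GROUP BY name, city, state, phone ORDER BY name'
).fetchall()
conn.close()

for row in merged:
    print(row)
```

If two duplicates carried different non-empty zip codes, MAX would silently keep one of them, so a real merge rule may need more care.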

Option 2.

If you are doing anything related to data science, you are probably using pandas, which has a great way of grouping DataFrames by a field.
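As a rough sketch of the pandas route (it assumes pandas is installed and that the phone number is a good enough key, which the question does not guarantee), empty strings can be treated as missing values and each group collapsed to its first non-missing value per column:

```python
import pandas as pd

columns = ['NAME', 'CITY', 'STATE', 'ZIP', 'PHONE']
rows = [
    ['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '', '(801) 849-0292'],
    ['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '84106', '(801) 849-0292'],
    ['3D HOBBIES', 'SOCIAL CIRCLE', 'GA', '', '(678) 283-9662'],
]
df = pd.DataFrame(rows, columns=columns)

merged = (
    df.mask(df == '')                                # empty strings become missing
      .groupby('PHONE', sort=False, as_index=False)  # one group per phone number
      .first()                                       # first non-missing value per column
      .fillna('')                                    # restore blanks for absent zips
)[columns]                                           # back to the original column order

print(merged.values.tolist())
```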

Does it answer your question?

If you can reasonably assume that the phone numbers are unique (we cannot usually rely on names, cities, states, and zip codes to be unique), then something like the following might be a solution for you. Unfortunately, it can have a long running time depending on the size of the lists (O(n^2)).

without_zips = [['1 STOP HOBBY & CRAFT SHOP', 'BATH', 'NY', '', '(607) 776-9293'],
                ['1/32 SLOTCAR RACEWAY', 'UNIVERSITY PLACE', 'WA', '', '(253) 255-1807'],
                ['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '', '(801) 849-0292'],
                ['2 JACKS HOBBIES AND MORE', 'LAFAYETTE', 'LA', '', '(337) 212-2916'],
                ['3D HOBBIES', 'SOCIAL CIRCLE', 'GA', '', '(678) 283-9662'],
                ['3DXHOBBIES', 'GREEN BROOK', 'NJ', '', '(732) 424-6400'],
                ['5TH GEAR POWERSPORTS', 'ELKO', 'NV', '', '(775) 777-3373']]


with_zips =   [['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '84106', '(801) 849-0292'],
               ['2 JACKS HOBBIES AND MORE', 'LAFAYETTE', 'LA', '70507', '(337) 212-2916'],
               ['5TH GEAR POWERSPORTS', 'ELKO', 'NV', '89801', '(775) 777-3373']]



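for with_zip_entry in with_zips:
    this_telephone = with_zip_entry[-1]
    for i, no_zip_entry in enumerate(without_zips):
        that_telephone = no_zip_entry[-1]
        if this_telephone == that_telephone:
            # Remove duplicate without zip code
            del without_zips[i]
            # Stop scanning: continuing after a delete would skip the next entry.
            # This assumes each phone number appears at most once in without_zips.
            break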


print(without_zips+with_zips)
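If the quadratic running time becomes a problem, the same idea can be sketched with a set of phone numbers, which makes each lookup roughly constant time. The function name `merge_by_phone` is made up for this example, and the sketch still assumes that the phone number identifies a store.

```python
def merge_by_phone(without_zips, with_zips):
    """Drop no-zip entries whose phone number also appears in with_zips."""
    phones_with_zip = {row[-1] for row in with_zips}
    unmatched = [row for row in without_zips if row[-1] not in phones_with_zip]
    return unmatched + with_zips


no_zip_sample = [
    ['1 STOP HOBBY & CRAFT SHOP', 'BATH', 'NY', '', '(607) 776-9293'],
    ['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '', '(801) 849-0292'],
]
zip_sample = [
    ['2 DUDES RC HOBBIES', 'SALT LAKE CITY', 'UT', '84106', '(801) 849-0292'],
]

for row in merge_by_phone(no_zip_sample, zip_sample):
    print(row)
```

Unlike the nested loops, this builds new lists instead of deleting from the inputs, so both original lists stay intact.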

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 