
Python merging two CSV by common column

I am trying to merge two CSV files on a common column and write the result to a new file. For example, product.csv has the columns

      product_id     name        
       1           Handwash      
       2           Soap          

and subproduct.csv has the columns

      product_id subproduct_name volume
       1           Dettol         20
       1           Lifebuoy      50
       2           Lux           100

The output file, sales.csv, should look like this:

      product_id     name          subproduct_name     volume
       1           Handwash      Dettol              20
       1           Handwash      Lifebuoy            50
       2           Soap          Lux                 100

I have tried to create two dictionaries:

import csv

with open('product.csv', 'r') as f:
    r = csv.reader(f)
    dict1 = {row[0]: row[1:] for row in r}

with open('subproduct.csv', 'r') as f:
    r = csv.reader(f)
    dict2 = {row[0]: row[1:] for row in r}

Use pandas:

import pandas as pd

products_df = pd.read_csv('product.csv')
subproducts_df = pd.read_csv('subproduct.csv')

# join on the shared column, referred to by its header name
sales_df = pd.merge(products_df, subproducts_df, on='product_id')

Merging with Pandas

Stage 1: Install pandas first if you haven't already (pip install pandas)

Stage 2: Creating the data

data1 = {'product_id': [1, 2],
         'name': ['Handwash', 'Soap']}
data2 = {'product_id': [1, 1, 2],
         'subproduct_name': ['Dettol', 'Lifebuoy', 'Lux'],
         'volume': [20, 50, 100]}

Stage 3: Putting it into dataframes

import pandas as pd

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

Stage 4: Merging the dataframes

output = pd.merge(df1, df2, how="inner")

Merging with Pandas from CSV files

df1 = pd.read_csv('product.csv')
df2 = pd.read_csv('subproduct.csv')

Then apply Stage 4 to df1 and df2 as above.

You can write a script in pure Python. The standard library has a module called csv that should do the trick:

import csv

with open('product.csv') as csv_produto, open('subproduct.csv') as csv_subproduct:
    produto_reader = list(csv.reader(csv_produto, delimiter=','))
    subproduct_reader = list(csv.reader(csv_subproduct, delimiter=','))
    # nested-loop join: compare every product row with every subproduct row
    for p in produto_reader:
        for sp in subproduct_reader:
            if p[0] == sp[0]:
                print('{},{},{},{}'.format(p[0], p[1], sp[1], sp[2]))

That's the main idea. From here you can write the output to a CSV file and add header and exception handling.
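As a minimal sketch of that next step, the nested-loop join can write through csv.writer instead of printing, with the header row handled separately. The function name merge_to_csv and the in-memory io.StringIO stand-ins for the two files are assumptions made for illustration; with real files you would pass objects opened with open(..., newline='').

```python
import csv
import io


def merge_to_csv(product_file, subproduct_file, out_file):
    """Nested-loop join on the first column; writes header and rows to out_file."""
    products = list(csv.reader(product_file))
    subproducts = list(csv.reader(subproduct_file))
    if not products or not subproducts:
        raise ValueError("both inputs need at least a header row")

    writer = csv.writer(out_file)
    # Header: all product columns, then the subproduct columns minus the key.
    writer.writerow(products[0] + subproducts[0][1:])
    for p in products[1:]:
        for sp in subproducts[1:]:
            if p[0] == sp[0]:
                writer.writerow(p + sp[1:])


# In-memory stand-ins for product.csv and subproduct.csv (illustrative only)
product_csv = io.StringIO("product_id,name\n1,Handwash\n2,Soap\n")
subproduct_csv = io.StringIO(
    "product_id,subproduct_name,volume\n1,Dettol,20\n1,Lifebuoy,50\n2,Lux,100\n"
)
out = io.StringIO()
merge_to_csv(product_csv, subproduct_csv, out)
print(out.getvalue())
```

The nested loop compares every pair of rows, so it is fine for small files but slows down quickly as both files grow.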

Others have proposed solutions using pandas. You should consider it if your files are big, or if you need to do this operation often. But the csv module is enough here.

You cannot use plain dicts here because the keys are not unique: subproduct.csv has 2 different rows with the same id 1. So I would use dicts of lists instead.
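A small sketch of the problem, using the subproduct rows from the question as an in-memory list (the variable names are illustrative): a dict comprehension keeps only the last row per key, while a defaultdict of lists keeps them all.

```python
import collections

# Rows of subproduct.csv without the header, as csv.reader would yield them
rows = [
    ["1", "Dettol", "20"],
    ["1", "Lifebuoy", "50"],
    ["2", "Lux", "100"],
]

# Plain dict: the second row with key "1" silently overwrites the first
plain = {row[0]: row[1:] for row in rows}

# Dict of lists: every row for a key is kept
grouped = collections.defaultdict(list)
for row in rows:
    grouped[row[0]].append(row[1:])

print(plain["1"])    # ['Lifebuoy', '50']
print(grouped["1"])  # [['Dettol', '20'], ['Lifebuoy', '50']]
```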

I will assume here that every key is present in product.csv, but that some products may have no associated subproducts (a left outer join, in database terms).

So I will use:

  • a dict for product.csv because I assume that each product_id is unique
  • a defaultdict of lists for subproduct.csv because a single product may have many subproducts
  • the list of ids from product.csv to build the final file
  • a default empty list for subproduct.csv if a product had no subproducts
  • and process headers separately

Code could be:

import collections
import csv

# product.csv: one row per product_id
with open('product.csv') as f:
    r = csv.reader(f)
    header1 = next(r)
    dict1 = {row[0]: row[1:] for row in r}

# subproduct.csv: possibly many rows per product_id
dict2 = collections.defaultdict(list)
with open('subproduct.csv', 'r') as f:
    r = csv.reader(f)
    header2 = next(r)
    for row in r:
        dict2[row[0]].append(row[1:])

with open('merged.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(header1 + header2[1:])
    # one row of empty fields, used when a product has no subproducts
    empty2 = [[''] * (len(header2) - 1)]
    for k in sorted(dict1.keys()):                 # ids are strings: lexicographic order
        for row2 in dict2.get(k, empty2):          # accept no subproducts
            w.writerow([k] + dict1[k] + row2)

Assuming that your csv files are truly Comma Separated Values files, this gives:

product_id,name,subproduct_name,volume
1,Handwash,Dettol,20
1,Handwash,Lifebuoy,50
2,Soap,Lux,100

Please try this:

import pandas as pd

# assumes the two files from the question
product = pd.read_csv('product.csv')
sub_product = pd.read_csv('subproduct.csv')

output = pd.merge(product, sub_product, how='outer', on='product_id')

This joins the two dataframes (product and sub_product) on the product_id column, which is common to both. An outer join returns all records from both dataframes and fills in missing values where a key has no match on the other side. how='inner', which keeps only the records whose key appears in both dataframes, would also have worked in this case.
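The two join types only differ when some key is unmatched. In this sketch a third product, "Shampoo" with id 3, is a hypothetical addition to the question's data so that the difference is visible; the dataframes are built inline rather than read from files.

```python
import pandas as pd

# Question's data plus one hypothetical product (id 3) with no subproducts
product = pd.DataFrame({
    "product_id": [1, 2, 3],
    "name": ["Handwash", "Soap", "Shampoo"],
})
sub_product = pd.DataFrame({
    "product_id": [1, 1, 2],
    "subproduct_name": ["Dettol", "Lifebuoy", "Lux"],
    "volume": [20, 50, 100],
})

# inner: only keys present in both dataframes
inner = pd.merge(product, sub_product, how="inner", on="product_id")

# outer: every key from either dataframe, NaN where there is no match
outer = pd.merge(product, sub_product, how="outer", on="product_id")

print(inner)
print(outer)
```

Here the inner join drops product 3, while the outer join keeps it with missing subproduct_name and volume.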

You can read the data straight into pandas dataframes, and then merge the two dataframes:

import pandas as pd

# load data
product = pd.read_csv('product.csv')
subproduct = pd.read_csv('subproduct.csv')

# merge data
merged = pd.merge(product,subproduct)

# write results to csv
merged.to_csv('sales.csv',index=False)

This works perfectly for your example. Depending on what your actual data looks like, you might need to tweak some of the additional arguments of pd.merge.
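As a hedged illustration of which arguments might be worth tweaking: on names the key column explicitly, how chooses the join type, validate checks the expected key relationship, and indicator adds a _merge column showing where each row came from. The CSV contents are inlined via io.StringIO here only so the sketch is self-contained; with real files you would pass the file names to pd.read_csv.

```python
import io

import pandas as pd

# Inline stand-ins for product.csv and subproduct.csv (illustrative only)
product = pd.read_csv(io.StringIO("product_id,name\n1,Handwash\n2,Soap\n"))
subproduct = pd.read_csv(io.StringIO(
    "product_id,subproduct_name,volume\n1,Dettol,20\n1,Lifebuoy,50\n2,Lux,100\n"
))

merged = pd.merge(
    product,
    subproduct,
    on="product_id",         # join key, stated explicitly
    how="left",              # keep every product, even without subproducts
    validate="one_to_many",  # raise if product_id is duplicated in product
    indicator=True,          # add a _merge column: both / left_only / right_only
)
print(merged)
```

Without on, pd.merge joins on all column names the two dataframes share, which can give surprising results if they have more than the key in common.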

Edit: added the write to csv part
