
Combining 2 .csv files by common column

I have two .csv files where the first line in file 1 is:

MPID,Title,Description,Model,Category ID,Category Description,Subcategory ID,Subcategory Description,Manufacturer ID,Manufacturer Description,URL,Manufacturer (Brand) URL,Image URL,AR Price,Price,Ship Price,Stock,Condition

The first line from file 2:

Regular Price,Sale Price,Manufacturer Name,Model Number,Retailer Category,Buy URL,Product Name,Availability,Shipping Cost,Condition,MPID,Image URL,UPC,Description

and the rest of each file is filled with data.

As you can see, both files have a common field called MPID (file 1: col 1, file 2: col 11, where the first col is col 1).

I would like to create a new file that combines the two by looking at this column: if an MPID appears in both files, the new file should contain that MPID together with its row from file 1 and its row from file 2. If an MPID appears in only one file, it should also go into the combined file.

The files are not sorted in any way.

How can I do this on a Debian machine with either a shell script or Python?

Thanks.

EDIT: Neither file contains commas other than the ones separating the fields.

# Strip the header lines, then sort each file on its MPID column
# (column 1 in file 1, column 11 in file 2):
tail -n +2 file1 | sort -t , -k 1,1 > sorted1
tail -n +2 file2 | sort -t , -k 11,11 > sorted2
# -a 1 -a 2 also emits lines that have no match in the other file:
join -t , -1 1 -2 11 -a 1 -a 2 sorted1 sorted2

This is the classical "relational join" problem.

You have several algorithms.

  • Nested Loops. You read from one file to pick a "master" record. You read the entire other file locating all "detail" records that match the master. This is a bad idea.

  • Sort-Merge. You sort each file into a temporary copy based on the common key. You then merge both files by reading from the master and then reading all matching rows from the detail and writing the merged records.

  • Lookup. You read one of the files entirely into a dictionary in memory, indexed by the key field. This can be tricky for the detail file, where you'll have multiple children per key. Then you read the other file and lookup the matching records in the dictionary.

Of these, sort-merge is often the fastest. The sort step can be done entirely with the Unix sort command, and join handles the merge (see the commands above).

Lookup Implementation

import csv
import collections

# Index file 1 by MPID; each key maps to a list of matching rows.
index = collections.defaultdict(list)

with open("someFile", newline="") as file1:
    rdr = csv.DictReader(file1)
    for row in rdr:
        index[row['MPID']].append(row)

# Scan file 2 and print each row alongside its matching file 1 rows.
with open("anotherFile", newline="") as file2:
    rdr = csv.DictReader(file2)
    for row in rdr:
        print(row, index[row['MPID']])

You'll need to look at the join command in the shell. You will also need to sort the data, and probably drop the header lines. The whole process falls apart if any of the data contains embedded commas. Alternatively, preprocess the data with a CSV-aware tool that introduces a different field separator (perhaps control-A) that you can use to split fields unambiguously.
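As a minimal sketch of that preprocessing step (the recode helper and the file names here are illustrative, not part of the original answer), Python's csv module can re-emit each file with control-A as the delimiter:

import csv

def recode(in_path, out_path, sep="\x01"):
    # Rewrite a CSV file with control-A as the field separator, so that
    # commas embedded inside quoted fields can no longer confuse join.
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter=sep)
        for row in csv.reader(src):
            writer.writerow(row)

recode("file1.csv", "file1.ctrlA")
recode("file2.csv", "file2.ctrlA")

The recoded files can then be sorted and joined with -t $'\x01' in place of -t ,.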

The alternative, using Python, reads the two files into a pair of dictionaries (keyed on the common column(s)) and then uses a loop over all the elements in the smaller of the two dictionaries, looking up matching values in the other. (This is basic nested-loop query processing.) A sketch of this approach follows.
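Here is a minimal sketch of that dictionary approach, assuming MPID values are unique within each file and that it is acceptable for the columns the two files share (e.g. Description, Condition) to be overwritten by file 2's values; the file names are placeholders:

import csv

def load(path, key="MPID"):
    # Read a whole CSV file into a dict keyed on the common column.
    with open(path, newline="") as f:
        return {row[key]: row for row in csv.DictReader(f)}

left = load("file1.csv")
right = load("file2.csv")

# The union of the two key sets gives a full outer join: every MPID
# that appears in either file ends up in the combined output.
all_rows = list(left.values()) + list(right.values())
fields = ["MPID"] + sorted({k for row in all_rows for k in row if k != "MPID"})

with open("combined.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    for mpid in left.keys() | right.keys():
        merged = dict(left.get(mpid, {}))
        merged.update(right.get(mpid, {}))  # file 2 wins on shared columns
        writer.writerow(merged)

Fields missing from one file are simply left empty in the output row, which satisfies the question's requirement that an MPID appearing in only one file still reaches the combined file.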

For merging multiple files (even more than 2) based on one or more common columns, one of the more efficient approaches in Python is to use "brewery". You can even specify which fields should be considered for merging and which fields should be saved.

import brewery
from brewery import ds
import sys

sources = [
    {"file": "grants_2008.csv",
     "fields": ["receiver", "amount", "date"]},
    {"file": "grants_2009.csv",
     "fields": ["id", "receiver", "amount", "contract_number", "date"]},
    {"file": "grants_2010.csv",
     "fields": ["receiver", "subject", "requested_amount", "amount", "date"]}
]

Create a list of all the fields, and add a "file" field to record which source file each row came from. Then go through the source definitions and collect the fields:

all_fields = brewery.FieldList(["file"])

for source in sources:
    for field in source["fields"]:
        if field not in all_fields:
            all_fields.append(field)

out = ds.CSVDataTarget("merged.csv")
out.fields = brewery.FieldList(all_fields)
out.initialize()

for source in sources:
    path = source["file"]

    # Initialize the data source, skipping the header row: the fields
    # were set up previously, so the header is ignored.
    # (Use XLSDataSource for XLS files.)
    src = ds.CSVDataSource(path, read_header=False, skip_rows=1)
    src.fields = ds.FieldList(source["fields"])
    src.initialize()

    for record in src.records():
        # Add the file reference to the output record - to know where
        # the row comes from.
        record["file"] = path
        out.append(record)

    # Close the source stream
    src.finalize()


The merged output can then be pretty-printed with brewery's command-line tool:

cat merged.csv | brewery pipe pretty_printer

It seems that you're trying to do in a shell script what is commonly done with an SQL database. Could you use SQL for this task? For example, you could import both files into MySQL, create the join with a query, and then export the result to CSV.
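If installing MySQL is overkill, the same idea works with Python's built-in sqlite3 module. The sketch below rests on assumptions not in the original answer: load_table and the file names are made up for illustration, and the full outer join is emulated with LEFT JOIN plus UNION ALL because older SQLite versions lack FULL OUTER JOIN:

import csv
import sqlite3

def load_table(conn, table, path):
    # Create a table whose columns mirror the CSV header, then bulk-load
    # the remaining rows.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join('"%s"' % name for name in header)
        conn.execute('CREATE TABLE %s (%s)' % (table, cols))
        marks = ", ".join("?" * len(header))
        conn.executemany("INSERT INTO %s VALUES (%s)" % (table, marks), reader)

conn = sqlite3.connect(":memory:")
load_table(conn, "f1", "file1.csv")
load_table(conn, "f2", "file2.csv")

# Matched pairs, plus f2 rows that have no partner in f1.
cur = conn.execute("""
    SELECT f1.*, f2.* FROM f1 LEFT JOIN f2 ON f1.MPID = f2.MPID
    UNION ALL
    SELECT f1.*, f2.* FROM f2 LEFT JOIN f1 ON f1.MPID = f2.MPID
    WHERE f1.MPID IS NULL
""")

with open("combined.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow([d[0] for d in cur.description])
    writer.writerows(cur)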

You could take a look at my FOSS project CSVfix, which is a stream editor for manipulating CSV files. It supports joins, among other features, and requires no scripting to use.
