Compare two CSV Files with Perl

Question

I have two CSV files that I want to compare with Perl.

I have the code to get the files into Perl using Text::CSV::Slurp and it gives me a nice array of hash references for the files.

Using Data::Dumper::Concise shows all my data imports correctly.

use strict;
use warnings;

use Text::CSV::Slurp;
use Data::Dumper::Concise;

my $file1_src = "IPB-CSV.csv";

my $file2_src = "SRM-CSV.csv";

my $IPB = Text::CSV::Slurp->load(file => $file1_src);
my $SRM = Text::CSV::Slurp->load(file => $file2_src);

print Dumper($IPB);
print Dumper($SRM);

The results of the dump look something like this

$IPB

[
  {
    Drawing => "1001"
  },
  {
    Drawing => "1002"
  },
  {
    Drawing => "1003"
  }
]

$SRM

[
  {
    Drawing => "1001",
    Figure => "Figure 2-8",
    Index => 2,
    Nomenclature => "Some Part"
  },
  {
    Drawing => "1002",
    Figure => "Figure 2-8",
    Index => 2,
    Nomenclature => "Some Part"
  },
  {
    Drawing => "2001",
    Figure => "Figure 2-8",
    Index => 2,
    Nomenclature => "Some Part"
  },
  {
    Drawing => "2002",
    Figure => "Figure 2-8",
    Index => 2,
    Nomenclature => "Some Part"
  }
]

I want to compare the two arrays based on each hash's Drawing key, and create two CSV files as follows:

One containing the items that are in $IPB but not in $SRM, containing only the data in the Drawing column.
Another containing the items that are in $SRM but not in $IPB, containing all the fields that are related to the Drawing column.

I have found lots of information on comparing files to see if they match, or on comparing hashes or arrays for single pieces of data, but I can't find anything specific to what I need.

Answer 1

Since Drawing is the comparison criterion, why not "index" the data into something a little more convenient: a hash where the Drawing value is the key and the list of matching records is the value?

my %ipb;
for my $record ( @$IPB ) {
    my $index = $record->{Drawing};
    push @{ $ipb{$index} }, $record;
}

my %srm;
for my $record ( @$SRM ) {
    my $index = $record->{Drawing};
    push @{ $srm{$index} }, $record;
}

Now it should be a breeze to figure out the indexes unique to $IPB and $SRM:

use List::MoreUtils 'uniq';
my @unique_ipb = uniq( grep { $ipb{$_} and not $srm{$_} } keys( %ipb ), keys( %srm ) );
my @unique_srm = uniq( grep { $srm{$_} and not $ipb{$_} } keys( %ipb ), keys( %srm ) );

What's common to both?

my @intersect = uniq( grep { $srm{$_} and $ipb{$_} } keys( %ipb ), keys( %srm ) );

What are all the figure number(s) for Drawing index 1002?

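# IPB records have no Figure field, so skip undefined values to avoid warnings.
print "$_\n" for grep { defined } map { $_->{Figure} } @{ $ipb{1002} // [] }, @{ $srm{1002} // [] };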
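This answer stops at the lists of unique keys. As a minimal sketch of the remaining step, the code below turns those keys into the two CSV outputs the question asks for. It builds the output as strings rather than files so it can be run as-is. The sample data, the column order in @columns, the header line on the first output, and the helper names index_by_drawing and csv_line are all assumptions made for illustration. csv_line is deliberately naive and only works for fields without commas, quotes, or newlines.

```perl
use strict;
use warnings;

# Sample data mirroring the question's dump (assumed, abbreviated).
my $IPB = [ map { +{ Drawing => $_ } } qw(1001 1002 1003) ];
my $SRM = [
    map { +{ Drawing => $_, Figure => 'Figure 2-8', Index => 2, Nomenclature => 'Some Part' } }
    qw(1001 1002 2001 2002)
];

# Index a list of records by Drawing, keeping every record that shares a key.
sub index_by_drawing {
    my ($records) = @_;
    my %index;
    push @{ $index{ $_->{Drawing} } }, $_ for @$records;
    return \%index;
}

my $ipb = index_by_drawing($IPB);
my $srm = index_by_drawing($SRM);

# Hash keys are already unique, so one pass over each side is enough.
my @unique_ipb = grep { !exists $srm->{$_} } sort keys %$ipb;
my @unique_srm = grep { !exists $ipb->{$_} } sort keys %$srm;

# Assumed column order; a hash cannot recover the order used in the file.
my @columns = qw(Drawing Figure Index Nomenclature);

# Naive CSV line: assumes no field contains a comma, quote, or newline.
sub csv_line { return join( ',', @_ ) . "\n" }

# First output: Drawing values found only in IPB (header line is an assumption).
my $ipb_only_csv = csv_line('Drawing');
$ipb_only_csv .= csv_line($_) for @unique_ipb;

# Second output: full records for Drawing values found only in SRM.
my $srm_only_csv = csv_line(@columns);
for my $drawing (@unique_srm) {
    $srm_only_csv .= csv_line( @{$_}{@columns} ) for @{ $srm->{$drawing} };
}

print $ipb_only_csv, "\n", $srm_only_csv;
```

For real data with quoted fields, a CSV module such as Text::CSV would be a safer way to produce each line than a plain join.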

Answer 2

This short program uses your example values for $ipb and $srm and creates the output that I think you want. (Please don't use capital letters for anything but global identifiers like package names.)

There are a couple of problems:

Using Text::CSV::Slurp leaves you with two arrays of hashes that are no use for this task without further indexing. You would be much better off creating appropriate data structures from scratch by processing the file line by line.
You say that your second file must contain all of the information related to each Drawing key, but, because Perl hashes are inherently unordered, Text::CSV::Slurp has lost the order of the field names. The best that can be done is to print the data in whatever order it is found, preceded by a header line showing the field names. This is another reason for avoiding Text::CSV::Slurp.

use strict;
use warnings;
use autodie;

# The original data

my $ipb = [{ Drawing => 1001 }, { Drawing => 1002 }, { Drawing => 1003 }];

my $srm = [
  {
    Drawing => "1001",
    Figure => "Figure 2-8",
    Index => 2,
    Nomenclature => "Some Part"
  },
  {
    Drawing => "1002",
    Figure => "Figure 2-8",
    Index => 2,
    Nomenclature => "Some Part"
  },
  {
    Drawing => "2001",
    Figure => "Figure 2-8",
    Index => 2,
    Nomenclature => "Some Part"
  },
  {
    Drawing => "2002",
    Figure => "Figure 2-8",
    Index => 2,
    Nomenclature => "Some Part"
  }
];

# Index the data

my %srm;
for my $item (@$srm) {
  my $drawing = $item->{Drawing};
  $srm{$drawing} = $item;
}

my %ipb;
for my $item (@$ipb) {
  my $drawing = $item->{Drawing};
  $ipb{$drawing} = 1;
}

# Create the output files

open my $csv1, '>', 'file1.csv';
for my $id (sort keys %ipb) {
  next if $srm{$id};
  print $csv1 $id, "\n";
}
close $csv1;

open my $csv2, '>', 'file2.csv';
my @keys = keys %{ $srm->[0] };
print $csv2 join(',', @keys), "\n";
for my $id (sort keys %srm) {
  next if $ipb{$id};
  print $csv2 join(',', @{$srm{$id}}{@keys}), "\n"; 
}
close $csv2;

Output

file1.csv

1003

file2.csv

Drawing,Nomenclature,Index,Figure
2001,Some Part,2,Figure 2-8
2002,Some Part,2,Figure 2-8
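The line-by-line processing recommended above is not shown in this answer. A rough sketch of it, under some loudly stated assumptions, might look like the following. It reads the header once to keep the column order, indexes each row by Drawing, and prints the SRM-only rows in the original column order. The helper name read_indexed and the in-memory sample contents are hypothetical, and the split on commas only works for simple fields with no embedded commas, quotes, or newlines.

```perl
use strict;
use warnings;

# Read a simple CSV from a filehandle. Returns the header names in file
# order plus a hash of rows keyed by the Drawing column.
# Assumption: plain fields with no embedded commas, quotes, or newlines.
sub read_indexed {
    my ($fh) = @_;
    chomp( my $header = <$fh> );
    my @columns = split /,/, $header;
    my %rows;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;              # skip blank lines
        my %row;
        @row{@columns} = split /,/, $line, -1; # -1 keeps trailing empty fields
        $rows{ $row{Drawing} } = \%row;
    }
    return ( \@columns, \%rows );
}

# In-memory stand-ins for IPB-CSV.csv and SRM-CSV.csv (contents assumed).
my $ipb_text = "Drawing\n1001\n1002\n1003\n";
my $srm_text = "Drawing,Figure,Index,Nomenclature\n"
             . "1001,Figure 2-8,2,Some Part\n"
             . "2001,Figure 2-8,2,Some Part\n";

open my $ipb_fh, '<', \$ipb_text or die "Cannot open in-memory IPB data: $!";
open my $srm_fh, '<', \$srm_text or die "Cannot open in-memory SRM data: $!";

my ( $ipb_cols, $ipb ) = read_indexed($ipb_fh);
my ( $srm_cols, $srm ) = read_indexed($srm_fh);

# Rows present in SRM only, with the columns in their original file order.
my @srm_only = ( join ',', @$srm_cols );
for my $id ( sort keys %$srm ) {
    next if exists $ipb->{$id};
    push @srm_only, join ',', @{ $srm->{$id} }{@$srm_cols};
}

print "$_\n" for @srm_only;
```

Because the header is kept as an array, the output columns follow the input file instead of Perl's hash ordering. For real files, replacing the split with a proper CSV parser would handle quoted fields.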

Answer 3

This is a bit complicated, because your data structures are less than ideal for comparing. You have references to arrays of hash references, and you care about the data in one of the keys of each hashref. My first step would be to flatten IPB to a plain array of Drawing values (since its records hold nothing else), and convert SRM to a single hashref keyed by Drawing.

my @ipbarray = map { $_->{Drawing} } @$IPB; # Dereference $IPB and collect the Drawing values.
my $srmhash = {};
for my $hash (@$SRM) {
  $srmhash->{ $hash->{Drawing} } //= $hash; # Don't overwrite if it exists
}

Now we have 2 more workable data structures.

Next step is to contrast these values:

my @ipbonly = ();
my @srmonly = ();

for my $ipbitem (@ipbarray) {
  push @ipbonly, { Drawing => $ipbitem } unless defined $srmhash->{$ipbitem};
}

for my $srmitem (keys %$srmhash) {
  # Compare as strings, since Drawing values are not guaranteed to be numeric
  push @srmonly, $srmhash->{$srmitem} unless grep { $_ eq $srmitem } @ipbarray;
}

At this point, @ipbonly and @srmonly will contain the data you want.
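The second loop above rescans @ipbarray with grep for every SRM key, which is fine for small files but grows with the product of the two sizes. One possible variation, sketched here with assumed sample data and hypothetical lookup hashes %in_ipb and %in_srm, builds a lookup hash per side and filters the original arrays directly.

```perl
use strict;
use warnings;

# Abbreviated sample data in the shape shown in the question (assumed).
my $IPB = [ map { +{ Drawing => $_ } } qw(1001 1002 1003) ];
my $SRM = [ map { +{ Drawing => $_, Figure => 'Figure 2-8' } } qw(1001 1002 2001 2002) ];

# Build a lookup hash for each side; exists() is then a single hash probe.
my %in_ipb = map { ( $_->{Drawing} => 1 ) } @$IPB;
my %in_srm = map { ( $_->{Drawing} => 1 ) } @$SRM;

# Filtering the original arrays keeps the records in their input order.
my @ipbonly = grep { !exists $in_srm{ $_->{Drawing} } } @$IPB;
my @srmonly = grep { !exists $in_ipb{ $_->{Drawing} } } @$SRM;

my $ipb_ids = join ' ', map { $_->{Drawing} } @ipbonly;
my $srm_ids = join ' ', map { $_->{Drawing} } @srmonly;

print "IPB only: $ipb_ids\n";
print "SRM only: $srm_ids\n";
```

Unlike the hashref approach, this keeps every SRM record for a repeated Drawing value rather than only the first one, which may or may not be what you want.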
