
Create a single CSV from multiple CSVs in a directory: copy both columns from the first CSV, then only the 2nd column from each subsequent CSV

I am looking to create a single CSV from many CSVs in a directory. I know this has been covered many times, but I have a slight twist. What I am looking to do:

  1. Find the largest file.
  2. Use the largest file as the base. The first column of the largest file is the primary key on which I need to merge the rest of the files.
  3. Compare each file in the directory against the primary key from the base CSV and add that file's second column to the merged output.
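Step 1 above can be sketched in shell; this assumes the CSVs sit in one directory and that "largest" means most rows (one timestamp per line, so row count and byte size rank the files the same way):

```shell
# Pick the CSV with the most rows as the base file.
# wc -l prints a trailing "total" line when given several files; drop it,
# sort the counts in descending numeric order, and keep the top filename.
base=$(wc -l ./*.csv | grep -v ' total$' | sort -k1,1nr | head -n1 | awk '{print $2}')
echo "base file: $base"
```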

With that being said, I am working with the following:

I found this link showing how to take one column from one CSV and add it to another:

https://askubuntu.com/questions/553219/add-column-from-one-csv-to-another-csv-file

I can use something like this to add the column from one file to another:

paste -d, file2 <(cut -d, -f3- file1)
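One caveat with paste: it pairs lines strictly by position, so a single missing timestamp in one file would shift every later value onto the wrong row. Since the files here do drop timestamps, a key-aware merge with join is a safer sketch (GNU join assumed for `-o auto`; `file1`/`file2` are placeholder names):

```shell
# join matches rows on the first comma-separated field instead of by
# position.  Both inputs must be sorted on that key first.
#   -a1       keep rows that appear only in file1
#   -e null   fill missing fields with "null"
#   -o auto   pad every output line to the full column count (GNU join)
join -t, -a1 -e null -o auto \
    <(sort -t, -k1,1 file1) \
    <(sort -t, -k1,1 file2)
# Note: the header row joins on the key "date" like any other row, but it
# sorts below the timestamps ("d" > "2"), so it ends up as the last line.
```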

The following PHP gets me the file list for a directory; now I am trying to use PHP to combine/merge the CSVs.

$dir = $Folder.'/Stats/Latency/';   // directory name
$ar  = scandir($dir);
$box = $_POST['box'];               // receive the file list from the form

// Loop through the list of selected files.
// Note: each() is deprecated as of PHP 7.2; foreach does the same job.
foreach ($box as $val) {
    $path = $dir."/".$val;
    $dest = $Folder."/Report/Latency/".$val;
    if (copy($path, $dest)) {
        // echo "Copy complete for file ";
    }
    echo "$val,";
}
echo "<hr>";
echo "<hr>";

This is where I need the CSV merge (below). I am debating whether to use shell_exec commands, but that seems very labor intensive.

$reportFiles = $Folder."/Report/Latency/";
foreach (glob($reportFiles."*.csv") as $file)
{
    // Note the concatenation operator: the original line was missing a dot
    // after "touch " and would not parse.
    shell_exec("touch ".$reportFiles."latencyReport.csv");
}

As it relates to the data in the csv files:

CSV1:

date,vpool06
2016-03-28 12:00:00,0.000
2016-03-28 12:01:00,0.000
2016-03-28 12:02:00,0.000
2016-03-28 12:03:00,0.000
2016-03-28 12:04:00,0.000
2016-03-28 12:05:00,0.000
2016-03-28 12:06:00,0.000
2016-03-28 12:07:00,0.000
2016-03-28 12:08:00,0.000
2016-03-28 12:09:00,0.000
2016-03-28 12:10:00,0.000
2016-03-28 12:11:00,0.000
2016-03-28 12:12:00,0.000
2016-03-28 12:13:00,0.000
2016-03-28 12:14:00,0.000
2016-03-28 12:15:00,0.000
2016-03-28 12:16:00,0.000
2016-03-28 12:17:00,0.000
2016-03-28 12:18:00,0.000
2016-03-28 12:19:00,0.000

CSV2:

date,vpool02
2016-03-28 12:00:00,0.000
2016-03-28 12:01:00,0.000
2016-03-28 12:02:00,0.000
2016-03-28 12:04:00,0.000
2016-03-28 12:05:00,0.000
2016-03-28 12:06:00,0.000
2016-03-28 12:07:00,0.000
2016-03-28 12:08:00,0.000
2016-03-28 12:09:00,0.000
2016-03-28 12:10:00,0.000
2016-03-28 12:11:00,0.000
2016-03-28 12:12:00,0.000
2016-03-28 12:13:00,0.000
2016-03-28 12:14:00,0.000

CSV3:

date,vpool03
2016-03-28 12:00:00,0.000
2016-03-28 12:01:00,0.000
2016-03-28 12:02:00,0.000
2016-03-28 12:04:00,0.000
2016-03-28 12:05:00,0.000

Merged CSV:

date,vpool06,vpool02,vpool03
2016-03-28 12:00:00,0.000,0.000,0.000
2016-03-28 12:01:00,0.000,0.000,0.000
2016-03-28 12:02:00,0.000,0.000,0.000
2016-03-28 12:03:00,0.000,,
2016-03-28 12:04:00,0.000,0.000,0.000
2016-03-28 12:05:00,0.000,0.000,0.000
2016-03-28 12:06:00,0.000,0.000,
2016-03-28 12:07:00,0.000,0.000,
2016-03-28 12:08:00,0.000,0.000,
2016-03-28 12:09:00,0.000,0.000,
2016-03-28 12:10:00,0.000,0.000,
2016-03-28 12:11:00,0.000,0.000,
2016-03-28 12:12:00,0.000,0.000,
2016-03-28 12:13:00,0.000,0.000,
2016-03-28 12:14:00,0.000,0.000,
2016-03-28 12:15:00,0.000,,
2016-03-28 12:16:00,0.000,,
2016-03-28 12:17:00,0.000,,
2016-03-28 12:18:00,0.000,,
2016-03-28 12:19:00,0.000,,

Ideally, I don't care if there is a "null" value at this point because it just won't show up in the graph; it means the server was off at the time. I do need a null in the spaces where there is no data.

Update: an example of the desired output with nulls:

date,vpool06,7NA_01,7NA_02,bd01,bd02,vpool01,vpool02,vpool03,vpool04,vpool07
2016-03-28 12:00:00,1.000,null,10.00,02.00,20.00,0.00,0.00,0.00,0.00,0.000
2016-03-28 12:01:00,0.000,11.00,110.00,null,11.00,0.00,0.00,0.00,0.00,0.000
2016-03-28 12:02:00,0.000,null,0.00,2.00,100,0.00,0.00,0.00,0.00,0.000
2016-03-28 12:03:00,0.000,0.00,0.00,02.00,10.00,0.00,0.000,0.00,0.00,0.000

awk to the rescue!

$ awk -F, -v OFS=, 'FNR==1{c++} {a[$1,c]=$2; keys[$1]}
                       END{for(k in keys)
                            {printf "%s", k;
                             for(i=1;i<=c;i++)
                                 printf "%s", OFS (((k,i) in a)?a[k,i]:"");
                             print ""}}' file{1,2,3} |
  sort -t, -k1,1 > sorted.tmp

$ # the header row sorts below the timestamps ("d" > "2"), so move the last
$ # line back to the top; a temp file makes this deterministic, unlike
$ # racing two process substitutions into the same output file
$ { tail -1 sorted.tmp; sed '$d' sorted.tmp; } > merged

$ cat merged

date,vpool06,vpool02,vpool03                                                                                          
2016-03-28 12:00:00,0.000,0.000,0.000                                                                                 
2016-03-28 12:01:00,0.000,0.000,0.000
2016-03-28 12:02:00,0.000,0.000,0.000
2016-03-28 12:03:00,0.000,,
2016-03-28 12:04:00,0.000,0.000,0.000
2016-03-28 12:05:00,0.000,0.000,0.000
2016-03-28 12:06:00,0.000,0.000,
2016-03-28 12:07:00,0.000,0.000,
2016-03-28 12:08:00,0.000,0.000,
2016-03-28 12:09:00,0.000,0.000,
2016-03-28 12:10:00,0.000,0.000,
2016-03-28 12:11:00,0.000,0.000,
2016-03-28 12:12:00,0.000,0.000,
2016-03-28 12:13:00,0.000,0.000,
2016-03-28 12:14:00,0.000,0.000,
2016-03-28 12:15:00,0.000,,
2016-03-28 12:16:00,0.000,,
2016-03-28 12:17:00,0.000,,
2016-03-28 12:18:00,0.000,,
2016-03-28 12:19:00,0.000,,

I've no idea how you'd do that in PHP, but with GNU awk for true 2D arrays and sorted "in" it'd be:

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 { hdr[ARGIND][1]=$1; hdr[ARGIND][2]=$2; next }
{ arr[ARGIND][$1] = $2 }
END {
    for (idx in arr) {
        numRows = length(arr[idx])
        if (numRows > maxRows) {
            maxRows = numRows
            maxIdx  = idx
        }
    }

    printf "%s%s%s", hdr[maxIdx][1], OFS, hdr[maxIdx][2]
    for (idx=1; idx<=ARGIND; idx++) {
        if (idx != maxIdx) {
            printf "%s%s", OFS, hdr[idx][2]
        }
    }
    print ""

    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (tstamp in arr[maxIdx]) {
        printf "%s%s%s", tstamp, OFS, arr[maxIdx][tstamp]
        for (idx=1; idx<=ARGIND; idx++) {
            if (idx != maxIdx) {
                printf "%s%s", OFS, (tstamp in arr[idx] ? arr[idx][tstamp] : "null")
            }
        }
        print ""
    }
}


$ awk -f tst.awk csv3 csv2 csv1
date,vpool06,vpool03,vpool02
2016-03-28 12:00:00,0.000,0.000,0.000
2016-03-28 12:01:00,0.000,0.000,0.000
2016-03-28 12:02:00,0.000,0.000,0.000
2016-03-28 12:03:00,0.000,null,null
2016-03-28 12:04:00,0.000,0.000,0.000
2016-03-28 12:05:00,0.000,0.000,0.000
2016-03-28 12:06:00,0.000,null,0.000
2016-03-28 12:07:00,0.000,null,0.000
2016-03-28 12:08:00,0.000,null,0.000
2016-03-28 12:09:00,0.000,null,0.000
2016-03-28 12:10:00,0.000,null,0.000
2016-03-28 12:11:00,0.000,null,0.000
2016-03-28 12:12:00,0.000,null,0.000
2016-03-28 12:13:00,0.000,null,0.000
2016-03-28 12:14:00,0.000,null,0.000
2016-03-28 12:15:00,0.000,null,null
2016-03-28 12:16:00,0.000,null,null
2016-03-28 12:17:00,0.000,null,null
2016-03-28 12:18:00,0.000,null,null
2016-03-28 12:19:00,0.000,null,null
