
Create a single CSV from multiple CSVs in a directory: copy both columns of the first CSV, then only the 2nd column of each subsequent CSV

I am looking to create a single CSV from many CSVs in a directory. I know that this has been covered many times; however, I have a slight twist. Things I am looking to do:

  1. Find the largest file.
  2. Use the largest file as the base. The first column of the largest file will be the primary key on which I merge the rest of the files.
  3. Compare each file in the directory against the primary key from the base CSV and add the second column of each CSV to the largest one.

With that being said, I am working with the following:

I found this link for taking one column from one CSV to another:

https://askubuntu.com/questions/553219/add-column-from-one-csv-to-another-csv-file

I can utilize something like this to add a column from one file to another:

paste -d, file2 <(cut -d, -f3- file1)
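Adapted to this question's layout (take only the second column of the other file), the same idea can be flipped around. A minimal sketch with hypothetical file names and values, with one important caveat: paste matches rows by position, not by the date key, so it only works when the rows already line up one-to-one:

```shell
# Hypothetical two-row samples in the question's format.
printf 'date,vpool06\n2016-03-28 12:00:00,0.000\n' > file1
printf 'date,vpool02\n2016-03-28 12:00:00,0.111\n' > file2

# Append only the 2nd column of file2 to file1 (positional, not keyed).
merged=$(cut -d, -f2 file2 | paste -d, file1 -)
echo "$merged"
# date,vpool06,vpool02
# 2016-03-28 12:00:00,0.000,0.111
```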

The following PHP gets me the file list for a directory; now I am trying to leverage PHP to combine/merge the CSVs:

$dir = $Folder . '/Stats/Latency/'; // source directory
$ar  = scandir($dir);
$box = $_POST['box'];               // receive the file list from the form

// Loop through the list of selected files
foreach ($box as $val) {
    $path = $dir . "/" . $val;
    $dest = $Folder . "/Report/Latency/" . $val;
    if (copy($path, $dest)) {       // echo the file name once copied
        echo "$val,";
    }
}
echo "<hr>";
echo "<hr>";

This is where I need the CSV merge below. I am debating whether to utilize shell_exec commands, but that seems very labor-intensive.

$reportFiles = $Folder . "/Report/Latency/";
foreach (glob($reportFiles . "*.csv") as $file)
{
   shell_exec("touch " . escapeshellarg($reportFiles . "latencyReport.csv"));
}

As it relates to the data in the CSV files:

CSV1:

date,vpool06
2016-03-28 12:00:00,0.000
2016-03-28 12:01:00,0.000
2016-03-28 12:02:00,0.000
2016-03-28 12:03:00,0.000
2016-03-28 12:04:00,0.000
2016-03-28 12:05:00,0.000
2016-03-28 12:06:00,0.000
2016-03-28 12:07:00,0.000
2016-03-28 12:08:00,0.000
2016-03-28 12:09:00,0.000
2016-03-28 12:10:00,0.000
2016-03-28 12:11:00,0.000
2016-03-28 12:12:00,0.000
2016-03-28 12:13:00,0.000
2016-03-28 12:14:00,0.000
2016-03-28 12:15:00,0.000
2016-03-28 12:16:00,0.000
2016-03-28 12:17:00,0.000
2016-03-28 12:18:00,0.000
2016-03-28 12:19:00,0.000

CSV2:

date,vpool02
2016-03-28 12:00:00,0.000
2016-03-28 12:01:00,0.000
2016-03-28 12:02:00,0.000
2016-03-28 12:04:00,0.000
2016-03-28 12:05:00,0.000
2016-03-28 12:06:00,0.000
2016-03-28 12:07:00,0.000
2016-03-28 12:08:00,0.000
2016-03-28 12:09:00,0.000
2016-03-28 12:10:00,0.000
2016-03-28 12:11:00,0.000
2016-03-28 12:12:00,0.000
2016-03-28 12:13:00,0.000
2016-03-28 12:14:00,0.000

CSV3:

date,vpool03
2016-03-28 12:00:00,0.000
2016-03-28 12:01:00,0.000
2016-03-28 12:02:00,0.000
2016-03-28 12:04:00,0.000
2016-03-28 12:05:00,0.000

Merged CSV:

date,vpool06,vpool02,vpool03
2016-03-28 12:00:00,0.000,0.000,0.000
2016-03-28 12:01:00,0.000,0.000,0.000
2016-03-28 12:02:00,0.000,0.000,0.000
2016-03-28 12:03:00,0.000,,0.000
2016-03-28 12:04:00,0.000,0.000,0.000
2016-03-28 12:05:00,0.000,0.000,0.000
2016-03-28 12:06:00,0.000,0.000,
2016-03-28 12:07:00,0.000,0.000,
2016-03-28 12:08:00,0.000,0.000,
2016-03-28 12:09:00,0.000,0.000,
2016-03-28 12:10:00,0.000,0.000,
2016-03-28 12:11:00,0.000,0.000,
2016-03-28 12:12:00,0.000,0.000,
2016-03-28 12:13:00,0.000,0.000,
2016-03-28 12:14:00,0.000,0.000,
2016-03-28 12:15:00,0.000,,
2016-03-28 12:16:00,0.000,,
2016-03-28 12:17:00,0.000,,
2016-03-28 12:18:00,0.000,,
2016-03-28 12:19:00,0.000,,

Ideally, I don't care if there is a "null" value at this point, because it just won't show up in the graph; it means the server was off at that time.

I need it to have null in the spaces where there is no data.

Update: an example:

date,vpool06,7NA_01,7NA_02,bd01,bd02,vpool01,vpool02,vpool03,vpool04,vpool07
2016-03-28 12:00:00,1.000,null,10.00,02.00,20.00,0.00,0.00,0.00,0.00,0.000
2016-03-28 12:01:00,0.000,11.00,110.00,null,11.00,0.00,0.00,0.00,0.00,0.000
2016-03-28 12:02:00,0.000,null,0.00,2.00,100,0.00,0.00,0.00,0.00,0.000
2016-03-28 12:03:00,0.000,0.00,0.00,02.00,10.00,0.00,0.000,0.00,0.00,0.000
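For reference, the whole requirement can also be scripted with standard shell tools: pick the largest file as the base, then left-join each remaining file on the date column, writing null for missing values. This is only a sketch under the assumption of GNU coreutils (join's -o auto is a GNU extension); the sample files and scratch directory are made up for the demo:

```shell
# Scratch directory with two abridged samples (from CSV1/CSV2 above).
dir=$(mktemp -d) && cd "$dir"
printf 'date,vpool06\n2016-03-28 12:00:00,0.000\n2016-03-28 12:01:00,0.000\n2016-03-28 12:03:00,0.000\n' > csv1.csv
printf 'date,vpool02\n2016-03-28 12:00:00,0.000\n2016-03-28 12:01:00,0.000\n' > csv2.csv

base=$(ls -S ./csv*.csv | head -n 1)   # steps 1-2: largest file is the base
cp "$base" merged.csv
for f in ./csv*.csv; do                # step 3: join each remaining file
    [ "$f" = "$base" ] && continue
    # -a1 keeps unmatched base rows; -e null fills their missing fields.
    # --nocheck-order: the header sorts after the dates, but the row
    # order is consistent across files, so the join still lines up.
    join -t, -a1 -e null -o auto --nocheck-order merged.csv "$f" \
        > merged.next && mv merged.next merged.csv
done
cat merged.csv
```

Each pass adds one column to merged.csv, so the column order follows the glob order rather than file size, with the base file's column first.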

awk to the rescue!

$ awk -F, -v OFS=, 'FNR==1{c++} {a[$1,c]=$2;keys[$1]}
                       END{for(k in keys) 
                            {printf "%s", k; 
                             for(i=1;i<=c;i++) 
                                 printf "%s", OFS (((k,i) in a)?a[k,i]:""); 
                             print ""}}' file{1,2,3} | 
 sort -t, -k1,1 | 
 tee >(sed '$d' > merged) >(tail -1 >> merged) 

$ cat merged

date,vpool06,vpool02,vpool03                                                                                          
2016-03-28 12:00:00,0.000,0.000,0.000                                                                                 
2016-03-28 12:01:00,0.000,0.000,0.000
2016-03-28 12:02:00,0.000,0.000,0.000
2016-03-28 12:03:00,0.000,,
2016-03-28 12:04:00,0.000,0.000,0.000
2016-03-28 12:05:00,0.000,0.000,0.000
2016-03-28 12:06:00,0.000,0.000,
2016-03-28 12:07:00,0.000,0.000,
2016-03-28 12:08:00,0.000,0.000,
2016-03-28 12:09:00,0.000,0.000,
2016-03-28 12:10:00,0.000,0.000,
2016-03-28 12:11:00,0.000,0.000,
2016-03-28 12:12:00,0.000,0.000,
2016-03-28 12:13:00,0.000,0.000,
2016-03-28 12:14:00,0.000,0.000,
2016-03-28 12:15:00,0.000,,
2016-03-28 12:16:00,0.000,,
2016-03-28 12:17:00,0.000,,
2016-03-28 12:18:00,0.000,,
2016-03-28 12:19:00,0.000,,
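A note on the last step of the pipeline above: after sorting, the header row ends up last, because the digit '2' sorts before the letter 'd'; the tee/sed/tail combination is there to splice that last line back to the front. A simpler, sequential sketch of the same idea:

```shell
# Simulate a sorted file whose header landed on the last line.
printf '2016-03-28 12:00:00,0.000\ndate,vpool06\n' > sorted.csv

# Print the last line (the header) first, then everything except it.
fixed=$(tail -n 1 sorted.csv; sed '$d' sorted.csv)
echo "$fixed"
# date,vpool06
# 2016-03-28 12:00:00,0.000
```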

I've no idea how you'd do that in PHP, but with GNU awk for true 2D arrays and sorted "in", it'd be:

$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 { hdr[ARGIND][1]=$1; hdr[ARGIND][2]=$2; next }
{ arr[ARGIND][$1] = $2 }
END {
    for (idx in arr) {
        numRows = length(arr[idx])
        if (numRows > maxRows) {
            maxRows = numRows
            maxIdx  = idx
        }
    }

    printf "%s%s%s", hdr[maxIdx][1], OFS, hdr[maxIdx][2]
    for (idx=1; idx<=ARGIND; idx++) {
        if (idx != maxIdx) {
            printf "%s%s", OFS, hdr[idx][2]
        }
    }
    print ""

    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (tstamp in arr[maxIdx]) {
        printf "%s%s%s", tstamp, OFS, arr[maxIdx][tstamp]
        for (idx=1; idx<=ARGIND; idx++) {
            if (idx != maxIdx) {
                printf "%s%s", OFS, (tstamp in arr[idx] ? arr[idx][tstamp] : "null")
            }
        }
        print ""
    }
}


$ awk -f tst.awk csv3 csv2 csv1
date,vpool06,vpool03,vpool02
2016-03-28 12:00:00,0.000,0.000,0.000
2016-03-28 12:01:00,0.000,0.000,0.000
2016-03-28 12:02:00,0.000,0.000,0.000
2016-03-28 12:03:00,0.000,null,null
2016-03-28 12:04:00,0.000,0.000,0.000
2016-03-28 12:05:00,0.000,0.000,0.000
2016-03-28 12:06:00,0.000,null,0.000
2016-03-28 12:07:00,0.000,null,0.000
2016-03-28 12:08:00,0.000,null,0.000
2016-03-28 12:09:00,0.000,null,0.000
2016-03-28 12:10:00,0.000,null,0.000
2016-03-28 12:11:00,0.000,null,0.000
2016-03-28 12:12:00,0.000,null,0.000
2016-03-28 12:13:00,0.000,null,0.000
2016-03-28 12:14:00,0.000,null,0.000
2016-03-28 12:15:00,0.000,null,null
2016-03-28 12:16:00,0.000,null,null
2016-03-28 12:17:00,0.000,null,null
2016-03-28 12:18:00,0.000,null,null
2016-03-28 12:19:00,0.000,null,null
