
Combine multiple CSV files with Same Header into different group files

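Hi, I am looking for the fastest solution for processing a large batch of CSV files.

Situation: I have multiple CSV files in a single folder, with different headers.

I have already pre-processed them to remove the junk lines at the top, so every file now starts with a standard header line.

I want to merge the CSV files that have exactly the same header into group files in a new folder.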

Single Folder:
    Tree 
    ├── 161598827330618_data_aa.csv 
    ├── ..............  
    ├── ............... 
    ├── ................ 
    ├── 161598852706227_data_bh.csv 
Note: Filenames are random with no pattern.

Sample File-1.csv

School Name,Project Description,Construction Award,Project type,Building ID,Building Address,City,Postcode
George HS - QUEENS,New,76850000,CAP,Q298,50-51 98TH STREET,Queens,11368 
MARBLE HILL INTERNATIONAL HS -,EXT MASONRY/FLOOD/PARAPETS/ROOFS,10490000,CIP,X475,99 TERRACE VIEW AVENUE,Bronx,10463
NEW DORP HS - STATEN ISLAND,PARTIAL ACCESSIBILITY,488000,CIP,R435,465 NEW DORP LANE,Staten Island,10306

Sample File-2.csv

School Name,Project Description,Construction Award,Project type,Building ID,Building Address,City,Postcode
EAST SIDE COMMUNITY SCHOOL,FIFTH FLOOR CEILING REPLACEMENT,150000,CIP,M060,420 EAST 12 STREET,Manhattan,10009
RICHMOND HILL HS - QUEENS,STEEL DETERIORATED COLUMS & COLUMN,1064400,CIP,Q475,89-30 114 STREET,Queens,11418
SUCCESS ACADEMY CHARTER SCHOOL,INTERIOR STAIRS,2045000,CIP,M099,410 EAST 100 STREET,Manhattan,10029

Sample File-3.csv

Reporting Period,Project Number,City,County,Zip Code,Sector,Solicitation,Electric Utility
02/28/2021,2453,Youngstown,,14174,Non-Residential,ARRA Projects,National Grid
02/28/2021,218852,Queens,Queens,11356,Residential,PON 2112,Consolidated Edison
02/28/2021,220037,Warwick,Orange,10990,Residential,PON 2112,Orange and Rockland Utilities
02/28/2021,2011-230103-SLPR,Center Moriches,Suffolk,11934,Residential,Solar ARRA Funding,Long Island Power Authority

Sample File-4.csv

Reporting Period,Project Number,City,County,Zip Code,Sector,Solicitation,Electric Utility
02/28/2021,2453,Youngstown,,14174,Non-Residential,ARRA Projects,National Grid
02/28/2021,218852,Queens,Queens,11356,Residential,PON 2112,Consolidated Edison
02/28/2021,220037,Warwick,Orange,10990,Residential,PON 2112,Orange and Rockland Utilities
02/28/2021,2011-230103-SLPR,Center Moriches,Suffolk,11934,Residential,Solar ARRA Funding,Long Island Power Authority

Sample File-5.csv

OBJECTID,Borough,PSSite,ParkName,ParkZone,PSStatus,GlobalID,CreatedDate,UpdatedDate
283721,Brooklyn,Street,,,Populated,C90AAD08-D99E-4759-A64C-219D6143BFB3,07-08-15 13:10,12/20/2019 04:34:58 PM
7669836,Queens,Park,Astoria Park,Q004-ZN02,Empty,AB55A658-8276-4734-A698-5FFCAE96578E,08/13/2020 01:18:00 PM,08/20/2020 06:15:32 PM
7123408,Brooklyn,Park,Asser Levy Park,,Populated,B32D93C9-5958-4129-A87A-FA7C9A5A4E87,01-09-20 13:15,01-09-20 13:17

Sample File-6.csv

OBJECTID,Borough,PSSite,ParkName,ParkZone,PSStatus,GlobalID,CreatedDate,UpdatedDate
6036681,Manhattan,Park,Riverside Park,,Populated,6A3E747D-CD5E-43EB-9789-67DB2064E878,04-11-18 11:11,08-06-20 21:21
7170578,Bronx,Park,Garden Of Eden,,Populated,B1E8B660-4B65-437F-B61F-06B1B71A4E1C,01/28/2020 03:18:00 PM,01/28/2020 03:19:26 PM
740416,Bronx,Park,Mullaly Park,X034-ZN02,Populated,E8F51E3B-CC6F-46A3-AF17-02B6BE8DCC57,08/26/2015 04:34:00 PM,01/30/2020 04:10:41 PM
5004669,Queens,Street,,,Populated,20157769-88EC-4867-9F50-852EF4814BF0,11-02-16 16:56,08-03-20 13:12:00 AM

Sample File-7.csv

Indicator,Group,State,Subgroup,Phase,Time Period,Time Period Label,Value,Low CI,High CI,Confidence Interval
Private Health Insurance Coverage,National Estimate,United States,United States,1,1,Apr 23 - May 5,75.4,74.7,76.2,74.7 - 76.2
Public Health Insurance Coverage,By Age,United States,18 - 24 years,1,1,Apr 23 - May 5,19.5,15.4,24.3,15.4 - 24.3
Uninsured at the Time of Interview,By Gender,United States,Female,1,1,Apr 23 - May 5,11,10.3,11.7,10.3 - 11.7

Sample File-8.csv

Year, dtmSurveyDate, ColonyID, strAOUCode, Type, strPhotoInterpreters, strColonyName, strCounty, strState, strCountry
2014,03-Jun-14,219-001,COMU,Image Check - No Birds,Kirsten Bixler,"""Tillamook Head Rocks"" (Eastern Rocks)",Clatsop County,Oregon,United States
2014,03-Jun-14,219-002,COMU,Image Check - No Birds,Kirsten Bixler,"""Tillamook Head Rocks"" (Northern Rock)",Clatsop County,Oregon,United States
2014,03-Jun-14,219-003,COMU,Shapefile-RawCount,Kirsten Bixler,"""Tillamook Head Rocks"" (Southwestern Rocks)",Clatsop County,Oregon,United States
2014,03-Jun-14,219-005,COMU,Shapefile,Shawn W. Stephensen,Tillamook Rock,Clatsop County,Oregon,United States

Desired outcome:

Sample File-1.csv }
Sample File-2.csv } Header check > Same Header found > Merge > GROUP-1.csv

Sample File-3.csv }
Sample File-4.csv } Header check > Same Header found > Merge > GROUP-2.csv

Sample File-5.csv }
Sample File-6.csv } Header check > Same Header found > Merge > GROUP-3.csv

Sample File-7.csv } Header check > No Similar Headers > None > GROUP-5.csv
Sample File-8.csv } Header check > No Similar Headers > None > GROUP-6.csv

Preferred solution: Bash script and Linux commands

Solution attempted:

#!/bin/bash
awk '
  FNR==1{
    if (!($0 in h)||file!=h[$0]){close(file)}
    if (!($0 in h)){file=h[$0]=i++}
    else{file=h[$0];next}
  }
  {print >> (file)}
' ./*.csv

https://unix.stackexchange.com/a/602291/459978

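The above works, but I am not sure whether it can process and categorise 1000+ files. I also need the output written as Group*.csv files in a different folder.

Minimum time to completion is important: https://stackoverflow.com/a/51921621/3088275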

I am looking for code using Awk, Sed, or other Linux commands in a bash script, whichever is fastest at producing the desired output.

You can use the following AWK script. I tested it with all the sample files.

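# Input is the output of `head -n 1 FILES`: three lines per file
# (a "==> NAME <==" banner, the header line, and a blank separator).
!NF { next }                                            # skip the blank separators
NR % 3 == 1 { filename = substr($0, 5, length($0)-8) }  # strip "==> " and " <=="
NR % 3 == 2 { headers[$0] = headers[$0] (headers[$0] == "" ? "" : ",") filename }

END {
  i = 1
  for (header in headers) {
    printf("Group %02d: %s\n", i, headers[header])
    out = sprintf("group%02d.txt", i)
    printf "" > out                  # create or truncate the group file
    n = split(headers[header], a, ",")
    for (k = 1; k <= n; k++) {
      getline x < a[k]               # read and discard this file's header line
      while ((getline x < a[k]) > 0)
        print x > out
      close(a[k])                    # free the descriptor before the next file
    }
    close(out)
    i++
  }
}
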
$ ls Sample\ File-*
Sample File-1.csv Sample File-3.csv Sample File-5.csv Sample File-7.csv
Sample File-2.csv Sample File-4.csv Sample File-6.csv Sample File-8.csv
$ head -n 1 Sample\ File-* | awk -f script.awk
Group 01: Sample File-1.csv,Sample File-2.csv
Group 02: Sample File-8.csv
Group 03: Sample File-7.csv
Group 04: Sample File-5.csv,Sample File-6.csv
Group 05: Sample File-3.csv,Sample File-4.csv
$ cat group01.txt
George HS - QUEENS,New,76850000,CAP,Q298,50-51 98TH STREET,Queens,11368
MARBLE HILL INTERNATIONAL HS -,EXT MASONRY/FLOOD/PARAPETS/ROOFS,10490000,CIP,X475,99 TERRACE VIEW AVENUE,Bronx,10463
NEW DORP HS - STATEN ISLAND,PARTIAL ACCESSIBILITY,488000,CIP,R435,465 NEW DORP LANE,Staten Island,10306
EAST SIDE COMMUNITY SCHOOL,FIFTH FLOOR CEILING REPLACEMENT,150000,CIP,M060,420 EAST 12 STREET,Manhattan,10009
RICHMOND HILL HS - QUEENS,STEEL DETERIORATED COLUMS & COLUMN,1064400,CIP,Q475,89-30 114 STREET,Queens,11418
SUCCESS ACADEMY CHARTER SCHOOL,INTERIOR STAIRS,2045000,CIP,M099,410 EAST 100 STREET,Manhattan,10029
$ cat group02.txt
02/28/2021,2453,Youngstown,,14174,Non-Residential,ARRA Projects,National Grid
02/28/2021,218852,Queens,Queens,11356,Residential,PON 2112,Consolidated Edison
02/28/2021,220037,Warwick,Orange,10990,Residential,PON 2112,Orange and Rockland Utilities
02/28/2021,2011-230103-SLPR,Center Moriches,Suffolk,11934,Residential,Solar ARRA Funding,Long Island Power Authority
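
The question also asks for GROUP-N.csv files written to a separate folder, with the header kept. The following is a minimal one-pass sketch of that variant, not the script tested above and not benchmarked. The names `group_by_header`, `src_dir` and `out_dir` are illustrative. It assumes file names contain no newlines, that the output folder is not the source folder itself, and that `find` supports `-maxdepth` (GNU and BSD `find` do).

```shell
#!/bin/bash
# Sketch: group CSV files by identical header line in a single awk process.
# Usage (hypothetical paths): group_by_header ./csv ./grouped
group_by_header() {
  local src_dir=$1 out_dir=$2
  mkdir -p "$out_dir" || return 1

  # File names arrive on stdin, so the number of files is not limited by
  # the command-line length. Sorting makes the group numbering repeatable.
  find "$src_dir" -maxdepth 1 -type f -name '*.csv' | sort |
  awk -v out_dir="$out_dir" '
    {
      file = $0
      # First line of the file is its header; skip empty or unreadable files.
      if ((getline header < file) <= 0) { close(file); next }

      # First time this header is seen: assign a group file and write the header once.
      if (!(header in group)) {
        group[header] = sprintf("%s/GROUP-%d.csv", out_dir, ++n)
        print header > group[header]
        close(group[header])
      }

      # Append the remaining (data) lines to the group file.
      out = group[header]
      while ((getline line < file) > 0)
        print line >> out

      # Close both files so at most two descriptors are open at any time.
      close(file)
      close(out)
    }
  '
}
```

Closing the input and output after every file keeps the number of open descriptors constant, which is the main concern with 1000+ files. The cost is one extra open and close per input file.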
