
How to break large “.csv” file into small files based on multiple conditions?

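I have large .csv files (~40 MB) and I want to break them into smaller files based on a few conditions, naming each file according to its data:

  1. Split the file by the content of the 3rd column,
  2. Split the output of point 1 by the content of the 4th column,

Here is the tricky part:

  1. For the output created by the 2 previous operations, check whether there is any data in the 11th column. If yes, separate the data according to that content and then separate it by the content of the 17th column -> then save outputs /OR / AND /
  2. If there is no data in column 11, check the 15th column and separate accordingly. Next check the 17th column and separate the data by the 17th column -> save outputs.

I have something like this in VBA, but it is too slow for large files and Excel sometimes crashes. With multiple files like this, it takes a lot of time to cut them down manually and then feed them to the VBA.

Is it possible to cut files with this many conditions?

Thanks in advance for any help.

Example: (the header row gives the column numbers)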

1       2   3   4   11  15  17
Date        Time    COUNTRY CITY    CHECK   TEST    TEST2
2015-08-20  11:54   ENGLAND ABINGDON        1       1
2015-08-21  12:54   ENGLAND BATLEY          2       5
2015-08-22  13:54   ENGLAND FROME           2       6
2015-08-23  14:54   ENGLAND FROME   2       1
2015-08-24  15:54   USA CALIFORNIA          4       8
2015-08-25  16:54   USA CONNECTICUT         4       9
2015-08-26  17:54   USA DELAWARE    1               8
2015-08-27  18:54   GERMANY SAXONY          6       9
2015-08-28  19:54   GERMANY SAXONY          6       10
2015-08-27  18:54   GERMANY SAXONY          4       11
2015-08-28  19:54   GERMANY SAXONY          4       14
2015-08-29  20:54   GERMANY HESSE                   8
2015-08-29  20:54   GERMANY HESSE   1               8

File1                       
2015-08-20  11:54   ENGLAND ABINGDON        1       1

File2                       
2015-08-21  12:54   ENGLAND BATLEY          2       5

File3                       
2015-08-22  13:54   ENGLAND FROME           2       6

File4                       
2015-08-23  14:54   ENGLAND FROME   2               1

File5                       
2015-08-24  15:54   USA CALIFORNIA          4       8

File6                       
2015-08-25  16:54   USA CONNECTICUT         4       9

File7                       
2015-08-26  17:54   USA DELAWARE    1               8

File8                       
2015-08-27  18:54   GERMANY SAXONY          4       9
2015-08-28  19:54   GERMANY SAXONY          4       10

File9                       
2015-08-27  18:54   GERMANY SAXONY          6       11
2015-08-28  19:54   GERMANY SAXONY          6       14

File10                      
2015-08-29  20:54   GERMANY HESSE                   8

File11                      
2015-08-29  20:54   GERMANY HESSE   1               8

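Your data is all over the place! It is not in the columns you describe, and it is not separated by tabs. That does not make life easy!

Try this awk on your real data to see whether it generates output filenames you can use.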

awk -F'\t' '{
    f=$3 "_" $4                # filename = field3 _ field4
    if(length($11)){           # if f11 not null
      f=f "_A_" $11 "_" $17    #    filename = filename _A_ field11 _ field17
    }else{                     # else
      f=f "_B_" $15 "_" $17    #    filename = filename _B_ field15 _ field17
    }
    print f}' file.csv

You should get something like this:

ENGLAND_ABINGDON_A_3_1
ENGLAND_ABINGDON_A_4_2
GERMANY_SAXONY_B_5_3

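Basically, it uses awk and tells it that the field separator is a tab. It then looks at each line and builds the output filename in the variable f from the fields, as described in the comments.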
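To try the filename logic without touching the real data, here is a minimal sketch. It assumes a tab-separated file with 17 columns. The `row` helper, the `sample.tsv` name and the three rows are made up for illustration (the values are borrowed from the question's example).

```shell
#!/bin/sh
# Hypothetical helper: print one tab-separated record with 17 fields,
# filling only columns 1-4, 11, 15 and 17 and leaving the rest empty.
row() {
  awk -v OFS='\t' -v c1="$1" -v c2="$2" -v c3="$3" -v c4="$4" \
      -v c11="$5" -v c15="$6" -v c17="$7" \
      'BEGIN { $1=c1; $2=c2; $3=c3; $4=c4; $11=c11; $15=c15; $17=c17; print }'
}

workdir=$(mktemp -d)
{
  row 2015-08-20 11:54 ENGLAND ABINGDON "" 1  1   # column 11 empty
  row 2015-08-23 14:54 ENGLAND FROME    2  "" 1   # column 11 filled
  row 2015-08-27 18:54 GERMANY SAXONY   "" 6  9   # column 11 empty
} > "$workdir/sample.tsv"

# Same naming rule as above: A-branch if field 11 has data, else B-branch.
names=$(awk -F'\t' '{
    f=$3 "_" $4
    if(length($11)){ f=f "_A_" $11 "_" $17 }
    else           { f=f "_B_" $15 "_" $17 }
    print f}' "$workdir/sample.tsv")
printf '%s\n' "$names"
```

With these three rows it prints ENGLAND_ABINGDON_B_1_1, ENGLAND_FROME_A_2_1 and GERMANY_SAXONY_B_6_9, one per line.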

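If this separates your data the way you intend, you can actually write the current line to a file with the corresponding name simply by changing the last line: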

awk -F'\t' '{
    f=$3 "_" $4                # filename = field3 _ field4
    if(length($11)){           # if f11 not null
      f=f "_A_" $11 "_" $17    #    filename = filename _A_ field11 _ field17
    }else{                     # else
      f=f "_B_" $15 "_" $17    #    filename = filename _B_ field15 _ field17
    }
    print > f}' file.csv

Basically, it now prints the current line to the file rather than printing the file's name, once you change

print f

to

print > f
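One thing to watch, which is not covered above: awk keeps every output file open until the program ends (or until `close()` is called), and the operating system limits how many files a process may have open at once. If the data produces a very large number of distinct names, a possible workaround is to append with `>>` and close the file after each write. This is only a sketch under that assumption; the `row` helper and the sample rows are hypothetical, and closing after every line trades speed for a low open-file count.

```shell
#!/bin/sh
# Hypothetical helper: one tab-separated record with 17 fields.
row() {
  awk -v OFS='\t' -v c1="$1" -v c2="$2" -v c3="$3" -v c4="$4" \
      -v c11="$5" -v c15="$6" -v c17="$7" \
      'BEGIN { $1=c1; $2=c2; $3=c3; $4=c4; $11=c11; $15=c15; $17=c17; print }'
}

# Work in an empty directory: with ">>", leftovers from an earlier run
# would be appended to instead of overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
{
  row 2015-08-27 18:54 GERMANY SAXONY "" 6  9
  row 2015-08-23 14:54 ENGLAND FROME  2  "" 1
  row 2015-08-28 19:54 GERMANY SAXONY "" 6  9   # same key as the first row
} > file.csv

awk -F'\t' '{
    f=$3 "_" $4
    if(length($11)){ f=f "_A_" $11 "_" $17 }
    else           { f=f "_B_" $15 "_" $17 }
    print >> f     # append, so reopening the file does not truncate it
    close(f)       # release the file handle before the next line
}' file.csv

ls
```

Both SAXONY rows end up in GERMANY_SAXONY_B_6_9 even though the file is closed and reopened in between.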

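Accounting for headers

If you want a header on each output file, we need to work a bit harder...

First, we need to save the header from the original file. Assuming it is record number 1, we would do: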

awk -F'\t' '
    NR==1 {header=$0; next}     # save first line as header, then skip it
    {f=$3 "_" $4                # filename = field3 _ field4
    ...
    ... as before

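Now, whenever we start writing a new file, we need to output the header line first, which is "fun" because we create the output filename dynamically for every line! So we need to "remember" which files we have already written to, and emit the header only when we write to a new one. I don't have a suitable dataset here, so this is partly guesswork!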

awk -F'\t' '
    NR==1 {header=$0; next}    # save first line as header, then skip it
    {f=$3 "_" $4               # filename = field3 _ field4
    if(length($11)){           # if f11 not null
      f=f "_A_" $11 "_" $17    #    filename = filename _A_ field11 _ field17
    }else{                     # else
      f=f "_B_" $15 "_" $17    #    filename = filename _B_ field15 _ field17
    }
    # Emit header if first write to this filename
    if(!(f in fileswritten)){
       fileswritten[f]++         # note that we have written to this file
       print header > f          # emit header
    }
    print > f}' file.csv
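Since the script above is untested against real data, a small check like the following may help confirm that each output file gets the header exactly once. It is a sketch on made-up input: the `row` helper and the rows are hypothetical, and skipping the header record with `next` (so that it does not produce an output file of its own) is an assumption about the desired behaviour.

```shell
#!/bin/sh
# Hypothetical helper: one tab-separated record with 17 fields.
row() {
  awk -v OFS='\t' -v c1="$1" -v c2="$2" -v c3="$3" -v c4="$4" \
      -v c11="$5" -v c15="$6" -v c17="$7" \
      'BEGIN { $1=c1; $2=c2; $3=c3; $4=c4; $11=c11; $15=c15; $17=c17; print }'
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1
{
  row Date       Time  COUNTRY CITY   CHECK TEST TEST2   # header record
  row 2015-08-27 18:54 GERMANY SAXONY ""    6    9
  row 2015-08-28 19:54 GERMANY SAXONY ""    6    9
  row 2015-08-29 20:54 GERMANY HESSE  1     ""   8
} > file.csv

awk -F'\t' '
    NR==1 {header=$0; next}    # keep the header, do not treat it as data
    {f=$3 "_" $4
    if(length($11)){ f=f "_A_" $11 "_" $17 }
    else           { f=f "_B_" $15 "_" $17 }
    if(!(f in fileswritten)){  # first write to this filename
       fileswritten[f]++
       print header > f
    }
    print > f}' file.csv

# Each output file should start with the header and contain it only once.
for out in GERMANY_*; do
  printf '%s: %s line(s), %s header line(s)\n' \
    "$out" "$(grep -c '' "$out")" "$(grep -c '^Date' "$out")"
done
```

Here GERMANY_SAXONY_B_6_9 should hold three lines (the header plus two records) and GERMANY_HESSE_A_1_8 two.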

This answer is incomplete, but it roughly shows what you need to do:

#!/bin/bash

# Get list of countries:
countries=`cat file1.csv | cut -f 3 -d$'\t'| grep -v 3 | grep -v COUNTRY | sort -u`

for country in ${countries}; do
    # Get list of cities per country:
    cities=`cat file1.csv | grep ${country} | cut -f 4 -d$'\t' | sort -u`

    # Get data per country:
    cat file1.csv | grep ${country} > file1-${country}.csv

    # Get data per city per country:
    for city in ${cities}; do
    echo ${country}:${city}

    cat file1.csv | grep ${country} | grep ${city} > file1-${country}-${city}.csv
    done

    # Created output by 2 previous operations check if there is any data in 11th column,
    # if yes then separate this data accordingly to content and after that separate that
    # by content of 17th column -> then save outputs /OR / AND /
    # Column 11 is at position 5 in your data.
    checks=`cat file1.csv | grep ${country} | cut -f 5 -d$'\t' | sort -u`
    for check in ${checks}; do
        echo ${country}:${check}

        # Keep only this country's rows whose 5th field equals ${check}
        cat file1.csv | grep ${country} | awk -F'\t' -v c="${check}" '$5 == c' > file1-${country}-${check}.csv

        # TODO: Further split this, I assume you get the drift by now.
    done

    # If there is no data in column 11 check column 15th and separate accordingly.
    # Next check 17 column and separate this data by 17th column -> save outputs.
    # TODO: Further split this, I assume you get the drift by now.

done

I suggest writing a small program using the Java CSVFormat class (from the Apache Commons CSV library):

private static final String[] FILE_HEADER_MAPPING = {"Date", "Time" ,"COUNTRY", .... };
csvFileParser = new CSVParser(fileReader, csvFileFormat);
        List<CSVRecord> csvRecords = csvFileParser.getRecords();

Then, to access column 11, you would do:

 for (int i = 1; i < csvRecords.size(); i++) {   // start at 1 to skip the header record
    boolean publishAccount = true;
    CSVRecord record = csvRecords.get(i);
    // access the field by the header name given to column 11 in FILE_HEADER_MAPPING
    record.get("Field column 11");
 }
