
Split a CSV file unevenly into multiple files with bash scripting

I have a folder with a few big CSV files, and I want to split them into a variable number of almost equally sized CSV files.

At the moment this is my implementation, which divides the rows evenly:

#!/bin/bash

# Copy the header into every output part
head -n 1 "$1"_2021.csv | awk -v NPROC="$(nproc)" '{ for (i = 0; i < NPROC; ++i) print $0 > ("file_" i ".csv") }'

# Append the data rows (everything but the header) round-robin to the parts
tail --silent -n +2 "$1"* | awk -v NPROC="$(nproc)" '{ part = NR % NPROC; print $0 >> ("file_" part ".csv") }'

where $1 is the version of the files, passed as a parameter to the bash script, for instance v1 or v2. The output filenames are not relevant; currently file_"i".csv and file_"part".csv produce the same filenames, where part and i lie in the range [0, NPROC).

Sample rows from the file v1_2020.csv (semicolon-delimited):

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-02;r;n;4;119  
2020-01-02;y;n;56;130  
2020-01-03;y;n;3;153  
2020-01-03;r;n;46;192  
2020-01-03;b;n;20;241  
2020-01-04;w;n;1252;252  
2020-01-05;w;n;453;253  
2020-01-06;b;y;1;279  
2020-01-06;b;n;945;294  

As a table, it looks like this:

DATE        COLOUR  CLOSING  CHANGE
2020-01-02  r       n        4
2020-01-02  y       n        56
2020-01-03  y       n        3
2020-01-03  r       n        46
2020-01-03  b       n        20
2020-01-03  w       n        1252
2020-01-05  w       n        453
2020-01-06  b       y        1
2020-01-06  b       n        945

I want to improve this division so that rows with the same date never end up in different files. In other words, the split should take the DATE column of the CSV file into account.

Current output with NPROC=2 :

file_1.csv

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-02;r;n;4;119  
2020-01-03;y;n;3;153  
2020-01-03;b;n;20;241  
2020-01-05;w;n;453;253  
2020-01-06;b;n;945;294

file_2.csv

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-02;y;n;56;130  
2020-01-03;r;n;46;192  
2020-01-04;w;n;1252;252  
2020-01-06;b;y;1;279 

New output with NPROC=2 :

Any uneven split into NPROC files is acceptable, as long as it does not spread a date across different files. Each date must appear in exactly one file, but a file may contain multiple dates.

For instance (any other split into NPROC files is fine if it respects the conditions above):

file_1.csv

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-02;r;n;4;119  
2020-01-02;y;n;56;130  
2020-01-03;y;n;3;153  
2020-01-03;r;n;46;192  
2020-01-03;b;n;20;241  

file_2.csv

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-04;w;n;1252;252  
2020-01-05;w;n;453;253  
2020-01-06;b;y;1;279  
2020-01-06;b;n;945;294

Could you give me any hint regarding a possible solution without using Python but just bash scripting?

If you just want to split a csv and add a header to each split, you can do:

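awk -v cnt=6 -F ';' 'FNR==1{header=$0; fn=1}   # remember the header; it also lands in file_1.csv as the first line
!(FNR%cnt){                                     # every cnt-th input line starts a new file
    fn++
    print header > ("file_" fn ".csv")
}
{print $0 > ("file_" fn ".csv")}' file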

If you want to split contextually based on the date column (assuming already sorted):

awk -v sp=6 -v fn=1 -F ';' 'FNR==1{header=$0}
cnt++>sp && l1!=$1 {          # past the row threshold and the date has changed
    fn++
    cnt=0
    print header > ("file_" fn ".csv")
}
{print $0 > ("file_" fn ".csv"); l1=$1}' file

Result of the second command:

cat *.csv
DATE;COLOUR;CLOSING;CHANGE
2020-01-02;r;n;4
2020-01-02;y;n;56
2020-01-03;y;n;3
2020-01-03;r;n;46
2020-01-03;b;n;20
2020-01-03;w;n;1252
DATE;COLOUR;CLOSING;CHANGE
2020-01-05;w;n;453
2020-01-06;b;y;1
2020-01-06;b;n;945
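
The sp threshold above fixes the number of rows per file rather than the number of files. As a rough sketch (not part of the original answer), the threshold could instead be derived from the total row count so that at most NPROC files are produced. It assumes the input is sorted by date and can be read twice; the names sample.csv and part_N.csv are made up for the illustration.

```shell
#!/bin/bash
# Work in a scratch directory so the sketch does not touch real data.
cd "$(mktemp -d)" || exit 1

# Sample input taken from the question (already sorted by DATE).
cat > sample.csv <<'EOF'
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-04;w;n;1252;252
2020-01-05;w;n;453;253
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
EOF

NPROC=2   # assumption: fixed for the demo; NPROC=$(nproc) in practice

# The file is listed twice: pass 1 counts data rows, pass 2 writes the parts.
awk -F';' -v nproc="$NPROC" '
    NR == FNR { if (FNR > 1) total++; next }        # pass 1: count data rows
    FNR == 1 {                                      # pass 2: header line
        header = $0
        target = int((total + nproc - 1) / nproc)   # ceil(total / nproc)
        fn = 1; out = "part_" fn ".csv"
        print header > out
        next
    }
    # Open a new part only when the date changes, the target row count
    # has been reached, and fewer than nproc parts exist so far.
    cnt >= target && $1 != last && fn < nproc {
        close(out)
        fn++; cnt = 0; out = "part_" fn ".csv"
        print header > out
    }
    { print > out; cnt++; last = $1 }
' sample.csv sample.csv
```

With the sample data this yields part_1.csv holding the rows for 2020-01-02 and 2020-01-03 and part_2.csv holding the rest. The result can contain fewer than NPROC files when a few dates carry most of the rows.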

First, processing CSV/TSV files with command-line tools can be tricky. The awk command is the go-to here, but it doesn't have built-in support for quoting; if you have a row like column 1; "column 2 has a ';' in it";column 3, then awk -F';' will see it as $1="column 1", $2=" \"column 2 has a '", $3="' in it\"", $4="column 3".
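
The splitting problem can be reproduced with a small sketch. The row is a made-up example; the inner single quotes from the example above are dropped only to keep the shell quoting simple.

```shell
#!/bin/bash
# Hypothetical row: the second field is quoted and contains the delimiter.
row='column 1;"column 2 has a ; in it";column 3'

# awk splits on every semicolon, including the one inside the quotes.
nf=$(printf '%s\n' "$row" | awk -F';' '{ print NF }')
echo "fields seen by awk: $nf"

# Print each field in brackets to show where the quoted field was cut.
printf '%s\n' "$row" | awk -F';' '{ for (i = 1; i <= NF; i++) print i ": [" $i "]" }'
```

awk reports four fields instead of the three a CSV-aware parser would return. If the data can contain quoted delimiters, a CSV-aware tool is the safer choice; GNU awk's FPAT variable, which describes fields by their content rather than by the separator, is one possible direction.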

If your data doesn't have anything like that, then it's pretty straightforward. First, you want to write each date to its own file:

awk -F';' '{ print >> ($1 ".csv") }'

That will get you files named after the date, like 2020-01-02.csv .
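
Note that the one-liner also writes the header row to a file named DATE.csv and keeps every output file open, which may hit the open-file limit of some awk implementations when there are many dates. A possible variant, sketched here with a made-up input name sample.csv, skips the header and closes each file when the date changes:

```shell
#!/bin/bash
# Work in a scratch directory so the sketch does not touch real data.
cd "$(mktemp -d)" || exit 1

# Sample input taken from the question.
cat > sample.csv <<'EOF'
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-04;w;n;1252;252
2020-01-05;w;n;453;253
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
EOF

awk -F';' '
    NR == 1 { next }                   # skip the header row
    $1 != last {                       # the date changed
        if (out != "") close(out)      # release the previous file
        out = $1 ".csv"; last = $1
    }
    { print >> out }                   # append, so a reopened file keeps its rows
' sample.csv

ls -- *-*-*.csv
```

Because the files are opened in append mode, running the sketch twice in the same directory would duplicate the rows, so the per-date files should be removed between runs.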

Now you can merge those into NPROC files, and as long as you only merge whole files, you won't split data from a given date into multiple files. Here's one simple (and not necessarily elegant!) way to do that:

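NPROC=$(nproc)   # number of output files; set here because the snippet relies on it
declare -i lines=$(cat *-*-*.csv | wc -l) chunk cur
(( chunk = lines / NPROC, cur = 1 ))
for f in *-*-*.csv; do
  cat "$f" >>"file_$cur.csv"
  # move on once the current file is full, but never beyond NPROC files
  if (( cur < NPROC && $(wc -l <"file_$cur.csv") >= chunk )); then
     (( cur += 1 ))
  fi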
done

Another option is a single awk pass that assigns each new date to the next output file in round-robin order:

awk -F';' -v NPROC=2 '
    NR == 1 {head = $0; next}
    !($1 in dates) {
        n = (n + 1) % NPROC
        file = "out_" n ".csv"
        if (!(file in created)) {
            print head > file
            created[file]
        }
        dates[$1] = file
    }
    { print > dates[$1] }
' v1_2020.csv

Since NPROC = 2, two output files are created:

$ cat out_0.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-05;w;n;453;253

$ cat out_1.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-04;w;n;1252;252
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
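
Round-robin assignment by date ignores how many rows each date carries, so the parts can end up lopsided. One way to even them out, sketched below under the assumption that the file can be read twice, is to count the rows per date first and then give each date to the part that currently holds the fewest rows. The names sample.csv and part_N.csv are made up for the illustration.

```shell
#!/bin/bash
# Work in a scratch directory so the sketch does not touch real data.
cd "$(mktemp -d)" || exit 1

# Sample input taken from the question.
cat > sample.csv <<'EOF'
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-04;w;n;1252;252
2020-01-05;w;n;453;253
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
EOF

NPROC=2   # assumption: fixed for the demo; NPROC=$(nproc) in practice

awk -F';' -v nproc="$NPROC" '
    FNR == 1  { head = $0; next }        # header line of either pass
    NR == FNR { rows[$1]++; next }       # pass 1: count rows per date
    !($1 in dest) {                      # pass 2: first row of a new date
        best = 0                         # find the least loaded part
        for (i = 1; i < nproc; i++)
            if (load[i] + 0 < load[best] + 0) best = i
        dest[$1] = "part_" best ".csv"
        load[best] += rows[$1]
        if (!(best in started)) { print head > dest[$1]; started[best] = 1 }
    }
    { print > dest[$1] }
' sample.csv sample.csv
```

With the sample data, part_0.csv receives the dates 2020-01-02, 2020-01-04 and 2020-01-05 (4 rows) and part_1.csv receives 2020-01-03 and 2020-01-06 (5 rows). This is a greedy heuristic, so the balance is not guaranteed to be optimal.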
