Split unevenly a CSV file in multiple files in bash scripting

Question

I have a folder with a few big CSV files, and I want to split them into a variable number of almost equally sized CSV files.

At the moment, this is my implementation of the evenly sized split:

#!/bin/bash

#copy the header to all resulting file parts
head -n 1 "$1"_2021.csv | awk -v NPROC="$(nproc)" '{ for (i = 0; i < NPROC; ++i) print $0 > ("file_" i ".csv") }'

#copy the data, without the header, to each file part
tail --silent -n+2 "$1"* | awk -v NPROC="$(nproc)" '{ part = NR % NPROC; print $0 >> ("file_" part ".csv") }'

where $1 is the version of the files, passed as a parameter to the bash script, for instance v1 or v2 . The output filenames are not relevant; currently file_"i".csv and file_"part".csv produce the same filenames, where part and i both range from 0 to NPROC-1.

Some sample rows from the file v1_2020.csv (semicolon-delimited):

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-02;r;n;4;119  
2020-01-02;y;n;56;130  
2020-01-03;y;n;3;153  
2020-01-03;r;n;46;192  
2020-01-03;b;n;20;241  
2020-01-04;w;n;1252;252  
2020-01-05;w;n;453;253  
2020-01-06;b;y;1;279  
2020-01-06;b;n;945;294

As a table, it looks like this:

DATE	COLOUR	CLOSING	CHANGE
2020-01-02	r	n	4
2020-01-02	y	n	56
2020-01-03	y	n	3
2020-01-03	r	n	46
2020-01-03	b	n	20
2020-01-03	w	n	1252
2020-01-05	w	n	453
2020-01-06	b	y	1
2020-01-06	b	n	945

I want to improve this split so that rows with the same date never end up in different files. It should therefore take the DATE column of the CSV file into account.

Current output with `NPROC=2` :

file_1.csv

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-02;r;n;4;119  
2020-01-03;y;n;3;153  
2020-01-03;b;n;20;241  
2020-01-05;w;n;453;253  
2020-01-06;b;n;945;294

file_2.csv

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-02;y;n;56;130  
2020-01-03;r;n;46;192  
2020-01-04;w;n;1252;252  
2020-01-06;b;y;1;279

New output with `NPROC=2` :

Any kind of uneven split into NPROC files is fine, as long as it does not spread a date across different files. Each date should be in exactly one file, but a file may contain multiple dates.

For instance (any other split into NPROC files is fine if it respects the conditions above):

file_1.csv

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-02;r;n;4;119  
2020-01-02;y;n;56;130  
2020-01-03;y;n;3;153  
2020-01-03;r;n;46;192  
2020-01-03;b;n;20;241

file_2.csv

DATE;COLOUR;CLOSING;CHANGE;Y  
2020-01-04;w;n;1252;252  
2020-01-05;w;n;453;253  
2020-01-06;b;y;1;279  
2020-01-06;b;n;945;294

Could you give me any hint regarding a possible solution without using Python but just bash scripting?

Answer 1

If you just want to split a CSV and add a header to each part, you can do:

awk -v cnt=6 -F ';' 'FNR==1{header=$0; fn=1}
!(FNR%cnt){
    fn++
    print header > ("file_" fn ".csv")
}
{print $0 > ("file_" fn ".csv")}' file

If you want to split contextually based on the date column (assuming the input is already sorted by date):

awk -v sp=6 -v fn=1 -F ';' 'FNR==1{header=$0}
cnt++>sp && l1!=$1 {
    fn++
    cnt=0
    print header > ("file_" fn ".csv")
}
{print $0 > ("file_" fn ".csv"); l1=$1}' file

Result of the second command:

cat *.csv
DATE;COLOUR;CLOSING;CHANGE
2020-01-02;r;n;4
2020-01-02;y;n;56
2020-01-03;y;n;3
2020-01-03;r;n;46
2020-01-03;b;n;20
2020-01-03;w;n;1252
DATE;COLOUR;CLOSING;CHANGE
2020-01-05;w;n;453
2020-01-06;b;y;1
2020-01-06;b;n;945
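
The fixed threshold above does not tie the number of output files to NPROC. As a rough sketch only, the threshold could be derived from the row count instead. The function name split_sorted_by_date and the prefix argument are made up for illustration, and the sketch assumes the input is sorted by date, has exactly one header line, and ends with a newline:

```shell
# Sketch: split a date-sorted CSV into at most NPROC files.
# The body runs in a subshell, so its variables stay local.
split_sorted_by_date() (
    input=$1 nproc=$2 prefix=$3
    rows=$(( $(wc -l < "$input") - 1 ))          # data rows, header excluded
    target=$(( (rows + nproc - 1) / nproc ))     # ceil(rows / nproc)
    awk -F';' -v target="$target" -v prefix="$prefix" '
        NR == 1 { header = $0; next }
        # Open a new file for the first data row, or once the current
        # file holds at least "target" rows and the date changes.
        NR == 2 || (cnt >= target && $1 != last) {
            if (out != "") close(out)
            out = prefix (++fn) ".csv"
            print header > out
            cnt = 0
        }
        { print > out; cnt++; last = $1 }
    ' "$input"
)
```

Called as split_sorted_by_date v1_2020.csv "$(nproc)" file_ it writes file_1.csv, file_2.csv and so on. Every file except possibly the last holds at least ceil(rows / NPROC) rows, so no more than NPROC files are created, though there may be fewer.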

Answer 2

First, processing CSV/TSV files with command-line tools can be tricky. The awk command is the go-to here, but it doesn't have built-in support for quoting; if you have a row like column 1; "column 2 has a ';' in it";column 3 , then awk -F';' will see it as $1="column 1" , $2=" \"column 2 has a '" , $3="' in it\"" , $4="column 3" .
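
A quick way to see the effect, using a made-up row (the text is illustrative and not taken from the question's data):

```shell
# Hypothetical row: the second field is quoted and contains the delimiter.
row='column 1;"column 2 has a ; in it";column 3'

# Plain field splitting ignores the quotes, so awk counts 4 fields, not 3.
nf=$(printf '%s\n' "$row" | awk -F';' '{ print NF }')
echo "fields seen by awk: $nf"
```

If a check like this reports more fields than the header has columns, plain -F';' splitting is not safe for that file.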

If your data doesn't have anything like that, then it's pretty straightforward. First, you want to write each date to its own file:

 awk -F';'  '{print >> ($1 ".csv")}' v1_2020.csv

That will get you files named after the date, like 2020-01-02.csv .

Now you can merge those into NPROC files, and as long as you only merge whole files, you won't split data from a given date into multiple files. Here's one simple (and not necessarily elegant!) way to do that:

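# NPROC must be set beforehand, for example NPROC=$(nproc)
declare -i lines=$(cat *-*-*.csv | wc -l) chunk cur
# round up so that at most NPROC output files are created
(( chunk = (lines + NPROC - 1) / NPROC, cur = 1 ))
for f in *-*-*.csv; do
  cat "$f" >>"file_$cur.csv"
  if (( $(wc -l <"file_$cur.csv") >= chunk )); then
     (( cur += 1 ))
  fi
done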
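
The two steps above leave the per-date files behind and produce output without a header line. One possible way to wrap them up is sketched below. The function name split_by_date_files and its arguments are hypothetical, and the sketch assumes one header line, dates in column 1 that are safe to use as filenames, no quoted fields containing ';', and an output directory with no leftover file_*.csv from an earlier run:

```shell
# Sketch: split INPUT into at most NPROC files without splitting a date.
# The body runs in a subshell, so its variables stay local.
split_by_date_files() (
    input=$1 nproc=$2 outdir=$3
    tmp=$(mktemp -d) || exit 1
    header=$(head -n 1 "$input")

    # Step 1: one temporary file per date (header line skipped).
    awk -F';' -v dir="$tmp" 'NR > 1 {
        f = dir "/" $1 ".csv"
        print >> f
        close(f)    # keep the number of open files small
    }' "$input"

    # Step 2: merge whole per-date files, moving to the next output file
    # once the current one holds at least ceil(lines / nproc) data rows.
    lines=$(cat "$tmp"/*.csv | wc -l)
    chunk=$(( (lines + nproc - 1) / nproc ))
    cur=1 count=0
    for f in "$tmp"/*.csv; do
        out="$outdir/file_$cur.csv"
        [ -e "$out" ] || printf '%s\n' "$header" > "$out"
        cat "$f" >> "$out"
        count=$(( count + $(wc -l < "$f") ))
        if [ "$count" -ge "$chunk" ]; then
            cur=$(( cur + 1 )) count=0
        fi
    done
    rm -rf "$tmp"
)
```

For example, split_by_date_files v1_2020.csv "$(nproc)" . would write file_1.csv, file_2.csv and so on into the current directory. Because the shell expands the glob in sorted order, ISO dates end up merged chronologically.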

Answer 3

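awk -F';' -v NPROC=2 '
    NR == 1 {head = $0; next}           # remember the header line
    !($1 in dates) {                    # first row of a new date
        n = (n + 1) % NPROC             # pick the next file, round-robin
        file = "out_" n ".csv"
        if (!(file in created)) {
            print head > file           # write the header once per file
            created[file]
        }
        dates[$1] = file                # all rows of this date go here
    }
    { print > dates[$1] }
' v1_2020.csv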

Since NPROC = 2, two output files are created:

$ cat out_0.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-05;w;n;453;253

$ cat out_1.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-04;w;n;1252;252
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
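
Whichever split is used, the condition from the question (no date in more than one file) can be checked afterwards. A small sketch, with a made-up function name, assuming each file starts with one header line and has the date in column 1:

```shell
# Print every date that occurs in more than one of the given files.
dates_in_multiple_files() {
    awk -F';' '
        FNR == 1 { next }                      # skip the header of each file
        !(($1, FILENAME) in seen) {            # first time this date shows up in this file
            seen[$1, FILENAME]
            if (++nfiles[$1] == 2) print $1    # report a date once, on its second file
        }
    ' "$@"
}
```

Running dates_in_multiple_files out_*.csv should print nothing when the split keeps every date in a single file.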


3 answers

solution1
2 2021-07-13 14:02:23

solution2
1 2021-07-13 13:52:29

solution3
1 ACCEPTED 2021-07-13 14:38:42
