I have a folder with a few big CSV files, and I want to split them into a variable number of almost equally sized CSV files.
At the moment, this is my implementation, which splits the rows evenly:
#!/bin/bash
#copy header to all resulting files parts
head -n 1 $1_2021.csv | awk -v NPROC=$(nproc) '{ for (i = 0; i < NPROC; ++i) print $0 > "file_"i".csv" }'
#copy the data but the header for each file part
tail --silent -n+2 $1* | awk -v NPROC=$(nproc) '{ part = NR % NPROC; print $0 >> "file_"part".csv" }'
where $1 is the version of the files, passed as a parameter to the bash script, for instance v1 or v2. The output filenames are not relevant; currently file_"i".csv and file_"part".csv produce the same filenames, where part and i lie in the range 0 to NPROC-1.
Some sample rows from the file v1_2020.csv (semicolon-delimited):
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-04;w;n;1252;252
2020-01-05;w;n;453;253
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
As a table, it looks like this:
DATE | COLOUR | CLOSING | CHANGE | Y |
---|---|---|---|---|
2020-01-02 | r | n | 4 | 119 |
2020-01-02 | y | n | 56 | 130 |
2020-01-03 | y | n | 3 | 153 |
2020-01-03 | r | n | 46 | 192 |
2020-01-03 | b | n | 20 | 241 |
2020-01-04 | w | n | 1252 | 252 |
2020-01-05 | w | n | 453 | 253 |
2020-01-06 | b | y | 1 | 279 |
2020-01-06 | b | n | 945 | 294 |
I want to improve this division so that rows with the same date are never split across different files. So it should take into account the DATE column of the CSV file.
Current output with NPROC=2:
file_1.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-03;y;n;3;153
2020-01-03;b;n;20;241
2020-01-05;w;n;453;253
2020-01-06;b;n;945;294
file_2.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;y;n;56;130
2020-01-03;r;n;46;192
2020-01-04;w;n;1252;252
2020-01-06;b;y;1;279
Desired output with NPROC=2: any kind of uneven split into NPROC files, as long as it does not spread one date across different files. Each date should be in exactly one file, but a file may contain multiple dates.
For instance (any other split into NPROC files is fine if it respects the conditions above):
file_1.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
file_2.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-04;w;n;1252;252
2020-01-05;w;n;453;253
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
Could you give me any hint regarding a possible solution without using Python but just bash scripting?
If you just want to split a csv and add a header to each split, you can do:
awk -v cnt=6 -F ';' 'FNR==1 {header=$0; fn=1}
# on every cnt-th input line, switch to a new output file and write the header first
!(FNR%cnt) {
    fn++
    print header > ("file_" fn ".csv")
}
{print $0 > ("file_" fn ".csv")}' file
If you want to split contextually based on the date column (assuming already sorted):
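awk -v sp=6 -v fn=1 -F ';' 'FNR==1 {header=$0}
# once more than sp lines have gone to the current file and the date changes, start a new file
cnt++>sp && l1!=$1 {
    fn++
    cnt=0
    print header > ("file_" fn ".csv")
}
{print $0 > ("file_" fn ".csv"); l1=$1}' file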
Result of the second command:
cat *.csv
DATE;COLOUR;CLOSING;CHANGE
2020-01-02;r;n;4
2020-01-02;y;n;56
2020-01-03;y;n;3
2020-01-03;r;n;46
2020-01-03;b;n;20
2020-01-03;w;n;1252
DATE;COLOUR;CLOSING;CHANGE
2020-01-05;w;n;453
2020-01-06;b;y;1
2020-01-06;b;n;945
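To tie the threshold to NPROC rather than a fixed sp=6, one option is to compute a target row count first. This is only a minimal sketch: it assumes a single header line, input already sorted by date, and the sample rows from the question. The part_N.csv names and the target variable are made up for this illustration.

```shell
#!/bin/bash
# Sample input copied from the question (written to the current directory).
cat > v1_2020.csv <<'EOF'
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-04;w;n;1252;252
2020-01-05;w;n;453;253
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
EOF

NPROC=2   # in a real script this could be NPROC=$(nproc)

# Number of data rows (all lines except the header).
rows=$(( $(wc -l < v1_2020.csv) - 1 ))
# Ceiling division: how many rows each output file should hold at least.
target=$(( (rows + NPROC - 1) / NPROC ))

awk -F';' -v target="$target" -v nproc="$NPROC" '
# Header line: remember it and start the first output file with it.
FNR == 1 { header = $0; fn = 1; print header > ("part_" fn ".csv"); next }
# Switch files only when the current one is full, the date changes,
# and fewer than nproc files have been opened so far.
cnt >= target && $1 != last && fn < nproc {
    close("part_" fn ".csv")
    fn++
    cnt = 0
    print header > ("part_" fn ".csv")
}
{ print > ("part_" fn ".csv"); cnt++; last = $1 }
' v1_2020.csv
```

With the nine sample rows and NPROC=2 the target is 5 rows, so part_1.csv receives the five rows for 2020-01-02 and 2020-01-03 and part_2.csv receives the remaining four. If one date dominates the data, the files will be uneven, and fewer than NPROC files may be produced.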
First, processing CSV/TSV files with command-line tools can be tricky. The awk command is the go-to here, but it doesn't have built-in support for quoting; if you have a row like column 1; "column 2 has a ';' in it";column 3, then awk -F';' will see it as $1="column 1", $2=" \"column 2 has a '", $3="' in it\"", $4="column 3".
If your data doesn't have anything like that, then it's pretty straightforward. First, you want to write each date to its own file (skipping the header line):
awk -F';' 'FNR > 1 {print >> ($1 ".csv")}' v1_2020.csv
That will get you files named after the date, like 2020-01-02.csv.
Now you can merge those into NPROC files, and as long as you only merge whole files, you won't split data from a given date into multiple files. Here's one simple (and not necessarily elegant!) way to do that:
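# NPROC must be set beforehand, e.g. NPROC=$(nproc)
declare -i lines=$(cat *-*-*.csv | wc -l) chunk cur
(( chunk = lines / NPROC, cur = 1 ))
for f in *-*-*.csv; do
    cat "$f" >>"file_$cur.csv"
    # move on once this file is full, but never go past file_$NPROC.csv
    if (( cur < NPROC && $(wc -l <"file_$cur.csv") >= chunk )); then
        (( cur += 1 ))
    fi
done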
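Whichever way the files are merged, it is cheap to double-check that no date landed in more than one output file. A rough sketch of such a check follows; check_a.csv and check_b.csv are hypothetical stand-ins for real output files, split_dates.txt is a made-up report name, and the header is assumed to start with the literal field DATE as in the sample.

```shell
#!/bin/bash
# Two tiny stand-in output files; 2020-01-03 is deliberately present in both.
printf '%s\n' 'DATE;COLOUR;CLOSING;CHANGE;Y' \
    '2020-01-02;r;n;4;119' '2020-01-03;y;n;3;153' > check_a.csv
printf '%s\n' 'DATE;COLOUR;CLOSING;CHANGE;Y' \
    '2020-01-03;r;n;46;192' '2020-01-04;w;n;1252;252' > check_b.csv

if awk -F';' '
$1 == "DATE" { next }                       # skip header lines
($1 in seen) && seen[$1] != FILENAME {      # date was first seen in another file
    if (!($1 in reported)) {
        print $1 ": " seen[$1] " and " FILENAME
        reported[$1]                        # report each date only once
    }
    bad = 1
}
!($1 in seen) { seen[$1] = FILENAME }       # remember where a date first appeared
END { exit bad + 0 }                        # non-zero exit status if any date was split
' check_a.csv check_b.csv > split_dates.txt
then
    echo "no date is split across files"
else
    echo "dates found in more than one file:"
    cat split_dates.txt
fi
```

In real use, the two stand-in files would be replaced by the actual outputs, for example file_*.csv.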
awk -F';' -v NPROC=2 '
NR == 1 {head = $0; next}
!($1 in dates) {
n = (n + 1) % NPROC
file = "out_" n ".csv"
if (!(file in created)) {
print head > file
created[file]
}
dates[$1] = file
}
{ print > dates[$1] }
' v1_2020.csv
Since NPROC = 2, two output files are created:
$ cat out_0.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-05;w;n;453;253
$ cat out_1.csv
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-04;w;n;1252;252
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
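Round-robin by date keeps dates together but ignores how many rows each date has. If the sizes should be closer to equal, a possible variation is to read the file twice: count rows per date first, then hand each date to whichever output is currently smallest. This is only a sketch under assumptions: the input must be a regular file (it is read twice, so it cannot be a pipe), and the bal_N.csv names are invented for the example.

```shell
#!/bin/bash
# Sample input copied from the question (written to the current directory).
cat > v1_2020.csv <<'EOF'
DATE;COLOUR;CLOSING;CHANGE;Y
2020-01-02;r;n;4;119
2020-01-02;y;n;56;130
2020-01-03;y;n;3;153
2020-01-03;r;n;46;192
2020-01-03;b;n;20;241
2020-01-04;w;n;1252;252
2020-01-05;w;n;453;253
2020-01-06;b;y;1;279
2020-01-06;b;n;945;294
EOF

awk -F';' -v NPROC=2 '
# Pass 1: count rows per date and remember the order of first appearance.
NR == FNR {
    if (FNR > 1) {
        if (!($1 in count)) order[++ndates] = $1
        count[$1]++
    }
    next
}
# Pass 2, header line: write the header to every output file, then
# assign each date to the output with the fewest rows so far.
FNR == 1 {
    for (i = 0; i < NPROC; i++) {
        size[i] = 0
        print > ("bal_" i ".csv")
    }
    for (d = 1; d <= ndates; d++) {
        min = 0
        for (i = 1; i < NPROC; i++)
            if (size[i] < size[min]) min = i
        dest[order[d]] = "bal_" min ".csv"
        size[min] += count[order[d]]
    }
    next
}
# Pass 2, data rows: write each row to the file chosen for its date.
{ print > dest[$1] }
' v1_2020.csv v1_2020.csv
```

On the sample data this puts 2020-01-02, 2020-01-04 and 2020-01-05 (4 rows) into bal_0.csv and 2020-01-03 and 2020-01-06 (5 rows) into bal_1.csv. The greedy assignment is a heuristic, not an optimal partition; handling the largest dates first would usually balance better, at the cost of an extra sort.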