Split one large CSV file into smaller files based on the first column in bash

I have several large CSV files (~20 MiB each) containing data like the sample below. I would like to find a way to split each file into smaller files based on the date in the first column. For example, the following segment would be split into 2 files, namely 20130719.csv and 20130720.csv.

I would also like to sort the lines within each smaller file according to the 4th column (the color tag). Does anyone have any suggestions on how I can do this?

Is there anything I should learn about for dealing with this kind of task?

19/07/2013  19:14:24:523    6.35099E+17 Dr_Blue 10.42496014 27.17010689 0.685520172
19/07/2013  19:18:5:903 6.35099E+17 Dr_Yellow   11.09363079 28.57788467 2.010284424
19/07/2013  19:36:33:645    6.35099E+17 Dr_Blue 10.77513885 28.3723774  1.897870064
19/07/2013  21:29:36:762    6.35099E+17 Dr_Yellow   10.64018059 28.56962967 1.117245913
19/07/2013  21:29:37:627    6.35099E+17 Dr_Yellow   11.3354435  27.57170868 1.552354813
20/07/2013  2:34:28:2   6.35099E+17 Dr_Yellow   10.41067123 26.84050369 0.919301987
20/07/2013  2:34:28:840 6.35099E+17 Dr_Yellow   10.54369164 27.17712402 0.573934555
20/07/2013  2:34:33:192 6.35099E+17 Dr_Yellow   10.98471832 28.35677719 1.497600555
20/07/2013  4:20:28:246 6.35099E+17 Dr_Blue 10.92816448 28.55761147 2.187088013

Here is a simplified shell version

# Split on "/" as well as whitespace, but only for the read itself
while IFS="$IFS/" read -r DAY MO YR A B C D E F || [ "$DAY" ]; do
  echo "$A $B $C $D $E $F" >> "$YR$MO$DAY.ssv"
done <infile

for x in *.ssv; do
  # The date was dropped when writing the .ssv file, so the color tag is now field 3
  sort -k3,3 "$x" | tr " " "," > "${x%.ssv}.csv"
  rm "$x"
done

For sorting on the fly, awk may be a better choice, depending on how the lines are already ordered.
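As a rough sketch of the awk route: awk can pipe each line into a separate sort process per date, so no intermediate .ssv files are needed. The file name sample.txt and the YYYYMMDD.csv naming are assumptions for illustration, the lines stay space-separated with the date kept, and -s (stable sort) is a GNU/BSD option rather than something POSIX guarantees.

```shell
# Hypothetical sample input: four rows taken from the question
cat > sample.txt <<'EOF'
19/07/2013 19:18:5:903 6.35099E+17 Dr_Yellow 11.09363079 28.57788467 2.010284424
19/07/2013 19:36:33:645 6.35099E+17 Dr_Blue 10.77513885 28.3723774 1.897870064
20/07/2013 2:34:28:2 6.35099E+17 Dr_Yellow 10.41067123 26.84050369 0.919301987
20/07/2013 4:20:28:246 6.35099E+17 Dr_Blue 10.92816448 28.55761147 2.187088013
EOF

awk '{
  split($1, d, "/")                                # d[1]=DD, d[2]=MM, d[3]=YYYY
  cmd = "sort -s -k4,4 > " d[3] d[2] d[1] ".csv"   # one sort process per date
  print | cmd
  seen[cmd] = 1
}
END { for (c in seen) close(c) }                   # wait for every sort to finish
' sample.txt
```

Each distinct date starts one sort process, so this suits files covering a modest number of days.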

'csplit' does almost what you need, but you need to know the date ranges in order to write the regex to split on (you can easily get them with 'head' and 'tail' if you go down this route). If you don't know them, there is still the awk one-liner:

gawk '{ print $0 > (gensub(/\//, ".", "g", $1) ".csv") }' infile

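which writes the entire line ($0) to a file named after its first field ($1) plus ".csv". The 'gensub' call (a GNU awk function) replaces the forward slashes in the date with dots, since "/" cannot appear in a file name, so lines dated 19/07/2013 end up in 19.07.2013.csv. If your date contains other special characters, you may need to massage it further to make it acceptable for your OS.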
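To get the YYYYMMDD.csv names the question asks for, the date can be taken apart with split() instead, which also works in awks without gensub. This is a minimal sketch: the file name sample.txt is hypothetical, and it assumes the first field is always DD/MM/YYYY.

```shell
# Hypothetical sample input: three rows taken from the question
cat > sample.txt <<'EOF'
19/07/2013 19:14:24:523 6.35099E+17 Dr_Blue 10.42496014 27.17010689 0.685520172
19/07/2013 19:18:5:903 6.35099E+17 Dr_Yellow 11.09363079 28.57788467 2.010284424
20/07/2013 2:34:28:2 6.35099E+17 Dr_Yellow 10.41067123 26.84050369 0.919301987
EOF

awk '{
  split($1, d, "/")              # DD/MM/YYYY -> d[1], d[2], d[3]
  out = d[3] d[2] d[1] ".csv"    # e.g. 20130719.csv
  print > out                    # awk keeps the file open and appends later lines
}' sample.txt
```

Lines are written unchanged, so the output files are whitespace-separated like the input.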

As for sorting on the colour tag: the shell utility 'sort -k4,4' restricts the sort key to the fourth field only, but the alphabetical ordering you get may not be what you want. Then there is 'awk' again, though I find that sorting with awk's dynamic arrays (you dump all your lines into an array and then call GNU awk's 'asort' on it in the END rule) isn't lightning fast.
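A possible way to apply 'sort -k4,4' to files that have already been split is sketched below. The 2013*.csv glob and the example file are assumptions; LC_ALL=C is one way to get a predictable byte-wise ordering, and -s (a GNU/BSD option) keeps the original order among lines with the same tag.

```shell
# Hypothetical example file, as produced by a date-based split
cat > 20130719.csv <<'EOF'
19/07/2013 19:14:24:523 6.35099E+17 Dr_Blue 10.42496014 27.17010689 0.685520172
19/07/2013 19:18:5:903 6.35099E+17 Dr_Yellow 11.09363079 28.57788467 2.010284424
19/07/2013 19:36:33:645 6.35099E+17 Dr_Blue 10.77513885 28.3723774 1.897870064
EOF

for f in 2013*.csv; do
  # Sort on field 4 only; -o may safely name the input file as the output
  LC_ALL=C sort -s -k4,4 -o "$f" "$f"
done
```

After the loop, both Dr_Blue lines come first, still in time order, followed by the Dr_Yellow line.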
