
Sorting a column's values in a large CSV (more than a million records) using awk or bash

I am new to shell scripting.

I have a huge CSV file with more than 100k rows. I need to take one column, sort its values, and write the result to another file, which I then process further. Below is the sample data:

"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","MAS,CA.ON.OSC,ASIC*,AAAA","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","POU,ABC,MAS,CA.QC.OSC,CA.ON.OSC","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","BVC,AZX,CA.SK.FCAA,CA.NL.DSS","QQQQQQQQQRRCGHDKLKSLS"

As you can see, field 4 contains commas as well. I need the data with field 4 sorted, as below:

"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA","QQQQQQQQQRRCGHDKLKSLS"

To get this result I wrote the script below, but it does not seem efficient: it took 20 minutes for 100k records, so I am looking for a faster solution.

#this command replaces the comma inside "" with | so that I can split the line based on ','(comma)
awk -F"\"" 'BEGIN{OFS="\""}{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/,/, "|", $i)}} {print $0}' $FEED_FILE > temp.csv

count=0;
while read line
do
       #break the line on comma ',' and get the array of strings.
           IFS=',' read -ra data <<< "$line" #'data' is the array of the record of full line.
           #take the 8th column, which is the reportable jurisdiction.  
            echo "REPORTABLE_JURISDICTION is : " ${data[4]}
            #break the data based on pipe '|' and sort the data
                IFS='|' read -ra REPORTABLE_JURISDICTION_ARR <<< "${data[4]}"
                    #Sort this array
                    IFS=$'\n' sorted=($(sort <<<"${REPORTABLE_JURISDICTION_ARR[*]}"))           
                    #printf "[%s]\n" "${sorted[@]}"
                    separator="|" # e.g. constructing regex, pray it does not contain %s
                    regex="$( printf "${separator}%s" "${sorted[@]}" )"
                    regex="${regex:${#separator}}" # remove leading separator
                    echo "${regex}"
            data[4]=${regex}
            echo "$data[68]"
            #here we are building the whole line which will be written to the output file.
            separator="," # e.g. constructing regex, pray it does not contain %s
                    regex="$( printf "${separator}%s" "${data[@]}" )"
                    regex="${regex:${#separator}}" # remove leading separator
                        echo "${regex}" >> temp2.csv
        echo $count
        ((count++))    
done < temp.csv

#remove the '|' and put the commas back
awk -F\| 'BEGIN{OFS=","} {$1=$1; print}' temp2.csv > temp3.csv
# to remove the trailing , if any
sed 's/,$//' temp3.csv > $OUT_FILE

How to make it faster?

You're using the wrong tools for the task. CSV may look simple enough to process with shell tools, but your code will break for cells that contain newlines. Also, bash isn't very fast at processing lots of data.

Try a tool that understands CSV directly, like http://csvkit.rtfd.org/, or use a programming language like Python. That lets you do the task without starting external processes, the syntax is much more readable, and the result will be much more maintainable. Note: I'm suggesting Python because of the low initial cost.

With Python and the csv module, the code above would look like this:

import csv
FEED_FILE = '...'
OUT_FILE = '...'
with open(OUT_FILE, 'w', newline='') as fout:
    with open(FEED_FILE, newline='') as fin:
        reader = csv.reader(fin, delimiter=',', quotechar='"')
        # QUOTE_ALL keeps every field quoted, matching the input format
        writer = csv.writer(fout, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
        for row in reader:
            # sort the comma-separated values inside the 4th field and join them back
            row[3] = ','.join(sorted(row[3].split(',')))
            writer.writerow(row)
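
One note on the writer settings: quoting=csv.QUOTE_ALL keeps every field quoted like the input; with the default QUOTE_MINIMAL, only fields that contain commas (or quotes) would be quoted. Since the csv module already understands quoted fields, no separate pass is needed to protect the commas inside the quotes.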

That said, there is nothing obviously wrong with your code. There is not much that you can do to speed up awk and sed and the main bash loop doesn't spawn many external processes as far as I can see.

With a single awk:

awk 'BEGIN{ FS=OFS="\042,\042" }{ n=split($4,a,","); asort(a); sf=a[1];
     for(i=2;i<=n;i++) { sf=sf","a[i] } $4=sf; print $0 }' file > output.csv

output.csv contents:

"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS,","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA,","QQQQQQQQQRRCGHDKLKSLS"

  • FS=OFS="\042,\042" - treat "," as the field separator (\042 is the octal escape for a double quote)

  • n=split($4,a,",") - split the 4th field into the array a on ,; n is the number of elements

  • asort(a) - sort the array by values
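
Note that asort() is a GNU awk extension, so this one-liner needs gawk; it will not work with mawk or BSD awk.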

Try pandas in Python 3. Only limitation: the data needs to fit into memory, and that can be a bit more than the actual data size. I have sorted CSV files with 30,000,000 rows without any problem using this script, which I wrote quickly:

import pandas as pd
import os, datetime, traceback

L1_DIR = '/mnt/ssd/ASCII/'
suffix = '.csv'

for fname in sorted(os.listdir(L1_DIR)):
    if not fname.endswith(suffix):
        continue
    print("Start processing %s" % fname)
    s = datetime.datetime.now()
    fin_path = os.path.join(L1_DIR, fname)
    fname_out = fname.split('.')[0] + '.csv_sorted'
    fpath_out = os.path.join(L1_DIR, fname_out)

    df = pd.read_csv(fin_path)
    e = datetime.datetime.now()
    print("Read %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))

    s = datetime.datetime.now()
    df.set_index('ts', inplace=True)
    e = datetime.datetime.now()
    print("set_index %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))

    s = datetime.datetime.now()
    df.sort_index(inplace=True)
    e = datetime.datetime.now()
    print("sort_index %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e-s)))

    s = datetime.datetime.now()
    df.reset_index(inplace=True)
    # This one saves at ~10MB per second to disk.. One day is 7.5GB --> 750 seconds or 12.5 minutes
    df.to_csv(fpath_out, index=False)
    e = datetime.datetime.now()
    print("to_csv %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e - s)))
