
Sorting a column's values from a large CSV (more than a million records) using awk or bash

I am new to shell scripting.

I have a huge CSV file which contains more than 100k rows. I need to take one column, sort its contents, and write the result to another file; later I need to process this new file. Below is the sample data:

"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","MAS,CA.ON.OSC,ASIC*,AAAA","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","POU,ABC,MAS,CA.QC.OSC,CA.ON.OSC","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","BVC,AZX,CA.SK.FCAA,CA.NL.DSS","QQQQQQQQQRRCGHDKLKSLS"

Now you can see that field 4 contains data with commas in it as well. I need output in which field 4 is sorted, as below:

"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA","QQQQQQQQQRRCGHDKLKSLS"

To get this I have written the script below, but it does not seem to be efficient: for 100k records it took 20 minutes, so I am trying to find a more efficient solution.

# This replaces the commas inside "..." with | so that the line can later be split on ',' alone.
awk -F"\"" 'BEGIN{OFS="\""}{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/,/, "|", $i)}} {print $0}' "$FEED_FILE" > temp.csv

count=0
while IFS= read -r line
do
    # Break the line on ',' into an array holding the fields of the record.
    IFS=',' read -ra data <<< "$line"
    # Take the reportable-jurisdiction field (index 4 of the 0-based array).
    echo "REPORTABLE_JURISDICTION is : ${data[4]}"
    # Break that field on '|' to get the individual values.
    IFS='|' read -ra REPORTABLE_JURISDICTION_ARR <<< "${data[4]}"
    # Sort the values.
    IFS=$'\n' sorted=($(sort <<< "${REPORTABLE_JURISDICTION_ARR[*]}"))
    #printf "[%s]\n" "${sorted[@]}"
    # Join the sorted values back together with '|'.
    separator="|"
    joined="$( printf "${separator}%s" "${sorted[@]}" )"
    joined="${joined:${#separator}}" # remove the leading separator
    echo "${joined}"
    data[4]=${joined}
    echo "${data[4]}"
    # Rebuild the whole line, which will be written to the output file.
    separator=","
    joined="$( printf "${separator}%s" "${data[@]}" )"
    joined="${joined:${#separator}}" # remove the leading separator
    echo "${joined}" >> temp2.csv
    echo $count
    ((count++))
done < temp.csv

# Turn the '|' separators back into commas.
awk -F\| 'BEGIN{OFS=","} {$1=$1; print}' temp2.csv > temp3.csv
# Remove the trailing comma, if any.
sed 's/,$//' temp3.csv > "$OUT_FILE"

How to make it faster?

You're using the wrong tools for the task. While CSV seems so simple that you can easily process it with shell tools, your code will break for cells that contain newlines. Also, bash isn't very fast when processing lots of data.
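As a minimal illustration of the newline problem, here is a contrived record (not from the question's data) processed both ways:

import csv, io

# One logical CSV record whose second field contains an embedded newline.
data = '"id1","line one\nline two","x"\n'

# Line-oriented processing (awk, sed, `while read`) sees two broken lines:
print(data.splitlines())  # ['"id1","line one', 'line two","x"']

# A CSV-aware parser reconstructs the single record:
print(next(csv.reader(io.StringIO(data))))  # ['id1', 'line one\nline two', 'x']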

Try a tool which understands CSV directly, like http://csvkit.rtfd.org/, or use a programming language like Python. That allows you to do the task without starting external processes, the syntax is much more readable, and the result will be much more maintainable. Note: I'm suggesting Python because of the low initial cost.

With Python and the csv module, the code above would look like this:

import csv

FEED_FILE = '...'
OUT_FILE = '...'

with open(OUT_FILE, 'w', newline='') as fout:
    with open(FEED_FILE, newline='') as fin:
        reader = csv.reader(fin, delimiter=',', quotechar='"')
        # QUOTE_ALL keeps every field quoted, matching the input format.
        writer = csv.writer(fout, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_ALL)
        for row in reader:
            # Sort the comma-separated values inside the 4th field.
            row[3] = ','.join(sorted(row[3].split(',')))
            writer.writerow(row)
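Two details worth noting: quoting=csv.QUOTE_ALL makes the writer quote every field, which matches the fully quoted style of the sample data (by default the csv module only quotes fields that need it), and the sorted values must be joined back into a single string, since writing a Python list into a cell would emit its repr instead of the joined values.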

That said, there is not much you can do to speed up the awk and sed steps. The main cost in your script is the bash loop itself: the $( sort ... ) command substitution spawns a separate sort process for every single input line, which is most likely where the 20 minutes go.

With a single awk:

awk 'BEGIN{ FS=OFS="\042,\042" } { n=split($4,a,","); asort(a); sf=a[1];
     for(i=2;i<=n;i++) { sf=sf","a[i] } $4=sf; print $0 }' file > output.csv

output.csv contents:

"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS,","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA,","QQQQQQQQQRRCGHDKLKSLS"

  • FS=OFS="\042,\042" - treats "," as the field separator (\042 is the double-quote character)

  • n=split($4,a,",") - splits the 4th field into array a on the separator , and stores the element count in n

  • asort(a) - sorts the array by value
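One caveat: asort is a GNU awk (gawk) extension, so this one-liner needs gawk rather than mawk or BSD awk. In exchange it processes the whole file in a single pass with no per-line subprocesses, which is what makes it fast.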

Try pandas in python3. The only limitation: the data needs to fit into memory, and that can be a bit larger than your actual data. I sorted CSV files with 30,000,000 rows without any problem using this script, which I quickly wrote:

import pandas as pd
import os, datetime, traceback

L1_DIR = '/mnt/ssd/ASCII/'
suffix = '.csv'

for fname in sorted(os.listdir(L1_DIR)):
    if not fname.endswith(suffix):
        continue
    print("Start processing %s" % fname)
    s = datetime.datetime.now()
    fin_path = os.path.join(L1_DIR, fname)
    fname_out = fname.split('.')[0] + '.csv_sorted'
    fpath_out = os.path.join(L1_DIR, fname_out)

    df = pd.read_csv(fin_path)
    e = datetime.datetime.now()
    print("Read %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))

    s = datetime.datetime.now()
    df.set_index('ts', inplace=True)
    e = datetime.datetime.now()
    print("set_index %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))

    s = datetime.datetime.now()
    df.sort_index(inplace=True)
    e = datetime.datetime.now()
    print("sort_index %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e-s)))

    s = datetime.datetime.now()
    df.reset_index(inplace=True)
    # This one saves at ~10MB per second to disk.. One day is 7.5GB --> 750 seconds or 12.5 minutes
    df.to_csv(fpath_out, index=False)
    e = datetime.datetime.now()
    print("to_csv %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e - s)))
