
Sorting a column's values from a large CSV (more than a million records) using awk or bash

I'm new to shell scripting.

I have a huge CSV file with more than 100k rows. I need to take one column, sort its contents, and write the result to another file, which I then need to process further. Below is some sample data:

"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","MAS,CA.ON.OSC,ASIC*,AAAA","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","POU,ABC,MAS,CA.QC.OSC,CA.ON.OSC","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","BVC,AZX,CA.SK.FCAA,CA.NL.DSS","QQQQQQQQQRRCGHDKLKSLS"

As you can see, the data in field 4 also contains commas. I need the data in field 4 sorted, like this:

"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA","QQQQQQQQQRRCGHDKLKSLS"

To get that result I wrote the script below, but it doesn't seem efficient: it takes 20 minutes for 100k records, so I'm trying to find a more efficient solution.

#this command replaces the commas inside double quotes with | so that the line can be split on ',' (comma)
awk -F"\"" 'BEGIN{OFS="\""}{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/,/, "|", $i)}} {print $0}' $FEED_FILE > temp.csv

count=0
while read line
do
    #break the line on comma ',' to get an array of fields.
    IFS=',' read -ra data <<< "$line" #'data' holds the fields of the full record.
    #take the field that holds the reportable jurisdiction.
    echo "REPORTABLE_JURISDICTION is : " ${data[4]}
    #break that field on pipe '|' to get the individual values
    IFS='|' read -ra REPORTABLE_JURISDICTION_ARR <<< "${data[4]}"
    #sort this array (forks a sort process per input line)
    IFS=$'\n' sorted=($(sort <<<"${REPORTABLE_JURISDICTION_ARR[*]}"))
    #printf "[%s]\n" "${sorted[@]}"
    separator="|" # join the sorted values back together with '|'
    joined="$( printf "${separator}%s" "${sorted[@]}" )"
    joined="${joined:${#separator}}" # remove the leading separator
    echo "${joined}"
    data[4]=${joined}
    echo "${data[4]}"
    #rebuild the whole line, which will be written to the output file.
    separator="," # join all fields back into one line
    line_out="$( printf "${separator}%s" "${data[@]}" )"
    line_out="${line_out:${#separator}}" # remove the leading separator
    echo "${line_out}" >> temp2.csv
    echo $count
    ((count++))
done < temp.csv

#remove the '|' from the lines and put the commas back
awk -F\| 'BEGIN{OFS=","} {$1=$1; print}' temp2.csv > temp3.csv
# to remove the trailing comma, if any
sed 's/,$//' temp3.csv > $OUT_FILE

How can I make it faster?

You're using the wrong tools for the task. While CSV looks simple enough to process with shell utilities, your code will break on cells that contain embedded newlines. bash is also not very fast at processing lots of data.

Try a tool that understands CSV natively, such as http://csvkit.rtfd.org/, or use a programming language like Python. That lets you complete the task without starting external processes, the syntax is far more readable, and the result will be much more maintainable. Note: I'm suggesting Python because of its low barrier to entry.

With Python and the csv module, the code above would look like this:

import csv

FEED_FILE = '...'
OUT_FILE = '...'
with open(OUT_FILE, 'w', newline='') as fout:
    with open(FEED_FILE, newline='') as fin:  # 'in' is a reserved word, so use another name
        reader = csv.reader(fin, delimiter=',', quotechar='"')
        # QUOTE_ALL keeps every field quoted on output, matching the input format
        writer = csv.writer(fout, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
        for row in reader:
            # sort the comma-separated values inside the 4th field and rejoin them
            row[3] = ','.join(sorted(row[3].split(',')))
            writer.writerow(row)
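
As a quick sanity check (not part of the original answer), the same transform applied to the first sample row via io.StringIO reproduces the first line of the desired output shown in the question:

import csv, io

sample = '"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","MAS,CA.ON.OSC,ASIC*,AAAA","XVCCCCCCCCCCYYUUUUU"\n'

out = io.StringIO()
writer = csv.writer(out, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL, lineterminator='\n')
for row in csv.reader(io.StringIO(sample), delimiter=',', quotechar='"'):
    row[3] = ','.join(sorted(row[3].split(',')))  # -> AAAA,ASIC*,CA.ON.OSC,MAS
    writer.writerow(row)
print(out.getvalue(), end='')

Because csv.reader streams one row at a time, memory use stays flat no matter how many records the feed file contains.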

That said, there is nothing glaringly wrong with your code, and there is little to speed up in the awk and sed steps. Note, however, that the main bash loop runs a $(sort ...) command substitution for every record, forking a process per input line; with 100k records, that is most likely where the 20 minutes go.

Using a single awk:

awk 'BEGIN{ FS=OFS="\042,\042" }{ n=split($4,a,","); asort(a); sf=a[1];
     for(i=2;i<=n;i++) { sf=sf","a[i] } $4=sf; print $0 }' file > output.csv

Contents of output.csv:

"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS,","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA,","QQQQQQQQQRRCGHDKLKSLS"

  • FS=OFS="\042,\042" - treat "," as the field separator and output separator (\042 is the octal escape for a double quote)

  • n=split($4,a,",") - split the 4th field into array a on commas; n receives the number of values

  • asort(a) - sort the array by value (see the note below)
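
Two details worth noting. asort() is specific to GNU awk (gawk), so this one-liner won't run as-is where awk is mawk or BSD awk. Also, because FS is the three-character sequence \042,\042 (double quote, comma, double quote), $1 keeps the line's leading double quote and the last field keeps its trailing one, which is why the record can be reassembled with OFS and printed unchanged.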

With python3, try pandas. The only limitation: the data needs to fit in memory, which may be a bit more than your actual data size. I've sorted CSV files of 30,000,000 rows with this quickly-written script without any problem:

import pandas as pd
import os, datetime, traceback

L1_DIR = '/mnt/ssd/ASCII/'
suffix = '.csv'

for fname in sorted(os.listdir(L1_DIR)):
    if not fname.endswith(suffix):
        continue
    print("Start processing %s" % fname)
    s = datetime.datetime.now()
    fin_path = os.path.join(L1_DIR, fname)
    fname_out = fname.split('.')[0] + '.csv_sorted'
    fpath_out = os.path.join(L1_DIR, fname_out)

    df = pd.read_csv(fin_path)
    e = datetime.datetime.now()
    print("Read %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))

    s = datetime.datetime.now()
    df.set_index('ts', inplace=True)
    e = datetime.datetime.now()
    print("set_index %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))

    s = datetime.datetime.now()
    df.sort_index(inplace=True)
    e = datetime.datetime.now()
    print("sort_index %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e-s)))

    s = datetime.datetime.now()
    df.reset_index(inplace=True)
    # This one saves at ~10MB per second to disk. One day is 7.5GB --> 750 seconds or 12.5 minutes
    df.to_csv(fpath_out, index=False)
    e = datetime.datetime.now()
    print("to_csv %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e - s)))

