
Sorting a column's values from a large CSV (more than a million records) using awk or bash

I am new to shell scripting.

I have a huge CSV file containing more than 100k rows. I need to take a column, sort its contents, and write the result to another file, which I then need to process further. Below is the sample data:

"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","MAS,CA.ON.OSC,ASIC*,AAAA","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","POU,ABC,MAS,CA.QC.OSC,CA.ON.OSC","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","BVC,AZX,CA.SK.FCAA,CA.NL.DSS","QQQQQQQQQRRCGHDKLKSLS"

As you can see, the data in field 4 also contains commas. I need the data in field 4 sorted, like below:

"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA","QQQQQQQQQRRCGHDKLKSLS"

To get this result I wrote the following script, but it does not seem efficient: 100k records take 20 minutes, so I am looking for an efficient solution.

#This command replaces the commas inside "" with | so that the line can be split on ',' (comma).
awk -F"\"" 'BEGIN{OFS="\""}{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/,/, "|", $i)}} {print $0}' $FEED_FILE > temp.csv

count=0
while read line
do
    #break the line on comma ',' and get the array of strings.
    IFS=',' read -ra data <<< "$line" #'data' is the array holding the fields of the full line.
    #take the 8th column, which is the reportable jurisdiction.
    echo "REPORTABLE_JURISDICTION is : " ${data[4]}
    #break that field on pipe '|' and sort the values.
    IFS='|' read -ra REPORTABLE_JURISDICTION_ARR <<< "${data[4]}"
    #sort this array
    IFS=$'\n' sorted=($(sort <<<"${REPORTABLE_JURISDICTION_ARR[*]}"))
    #printf "[%s]\n" "${sorted[@]}"
    separator="|" # used by printf below to join the sorted values
    regex="$( printf "${separator}%s" "${sorted[@]}" )"
    regex="${regex:${#separator}}" # remove the leading separator
    echo "${regex}"
    data[4]=${regex}
    echo "${data[4]}"
    #here we rebuild the whole line, which will be written to the output file.
    separator="," # used by printf below to join the fields
    regex="$( printf "${separator}%s" "${data[@]}" )"
    regex="${regex:${#separator}}" # remove the leading separator
    echo "${regex}" >> temp2.csv
    echo $count
    ((count++))
done < temp.csv

#remove the '|' and put the commas back
awk -F\| 'BEGIN{OFS=","} {$1=$1; print}' temp2.csv > temp3.csv
#remove the trailing comma, if any
sed 's/,$//' temp3.csv > $OUT_FILE

How can I make it faster?

You are using the wrong tools for the task. While CSV looks simple enough that you can process it with shell tools, your code will break on cells that contain newlines. bash is also not very fast when processing lots of data.

Try a tool that understands CSV directly, like http://csvkit.rtfd.org/, or use a programming language like Python. That allows you to complete the task without starting external processes, the syntax is much more readable, and the result will be much more maintainable. Note: I am suggesting Python because of the low initial cost.

With Python and the csv module, the code above would look like this:

import csv

FEED_FILE = '...'
OUT_FILE = '...'

with open(OUT_FILE, 'w', newline='') as fout:
    with open(FEED_FILE, newline='') as fin:
        reader = csv.reader(fin, delimiter=',', quotechar='"')
        # QUOTE_ALL keeps every output field quoted, matching the sample data.
        writer = csv.writer(fout, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
        for row in reader:
            # Sort the comma-separated values inside the 4th field and join them back.
            row[3] = ','.join(sorted(row[3].split(',')))
            writer.writerow(row)
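
As a quick sanity check, a minimal sketch run against the first sample row from the question shows that the csv module copes with the quoted commas on its own, so no pre-processing pass with '|' placeholders is needed:

import csv, io

sample = '"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","MAS,CA.ON.OSC,ASIC*,AAAA","XVCCCCCCCCCCYYUUUUU"'
row = next(csv.reader(io.StringIO(sample)))
print(len(row))                             # 5 -- the quoted commas did not split the fields
print(','.join(sorted(row[3].split(','))))  # AAAA,ASIC*,CA.ON.OSC,MAS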

That said, there is nothing obviously wrong with your code. There is not much to speed up in the awk and sed calls, and as far as I can see, the main bash loop does not spawn many external processes.

Using a single awk:

awk 'BEGIN{ FS=OFS="\042,\042" }{ n=split($4,a,","); asort(a); sf=a[1];
     for(i=2;i<=n;i++) { sf=sf","a[i] } $4=sf; print $0 }' file > output.csv

Contents of output.csv:

"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS,","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA,","QQQQQQQQQRRCGHDKLKSLS"

  • FS=OFS="\042,\042" - treats "," as the field separator (\042 is the octal escape for a double quote)

  • split($4,a,",") - splits the 4th field into an array a on the commas, returning the element count n

  • asort(a) - sorts the array by value (a GNU awk extension)
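
Note that asort() is a GNU awk (gawk) extension; the default awk on BSD and macOS systems does not provide it, so the one-liner above needs gawk there.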

With python3, try pandas. The only limitation: the data needs to fit into memory (which can take a bit more than the size of your actual data). I sorted CSV files with 30,000,000 rows without any problems using this script, which I wrote quickly:

import pandas as pd
import os, datetime, traceback

L1_DIR = '/mnt/ssd/ASCII/'
suffix = '.csv'

for fname in sorted(os.listdir(L1_DIR)):
    if not fname.endswith(suffix):
        continue
    print("Start processing %s" % fname)
    s = datetime.datetime.now()
    fin_path = os.path.join(L1_DIR, fname)
    fname_out = fname.split('.')[0] + '.csv_sorted'
    fpath_out = os.path.join(L1_DIR, fname_out)

    df = pd.read_csv(fin_path)
    e = datetime.datetime.now()
    print("Read %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))

    s = datetime.datetime.now()
    df.set_index('ts', inplace=True)
    e = datetime.datetime.now()
    print("set_index %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))

    s = datetime.datetime.now()
    df.sort_index(inplace=True)
    e = datetime.datetime.now()
    print("sort_index %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e-s)))

    s = datetime.datetime.now()
    df.reset_index(inplace=True)
    # This one saves at ~10MB per second to disk.. One day is 7.5GB --> 750 seconds or 12.5 minutes
    df.to_csv(fpath_out, index=False)
    e = datetime.datetime.now()
    print("to_csv %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e - s)))

