[英]Sorting a columns value from a large csv(more than a million records) using awk or bash
我是 shell 腳本的新手。
我有一個巨大的 csv 文件,其中包含超過 10 萬行。 我需要找到一列並對其進行排序並將其寫入另一個文件,然后我需要處理這個新文件。 下面是示例數據
"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","MAS,CA.ON.OSC,ASIC*,AAAA","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","POU,ABC,MAS,CA.QC.OSC,CA.ON.OSC","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","BVC,AZX,CA.SK.FCAA,CA.NL.DSS","QQQQQQQQQRRCGHDKLKSLS"
現在您可以看到字段 4 的數據也包含逗號。 現在我需要字段 4 整理的數據如下:
"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA","QQQQQQQQQRRCGHDKLKSLS"
為了獲得這個解決方案,我編寫了一個如下的腳本文件,但該解決方案似乎效率不高,因為 100k 條記錄需要 20 分鍾,所以試圖獲得有效的解決方案
#this command replaces the comma inside "" with | so that I can split the line based on ','(comma)
awk -F"\"" 'BEGIN{OFS="\""}{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/,/, "|", $i)}} {print $0}' $FEED_FILE > temp.csv
count=0;
while read line
do
#break the line on comma ',' and get the array of strings.
IFS=',' read -ra data <<< "$line" #'data' is the array of the record of full line.
#take the 8th column, which is the reportable jurisdiction.
echo "REPORTABLE_JURISDICTION is : " ${data[4]}
#brake the data based on pipe '|' and sort the data
IFS='|' read -ra REPORTABLE_JURISDICTION_ARR <<< "${data[4]}"
#Sort this array
IFS=$'\n' sorted=($(sort <<<"${REPORTABLE_JURISDICTION_ARR[*]}"))
#printf "[%s]\n" "${sorted[@]}"
separator="|" # e.g. constructing regex, pray it does not contain %s
regex="$( printf "${separator}%s" "${sorted[@]}" )"
regex="${regex:${#separator}}" # remove leading separator
echo "${regex}"
data[4]=${regex}
echo "$data[68]"
#here we are building the whole line which will be written to the output file.
separator="," # e.g. constructing regex, pray it does not contain %s
regex="$( printf "${separator}%s" "${data[@]}" )"
regex="${regex:${#separator}}" # remove leading separator
echo "${regex}" >> temp2.csv
echo $count
((count++))
done < temp.csv
#remove the '|' from the and put the comma back
awk -F\| 'BEGIN{OFS=","} {$1=$1; print}' temp2.csv > temp3.csv
# to remove the tailing , if any
sed 's/,$//' temp3.csv > $OUT_FILE
如何讓它更快?
您正在使用錯誤的工具來完成任務。 雖然 CSV 看起來非常簡單,您可以使用 shell 工具輕松處理它,但是對於包含新行的單元格,您的代碼會中斷。 bash 在處理大量數據時也不是很快。
嘗試使用直接理解 CSV 的工具(如http://csvkit.rtfd.org/)或使用 Python 等編程語言。 這允許您在不啟動外部進程的情況下完成任務,語法更具可讀性,結果將更易於維護。 注意:由於初始成本低,我建議使用 Python。
使用 python 和csv
模塊,上面的代碼看起來像這樣:
import csv
FEED_FILE = '...'
OUT_FILE = '...'
with open(OUT_FILE, 'w', newline='') as out:
with open(FEED_FILE, newline='') as in:
reader = csv.reader(in, delimiter=',', quotechar='"')
writer = csv.writer(
for row in reader:
row[3] = sorted(list(row[3].split(',')))
writer.writerow(row)
也就是說,您的代碼沒有任何明顯錯誤。 沒有什么可以加快 awk 和 sed 的速度,而且就我所見,主 bash 循環不會產生許多外部進程。
使用單個awk :
awk 'BEGIN{ FS=OFS="\042,\042"}{ split($4,a,","); asort(a); sf=a[1];
for(i=2;i<=NF;i++) { sf=sf","a[i] } $4=sf; print $0 }' file > output.csv
output.csv內容:
"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS,","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA,","QQQQQQQQQRRCGHDKLKSLS"
FS=OFS="\\042,\\042"
- 考慮","
作為字段分隔符
split($4,a,",")
- 按分隔符將第 4 個字段拆分為數組,
asort(a)
- 按值對數組進行排序
在python3
嘗試熊貓。 唯一的限制:數據需要適合內存。 這可能比您的實際數據大一點。 我使用這個腳本對 30.000.000 行的 CSV 文件進行了排序,沒有任何問題,我很快寫道:
import pandas as pd
import os, datetime, traceback
L1_DIR = '/mnt/ssd/ASCII/'
suffix = '.csv
for fname in sorted(os.listdir(L1_DIR)):
if not fname.endswith(suffix):
continue
print("Start processing %s" % fname)
s = datetime.datetime.now()
fin_path = os.path.join(L1_DIR, fname)
fname_out = fname.split('.')[0] + '.csv_sorted'
fpath_out = os.path.join(L1_DIR, fname_out)
df = pd.read_csv(fin_path)
e = datetime.datetime.now()
print("Read %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))
s = datetime.datetime.now()
df.set_index('ts', inplace=True)
e = datetime.datetime.now()
print("set_index %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))
s = datetime.datetime.now()
df.sort_index(inplace=True)
e = datetime.datetime.now()
print("sort_index %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e-s)))
s = datetime.datetime.now()
df.reset_index(inplace=True)
# This one saves at ~10MB per second to disk.. One day is 7.5GB --> 750 seconds or 12.5 minutes
df.to_csv(fpath_out, index=False)
e = datetime.datetime.now()
print("to_csv %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e - s)))
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.