Sorting a column's value from a large CSV (more than a million records) using awk or bash
I am new to shell scripting.
I have a huge CSV file with more than 100k rows. I need to take one column, sort its values, write the result to another file, and then process that new file. Below is sample data:
"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","MAS,CA.ON.OSC,ASIC*,AAAA","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","POU,ABC,MAS,CA.QC.OSC,CA.ON.OSC","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","BVC,AZX,CA.SK.FCAA,CA.NL.DSS","QQQQQQQQQRRCGHDKLKSLS"
As you can see, the data in field 4 itself contains commas. I need field 4 sorted so the data looks like this:
"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA","QQQQQQQQQRRCGHDKLKSLS"
To achieve this I wrote the script below, but it does not seem efficient: 100k records take 20 minutes, so I am looking for a faster solution.
#this command replaces the comma inside "" with | so that I can split the line based on ','(comma)
awk -F"\"" 'BEGIN{OFS="\""}{for(i=1;i<=NF;++i){ if(i%2==0) gsub(/,/, "|", $i)}} {print $0}' $FEED_FILE > temp.csv
count=0;
while read line
do
#break the line on comma ',' and get the array of strings.
IFS=',' read -ra data <<< "$line" #'data' is the array of the record of full line.
#take the 8th column, which is the reportable jurisdiction.
echo "REPORTABLE_JURISDICTION is : " ${data[4]}
#break the data on pipe '|' and sort it
IFS='|' read -ra REPORTABLE_JURISDICTION_ARR <<< "${data[4]}"
#Sort this array
IFS=$'\n' sorted=($(sort <<<"${REPORTABLE_JURISDICTION_ARR[*]}"))
#printf "[%s]\n" "${sorted[@]}"
separator="|" # e.g. constructing regex, pray it does not contain %s
regex="$( printf "${separator}%s" "${sorted[@]}" )"
regex="${regex:${#separator}}" # remove leading separator
echo "${regex}"
data[4]=${regex}
echo "${data[4]}"
#here we are building the whole line which will be written to the output file.
separator="," # e.g. constructing regex, pray it does not contain %s
regex="$( printf "${separator}%s" "${data[@]}" )"
regex="${regex:${#separator}}" # remove leading separator
echo "${regex}" >> temp2.csv
echo $count
((count++))
done < temp.csv
#remove the '|' from the data and put the commas back
awk -F\| 'BEGIN{OFS=","} {$1=$1; print}' temp2.csv > temp3.csv
# to remove the trailing , if any
sed 's/,$//' temp3.csv > $OUT_FILE
How can I make it faster?
You are using the wrong tools for the task. Although CSV looks simple enough to process with shell tools, your code will break on cells that contain newlines, and bash is not very fast at churning through large amounts of data.
Try a tool that understands CSV natively (such as http://csvkit.rtfd.org/), or a programming language like Python. That lets you do the job without starting external processes, the syntax is more readable, and the result will be easier to maintain. Note: I suggest Python because of its low startup cost.
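To see why a CSV-aware parser matters, here is a minimal sketch (the record is made up for illustration) showing Python's built-in csv module keeping a quoted cell intact even when it contains a comma and a newline, which naive comma-splitting cannot do:

```python
import csv
import io

# a made-up record whose second cell contains both a comma and a newline
raw = '"id1","THE TORONTO,DOMINION\nBANK","end"\r\n'
row = next(csv.reader(io.StringIO(raw)))
print(len(row))  # 3 -- the quoted comma did not start a new field
print(row[1])    # the cell comes back whole, newline included
```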
Using Python and the csv module, the code above looks like this:
import csv

FEED_FILE = '...'
OUT_FILE = '...'

with open(FEED_FILE, newline='') as fin, open(OUT_FILE, 'w', newline='') as fout:
    reader = csv.reader(fin, delimiter=',', quotechar='"')
    # QUOTE_ALL keeps every field quoted, as in the sample data
    writer = csv.writer(fout, quoting=csv.QUOTE_ALL)
    for row in reader:
        # sort the comma-separated values in field 4 (index 3) and rejoin them
        row[3] = ','.join(sorted(row[3].split(',')))
        writer.writerow(row)
That said, there is nothing glaringly wrong with your code. There is little to be gained from speeding up the awk and sed steps, and as far as I can see the main bash loop does not spawn many external processes.
With a single awk:
awk 'BEGIN{ FS=OFS="\042,\042"}{ n=split($4,a,","); asort(a); sf=a[1];
for(i=2;i<=n;i++) { sf=sf","a[i] } $4=sf; print $0 }' file > output.csv
Contents of output.csv:
"PT3QB789TSUIDF371261","THE TORONTO,DOMINION BANK","HZSN7FQBPO5IEWYIGC72","AAAA,ASIC*,CA.ON.OSC,MAS","XVCCCCCCCCCCYYUUUUU"
"11111111111111111111","ABC,XYZ,QWE","HZSN7FQBPO5IEWYIGC72","ABC,CA.ON.OSC,CA.QC.OSC,MAS,POU","XVRRRRRRRRTTTTTTTTTTTTT"
"22222222222222222222","BHC,NBC,MKY","HZSN7FQBPO5IEWYIGC72","AZX,BVC,CA.NL.DSS,CA.SK.FCAA","QQQQQQQQQRRCGHDKLKSLS"
FS=OFS="\042,\042" - treat "," (a comma surrounded by double quotes; \042 is the double-quote character) as the field separator
split($4,a,",") - split the 4th field into array a on the , delimiter
asort(a) - sort the array by value (a GNU awk function)
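The split / asort / rejoin sequence can be checked in isolation. This small Python sketch (illustration only, using field 4 of the first sample row) mirrors what the awk program does to each record:

```python
field4 = "MAS,CA.ON.OSC,ASIC*,AAAA"  # field 4 of the first sample row
parts = field4.split(",")            # split($4, a, ",")
parts.sort()                         # asort(a)
print(",".join(parts))               # prints AAAA,ASIC*,CA.ON.OSC,MAS
```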
In python3,
try pandas. The only limitation: the data has to fit in memory, which can be a bit more than your actual data size. I used this script, which I wrote quickly, to sort a CSV file of 30,000,000 rows without any problem:
import pandas as pd
import os, datetime, traceback

L1_DIR = '/mnt/ssd/ASCII/'
suffix = '.csv'

for fname in sorted(os.listdir(L1_DIR)):
    if not fname.endswith(suffix):
        continue
    print("Start processing %s" % fname)
    s = datetime.datetime.now()
    fin_path = os.path.join(L1_DIR, fname)
    fname_out = fname.split('.')[0] + '.csv_sorted'
    fpath_out = os.path.join(L1_DIR, fname_out)
    df = pd.read_csv(fin_path)
    e = datetime.datetime.now()
    print("Read %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))
    s = datetime.datetime.now()
    df.set_index('ts', inplace=True)
    e = datetime.datetime.now()
    print("set_index %s rows from %s. Took (%s)" % (len(df.index), fname, (e-s)))
    s = datetime.datetime.now()
    df.sort_index(inplace=True)
    e = datetime.datetime.now()
    print("sort_index %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e-s)))
    s = datetime.datetime.now()
    df.reset_index(inplace=True)
    # This one saves at ~10MB per second to disk.. One day is 7.5GB --> 750 seconds or 12.5 minutes
    df.to_csv(fpath_out, index=False)
    e = datetime.datetime.now()
    print("to_csv %s rows from [%s] to [%s]. Took (%s)" % (len(df.index), fname, fname_out, (e - s)))
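Note that the script above sorts whole rows by a 'ts' index column from its own data. To apply pandas to this question's within-field sort, the same split-sort-join idea can be mapped over the 4th column. A hedged sketch (the inline sample row and the QUOTE_ALL quoting are assumptions made to match the sample data):

```python
import csv
import io

import pandas as pd

# inline stand-in for the feed file (second sample row, shortened)
sample = '"11111111111111111111","ABC,XYZ,QWE","HZ","POU,ABC,MAS","XV"\n'
df = pd.read_csv(io.StringIO(sample), header=None, dtype=str)
# sort the comma-separated tokens inside the 4th column (index 3)
df[3] = df[3].map(lambda s: ",".join(sorted(s.split(","))))
out = df.to_csv(index=False, header=False, quoting=csv.QUOTE_ALL)
print(out.strip())  # "11111111111111111111","ABC,XYZ,QWE","HZ","ABC,MAS,POU","XV"
```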
Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please credit this site or the original address. For any questions, contact: yoyou2525@163.com.