
Efficiently convert yyyymmdd in delimited files

I have a large zipped .dat file in pipe-delimited format. The 6th column holds dates as yyyymmdd, which need to be converted to yyyy-mm-dd so that I can load the file into SQL Server with the bcp utility.

The content of the file looks like this (5.26 million rows, about 0.46 GB uncompressed):

AA|AA|AAA|AAAAA|AAAA|20191001|AA|...
BB|BB|BBB|BBBBB|BBBB|20191001|BB|...

My question is why the command-line awk is slower than the Python script. Is there a faster way to do this on the command line with awk or sed? There are 14 columns per line in total. The Python code splits the line and stops after the 6th column, which may contribute to the difference, but the columns after the 6th are generally not much longer.

]$ time unzip -p /aa/bb/somedata.zip | awk 'BEGIN{FS=OFS="|"} { $6=substr($6,1,4) "-" substr($6,5,2) "-" substr($6,7,2);} 1' > data1.DAT
real 0m19.141s
user 0m20.219s

]$ time ~/test.py;
real 0m12.393s
user 0m10.458s

]$ time unzip -p /aa/bb/somedata.zip > data2.DAT
real 0m1.952s
user 0m1.619s

The Python code looks like this:

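import zipfile
from io import TextIOWrapper

with zipfile.ZipFile('/aa/bb/somedata.zip', 'r') as zf:
    # the archive member name has to be passed as a string
    with zf.open('somedata.DAT', 'r') as f, open('/aa/bb/data2bcp.DAT', 'w') as wf:
        for line in TextIOWrapper(f):
            # split off only the first 6 fields; the rest of the line stays in toks[6]
            toks = line.split('|', 6)
            toks[5] = toks[5][0:4] + '-' + toks[5][4:6] + '-' + toks[5][6:8]
            wf.write('|'.join(toks))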

After I remove the limit on split, the run time of the Python script is not much different:

toks = line.split('|')

]$ time ~/test.py;
real 0m13.006s
user 0m12.164s
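
To make the two split variants concrete, here is a minimal sketch of what each call returns. The 14-column sample line and the helper names (`reformat_date`, `convert_limited`, `convert_full`) are made up for illustration and are not part of the original script.

```python
def reformat_date(yyyymmdd):
    """Turn 'yyyymmdd' into 'yyyy-mm-dd' by slicing, as in the script above."""
    return yyyymmdd[0:4] + '-' + yyyymmdd[4:6] + '-' + yyyymmdd[6:8]


def convert_limited(line):
    # maxsplit=6: at most 7 tokens, everything after the 6th pipe stays in one piece
    toks = line.split('|', 6)
    toks[5] = reformat_date(toks[5])
    return '|'.join(toks)


def convert_full(line):
    # no limit: every column becomes its own token
    toks = line.split('|')
    toks[5] = reformat_date(toks[5])
    return '|'.join(toks)


# Hypothetical 14-column line; columns 8-14 are placeholders.
sample = 'AA|AA|AAA|AAAAA|AAAA|20191001|AA|c8|c9|c10|c11|c12|c13|c14\n'

limited_count = len(sample.split('|', 6))
full_count = len(sample.split('|'))
print(limited_count, full_count)
print(convert_limited(sample) == convert_full(sample))
```

Both variants write the same line. The limited split only creates and rejoins fewer token objects per row, which fits the small gap between the two timings above.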

If you have gawk, you may try the following and compare the performance of substr() and gensub() on your large dataset:

$ cat infile
AA|AA|AAA|AAAAA|AAAA|20191001|AA|...
BB|BB|BBB|BBBBB|BBBB|20191001|BB|...

$ awk  'BEGIN{FS=OFS="|"}{$6=gensub(/(....)(..)(..)/,"\\1-\\2-\\3","",$6)}1' infile
AA|AA|AAA|AAAAA|AAAA|2019-10-01|AA|...
BB|BB|BBB|BBBBB|BBBB|2019-10-01|BB|...
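
For a like-for-like comparison on the Python side, the same capture-group rewrite can be sketched with the standard `re` module. This is only an illustration: `convert_line` is a hypothetical helper name, and whether the regex beats plain slicing on the real file would have to be measured.

```python
import re

# Same pattern as the gensub() call: 4 + 2 + 2 arbitrary characters.
DATE_RE = re.compile(r'(....)(..)(..)')


def convert_line(line):
    # Isolate the 6th field, leaving the rest of the line unsplit.
    toks = line.split('|', 6)
    # count=1 replaces only the first match, like a single gensub() substitution.
    toks[5] = DATE_RE.sub(r'\1-\2-\3', toks[5], count=1)
    return '|'.join(toks)


for row in ('AA|AA|AAA|AAAAA|AAAA|20191001|AA|...',
            'BB|BB|BBB|BBBBB|BBBB|20191001|BB|...'):
    print(convert_line(row))
```

Timing this against the slicing version on the full dataset would show how much of the cost comes from the regex itself and how much from splitting the line into fields.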

