Sorting huge files with millions of lines
I have tens of millions of strings in a text file, like these:
aa kk
bb mm
cc tt
ee ff
aa xx
bb ss
cc gg
ee rr
And I want to make them look like:
aa kk,xx
bb mm,ss
cc tt,gg
ee ff,rr
I have tried to sort and rearrange them with grep, sed and other tools, but that seems to be very slow on really huge files, even with
LC_ALL=C grep something
I'm not clear whether you specifically want to do this with just standard shell tools, but Python is nearly universal on Linux these days. It can be done with a fairly simple program:
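#!/usr/bin/env python3
import sys

# Map each key (first column) to the list of its values (second column).
data = {}
for line in sys.stdin:
    a, b = line.split()
    data.setdefault(a, []).append(b)

# Print the keys in sorted order, with each key's values joined by commas.
for k in sorted(data):
    print(k, ",".join(data[k]))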
I ran it on 50,000,000 lines of data generated by the following C program, and it finished in about 60 seconds on my years-old laptop:
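#include <stdio.h>
#include <stdlib.h>

/* Return a random lowercase letter 'a'..'z' (ASCII 97..122). */
char letter(void) { return (rand() % (123 - 97)) + 97; }

int main(void)
{
    int i;
    /* Emit 50,000,000 lines of the form "xxx yyy". */
    for (i = 0; i < 50000000; i++)
        printf("%c%c%c %c%c%c\n",
               letter(), letter(), letter(),
               letter(), letter(), letter());
    return 0;
}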
awk '{if(b[$1])b[$1] = b[$1]","; b[$1] = b[$1] $2 $3}; END{for(i in b)print i, b[i]}' file
Output:
aa kk,xx
bb mm,ss
cc tt,gg
ee ff,rr
Source: https://stackoverflow.com/a/26450166/3776858
For better performance and lower memory use:
sort -u YourFile | awk '{if (Last == $1) {Linked=Linked","$2} else { if (Last != "") print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'
The sort -u step first removes duplicate lines and puts the input in order, so awk can read it line by line instead of loading a huge array (you mention millions of lines). awk concatenates the values while the key is the same as on the previous line, and prints the group when the key changes. The END block prints the last group, and the if handles the first line.
This may be a bit faster:
sort -u YourFile | awk 'FNR==1{Last=$1;Linked=$2} FNR>1{if (Last == $1) {Linked=Linked","$2} else { print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'
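The same sort-then-stream idea can be sketched in Python, for readers who prefer it to awk. This is only an illustration, not one of the original answers: the function name merge_sorted_pairs is made up here, and it assumes the input is already sorted by key (for example by sort -u) and that every non-blank line has exactly two whitespace-separated fields. Only one group is held in memory at a time.

```python
from itertools import groupby


def merge_sorted_pairs(lines):
    """Yield 'key v1,v2,...' for lines that are already sorted by key.

    Hypothetical helper: assumes two whitespace-separated fields per line.
    """
    # Skip blank lines and split the rest into [key, value] pairs.
    pairs = (line.split() for line in lines if line.strip())
    # groupby only merges adjacent equal keys, hence the sorted-input assumption.
    for key, group in groupby(pairs, key=lambda p: p[0]):
        yield key + " " + ",".join(p[1] for p in group)


# Small in-memory sample standing in for the output of `sort -u YourFile`.
sample = ["aa kk\n", "aa xx\n", "bb mm\n", "bb ss\n"]
merged = list(merge_sorted_pairs(sample))
print("\n".join(merged))
```

For real data, any iterable of lines can be passed in place of the sample list, for instance sys.stdin at the end of a shell pipeline.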
If you have to process very large data sets, I suggest using the MapReduce pattern, for example with the Hadoop framework or Spark. See https://hadoop.apache.org
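To show what the MapReduce pattern means for this particular problem, here is a toy single-process sketch in plain Python. It does not use the Hadoop or Spark APIs; the names map_phase, shuffle_phase and reduce_phase are made up for this illustration. A real framework would run the same three phases distributed across many machines.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit one (key, value) pair per well-formed input line."""
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            yield parts[0], parts[1]


def shuffle_phase(pairs):
    """Shuffle: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: join one key's values into a single output line."""
    return key + " " + ",".join(values)


# Small in-memory sample; a framework would read from distributed storage.
sample = ["aa kk", "bb mm", "aa xx", "bb ss"]
groups = shuffle_phase(map_phase(sample))
result = [reduce_phase(k, groups[k]) for k in sorted(groups)]
print("\n".join(result))
```

Each key can be reduced independently of the others, which is what lets a framework spread the work over many workers.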