
Sorting huge files with millions of lines

I have tens of millions of strings in a text file, like these:

aa kk
bb mm
cc tt
ee ff
aa xx
bb ss
cc gg
ee rr

And I want to make them look like this:

aa kk,xx
bb mm,ss
cc tt,gg
ee ff,rr

I have tried to sort and rearrange it with grep, sed and other tools, but this seems very slow on really huge files, even with

LC_ALL=C grep something

I'm not clear whether you specifically want to do this with just standard shell tools, but Python is nearly universal on Linux these days. It can be done with a fairly simple program:

#!/usr/bin/env python3

import sys

# Map each key (first column) to the list of values seen for it.
data = {}
for line in sys.stdin:
    key, value = line.split()
    data.setdefault(key, []).append(value)

# Print one line per key, in sorted key order.
for key in sorted(data):
    print(key, ",".join(data[key]))

I ran it on 50,000,000 lines of data generated by the following C program, and it finished in about 60 seconds on my years-old laptop:

#include <stdio.h>
#include <stdlib.h>

/* Return a random lowercase letter, 'a' to 'z'. */
char letter(void) { return (char)('a' + rand() % 26); }

int main(void)
{
  int i;
  /* Write 50,000,000 lines of the form "xxx yyy". */
  for (i = 0; i < 50000000; i++)
    printf("%c%c%c %c%c%c\n",
           letter(), letter(), letter(),
           letter(), letter(), letter());
  return 0;
}

awk '{if(b[$1])b[$1] = b[$1]","; b[$1] = b[$1] $2 $3}; END{for(i in b)print i, b[i]}' file

Output:

aa kk,xx
bb mm,ss
cc tt,gg
ee ff,rr

Source: https://stackoverflow.com/a/26450166/3776858

For better performance and lower memory use:

sort -u YourFile | awk '{if (Last == $1) {Linked=Linked","$2} else { if (Last != "") print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'

The sort first removes duplicate lines and puts the data in order, which lets awk read the input line by line instead of loading a huge array (given the millions of lines you mention). The awk script concatenates values while the key is the same as on the previous line, and prints the group when the key changes. The END block handles the last group, and the if (Last != "") check handles the first line.

Maybe a bit faster:

sort -u YourFile | awk 'FNR==1{Last=$1;Linked=$2} FNR>1{if (Last == $1) {Linked=Linked","$2} else { print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'
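
For comparison, the same sort-then-stream idea can be sketched in Python. This is only an illustration, not part of the answer above: the function name merge_sorted_pairs and the sample data are made up, and the input is assumed to be already sorted by key (for example by sort -u), with exactly two whitespace-separated fields per line. Only one group is held in memory at a time.

```python
import itertools


def merge_sorted_pairs(lines):
    """Yield 'key v1,v2,...' for lines that are already sorted by key.

    Hypothetical helper for illustration; lines without exactly two
    fields are skipped.
    """
    pairs = (line.split() for line in lines)
    pairs = (p for p in pairs if len(p) == 2)
    # groupby only merges adjacent equal keys, which is why the input
    # must be sorted first.
    for key, group in itertools.groupby(pairs, key=lambda p: p[0]):
        yield key + " " + ",".join(value for _, value in group)


# Small pre-sorted sample, standing in for the output of sort.
sample = ["aa kk\n", "aa xx\n", "bb mm\n", "bb ss\n"]
for merged in merge_sorted_pairs(sample):
    print(merged)
```

To use it in a pipeline, the sample list could be replaced with sys.stdin, so that the script reads whatever sort writes to it.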

If you have to process very large data sets, I suggest using the MapReduce pattern, for example with the Hadoop framework or Spark. See https://hadoop.apache.org
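
To make the pattern concrete, here is a toy sketch of the three MapReduce stages applied to this problem, in plain Python on a single machine. It does not use the Hadoop or Spark APIs; the function names map_phase, shuffle and reduce_phase are made up for illustration, and in a real framework the shuffle step would be done by the framework across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into a (key, value) pair."""
    key, value = line.split()
    return key, value


def shuffle(pairs):
    """Shuffle: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: join one key's values into a single output line."""
    return key + " " + ",".join(values)


# Tiny illustrative input; a real job would read from distributed storage.
lines = ["aa kk", "bb mm", "aa xx", "bb ss"]
grouped = shuffle(map_phase(line) for line in lines)
result = [reduce_phase(key, grouped[key]) for key in sorted(grouped)]
print("\n".join(result))
```

Because each key is reduced independently, the reduce step can run in parallel on different keys, which is what makes the pattern attractive for very large inputs.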
