Sort a big file with millions of lines

Question

I have tens of millions of strings in a text file, like these:

aa kk
bb mm
cc tt
ee ff
aa xx
bb ss
cc gg
ee rr

And I want to make them look like this:

aa kk,xx
bb mm,ss
cc tt,gg
ee ff,rr

I have tried to sort and rearrange it with grep, sed and other tools, but that looks like a very slow approach on really huge files, even with

LC_ALL=C grep something

Answer 1

I'm not clear whether you specifically want to do this with just standard shell tools, but Python is nearly universal on Linux these days. It can be done with a fairly simple program:

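#!/usr/bin/env python3

import sys

# Map each key (first column) to the list of its values (second column).
data = {}
for line in sys.stdin:
    key, value = line.split()
    data.setdefault(key, []).append(value)

# Print one line per key, in sorted key order, with the values comma-joined.
for key in sorted(data):
    print(key, ",".join(data[key]))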

I ran it on 50,000,000 lines of data generated by the following C program, and it finished in about 60 seconds on my years-old laptop:

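#include <stdio.h>
#include <stdlib.h>

/* Return a pseudo-random lowercase letter 'a'..'z'. */
static char letter(void) { return (char)('a' + rand() % 26); }

int main(void)
{
  int i;
  /* Emit 50,000,000 lines of the form "xxx yyy". */
  for (i = 0; i < 50000000; i++)
    printf("%c%c%c %c%c%c\n",
           letter(), letter(), letter(),
           letter(), letter(), letter());
  return 0;
}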

Answer 2

awk '{if(b[$1])b[$1] = b[$1]","; b[$1] = b[$1] $2 $3}; END{for(i in b)print i, b[i]}' file

Output:

aa kk,xx
bb mm,ss
cc tt,gg
ee ff,rr

Source: https://stackoverflow.com/a/26450166/3776858

Answer 3

For better performance and lower memory use:

sort -u YourFile | awk '{if (Last == $1) {Linked=Linked","$2} else { if (Last != "") print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'

Sorting first reduces the input (sort -u drops duplicate lines) and puts the lines in order, which lets awk read line by line instead of loading a huge array (because of the millions of lines you mention). The awk script concatenates while the key is the same as on the previous line and prints the group when the key changes. The END block handles the last group, and the if handles the first line.

Maybe a bit faster:

sort -u YourFile | awk 'FNR==1{Last=$1;Linked=$2} FNR>1{if (Last == $1) {Linked=Linked","$2} else { print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'
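The streaming idea above (group consecutive lines of already sorted input, keeping only one group in memory at a time) can also be sketched in Python. This is only an illustration, not part of the original answer: the function name merge_sorted_pairs is made up here, and the sketch assumes every non-blank line has exactly two whitespace-separated fields and that the input is already sorted by the first field, for example by sort -u.

```python
import itertools


def merge_sorted_pairs(lines):
    """Yield 'key v1,v2,...' for each run of consecutive lines sharing a key.

    Assumes the lines are already sorted by key (e.g. output of sort -u),
    so only one group is held in memory at a time.
    """
    # Lazily split each non-blank line into [key, value].
    pairs = (line.split() for line in lines if line.strip())
    for key, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        yield key + " " + ",".join(pair[1] for pair in group)


if __name__ == "__main__":
    sample = ["aa kk\n", "aa xx\n", "bb mm\n", "bb ss\n"]
    for merged in merge_sorted_pairs(sample):
        print(merged)
```

To use it in a pipeline, the same function could be fed sys.stdin instead of the sample list, since it only needs an iterable of lines.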

Answer 4

If you have to process very large data sets, I suggest using the MapReduce pattern, for example with the Hadoop framework or Spark. Take a look at https://hadoop.apache.org
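To make the pattern concrete, here is a rough single-process sketch of the map, shuffle and reduce steps applied to this problem. It is plain Python and does not use any Hadoop or Spark API. The function names (map_phase, shuffle_phase, reduce_phase, run_job) are hypothetical, and in a real framework the shuffle step is what gets distributed across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input line into a (key, value) pair."""
    key, value = line.split()
    return key, value


def shuffle_phase(pairs):
    """Shuffle: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: join the values of one key into a single output line."""
    return key + " " + ",".join(values)


def run_job(lines):
    """Run the three phases in order and return the output lines sorted by key."""
    groups = shuffle_phase(map_phase(line) for line in lines)
    return [reduce_phase(key, groups[key]) for key in sorted(groups)]


if __name__ == "__main__":
    for out in run_job(["aa kk", "bb mm", "aa xx", "bb ss"]):
        print(out)
```

The point of the split is that map_phase and reduce_phase each look at one line or one key at a time, which is what allows a framework to run them in parallel on separate chunks of the data.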

Answer 1
1 2015-06-07 19:58:18

Answer 2
1 2015-06-07 20:17:51

Answer 3
1 2015-06-08 06:58:29

Answer 4
0 2015-06-07 19:37:27
