Split a large flat file by first two characters on each line

I have a large (~10 GB) comma-delimited file. Each row starts with a two-character code that identifies the row type, since each row is a different type of event. Currently I read the file into R, use a regex to split it into different pieces based on that code, and then write the resulting objects out to flat files.

I'm curious whether there is a more direct way to do this (read a row, determine its type, and append it to the appropriate flat file; there will be 7 in total) in Python, bash, sed/awk, etc.

Data looks like this:

01,tim@bigcompany.com,20140101120000,campaign1
02,201420140101123000,123321,Xjq12090,TX
02,201420140101123000,123321,Xjq12090,AK
...

Any suggestions would be appreciated.

Using awk you can do:

awk -F, '{fn=$1 ".txt"; print > fn}' file

If you want to keep it clean by closing all file handles at the end, use this awk:

awk -F, '!($1 in files){files[$1]=$1 ".txt"} {print > files[$1]}
    END {for (f in files) close(files[f])}' file

If you don't care about performance, or trust your OS/filesystem/drive's disk caching:

with open('hugedata.txt') as infile:
    for line in infile:
        with open(line[:2] + '.txt', 'a') as outfile:
            outfile.write(line)

However, constantly reopening and reclosing (and therefore flushing) the files means you never get the benefit of buffering, and there's only so much a disk cache can do to make up for that, so you might want to consider pre-opening all the files. Since there are only 7 of them, that's pretty easy:

files = {'{:02}'.format(i): open('{:02}.txt'.format(i), 'w') for i in range(1, 8)}
try:
    with open('hugedata.txt') as infile:
        for line in infile:
            files[line[:2]].write(line)
finally:
    for f in files.values():
        f.close()

Or, more robustly:

files = {}
try:
    with open('hugedata.txt') as infile:
        for line in infile:
            key = line[:2]
            # open each output file lazily, the first time its prefix appears
            if key not in files:
                files[key] = open(key + '.txt', 'w')
            files[key].write(line)
finally:
    for f in files.values():
        f.close()

(You can write a with statement that does the closing automatically, but it will look different in different Python versions; this version is a bit clunky, but it works with everything from 2.4 to 3.5, and probably beyond, and since you haven't told us your platform or Python version, it seemed safer.)
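
For instance, on Python 3.3+ one way to write such a with statement is with contextlib.ExitStack; a minimal sketch, assuming the same 01-07 file names as in the snippet above:

from contextlib import ExitStack

# Sketch for Python 3.3+: ExitStack closes every file registered via
# enter_context() when the block exits, even if an exception is raised.
with ExitStack() as stack, open('hugedata.txt') as infile:
    files = {'{:02}'.format(i): stack.enter_context(open('{:02}.txt'.format(i), 'w'))
             for i in range(1, 8)}
    for line in infile:
        files[line[:2]].write(line)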

How about something like this in Python:

for line in open('hugedata.txt'):
    with open(line[:2] + '.txt', 'a') as fh:
        fh.write(line)

I would do something like this:

grep '^01' your-10gb-file > 01.csv

You can then wrap this inside a foreach (for tcsh) like this:

foreach n ( `seq -f '%02g' 7` )
    grep "^$n" your-10gb-file > $n.csv
end

from itertools import groupby

with open("largefile.txt") as f:
    # groupby batches consecutive lines that share the same two-character prefix;
    # appending ("a") keeps earlier batches if a prefix reappears later in the file
    for k, v in groupby(f, lambda x: x[:2]):
        with open("{}.txt".format(k), "a") as f1:
            f1.writelines(v)
