Split a large flat file by first two characters on each line

I have a large (~10 GB) comma-delimited file. Each row starts with a two-character code that identifies the row type, since each row is a different type of event. Currently I read the file into R, use a regex to split it into different pieces based on that code, and then write the resulting objects out to flat files.

I'm curious whether there is a more direct way to do this (read a row, determine its type, and append it to the appropriate flat file; there will be 7 in total) in Python, bash, sed/awk, etc.

Data looks like this:

01,tim@bigcompany.com,20140101120000,campaign1
02,201420140101123000,123321,Xjq12090,TX
02,201420140101123000,123321,Xjq12090,AK
...

Any suggestions would be appreciated.

Using awk you can do:

awk -F, '{fn=$1 ".txt"; print > fn}' file

If you want to keep it clean by closing all file handles at the end, use this awk:

awk -F, '!($1 in files){files[$1]=$1 ".txt"} {print > files[$1]}
    END {for (f in files) close(files[f])}' file

If you don't care about performance, or trust your OS/filesystem/drive's disk caching:

with open('hugedata.txt') as infile:
    for line in infile:
        with open(line[:2] + '.txt', 'a') as outfile:
            outfile.write(line)

However, constantly reopening and reclosing (and therefore flushing) the files means you never get the benefit of buffering, and there's only so much a disk cache can do to make up for that, so you might want to consider pre-opening all the files. Since there are only 7 of them, that's pretty easy:

files = {'{:02}'.format(i): open('{:02}.txt'.format(i), 'w') for i in range(1, 8)}
try:
    with open('hugedata.txt') as infile:
        for line in infile:
            files[line[:2]].write(line)
finally:
    for f in files.values():
        f.close()

Or, more robustly:

files = {}
try:
    with open('hugedata.txt') as infile:
        for line in infile:
            key = line[:2]
            # open each output file lazily, the first time its prefix appears
            if key not in files:
                files[key] = open(key + '.txt', 'w')
            files[key].write(line)
finally:
    for f in files.values():
        f.close()

(You can write a with statement that does the closing automatically, but it will look different in different Python versions; this version is a bit clunky, but it works with everything from 2.4 to 3.5, and probably beyond, and since you haven't told us your platform or Python version, it seemed safer.)
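
For instance, on Python 3.3+ one way to write such a with statement is with contextlib.ExitStack; a minimal sketch, assuming the same 01-07 file names as in the snippet above:

from contextlib import ExitStack

# Sketch for Python 3.3+: ExitStack closes every file registered via
# enter_context() when the block exits, even if an exception is raised.
with ExitStack() as stack, open('hugedata.txt') as infile:
    files = {'{:02}'.format(i): stack.enter_context(open('{:02}.txt'.format(i), 'w'))
             for i in range(1, 8)}
    for line in infile:
        files[line[:2]].write(line)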

How about something like this in Python:

for line in open('hugedata.txt'):
    with open(line[:2] + '.txt', 'a') as fh:
        fh.write(line)

I would do something like this:

grep '^01' your-10gb-file > 01.csv

You can then wrap this inside a foreach (for tcsh) like this:

foreach n ( `seq -f '%02g' 7` )
    grep "^$n" your-10gb-file > $n.csv
end

from itertools import groupby

with open("largefile.txt") as f:
    # groupby batches consecutive lines that share the same two-character prefix;
    # appending ("a") keeps earlier batches if a prefix reappears later in the file
    for k, v in groupby(f, lambda x: x[:2]):
        with open("{}.txt".format(k), "a") as f1:
            f1.writelines(v)
