
How to split a huge csv file based on content of first column?

  • I have a huge (250MB+) csv file to upload
  • The file format is group_id, application_id, reading and the data could look like:

1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....
n, x, 0.3 (let's say)

  • I want to divide the file based on group_id, so the output should be n files, one file per group_id

Output

File 1:

1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9

and

File 2:

2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
.....

and

File n:

n, x, 0.3 (let's say)

How can I do this effectively?

awk can handle this:

 awk -F "," '{print $0 >> ("FILE" $1)}' HUGE.csv

If the file is already sorted by group_id, you can do something like:

import csv
from itertools import groupby

for key, rows in groupby(csv.reader(open("foo.csv")),
                         lambda row: row[0]):
    with open("%s.txt" % key, "w") as output:
        for row in rows:
            output.write(",".join(row) + "\n")

Sed one-liner:

sed -e '/^1,/wFile1' -e '/^2,/wFile2' -e '/^3,/wFile3' ... OriginalFile 

The only down-side is that you need to put in n -e statements (represented by the ellipsis, which shouldn't appear in the final version). So this one-liner might be a pretty long line.

The upsides, though, are that it only makes one pass through the file, no sorting is assumed, and no python is needed. Plus, it's a one-freaking-liner!

If the rows are sorted by group_id, then itertools.groupby would be useful here. Because it's an iterator, you won't have to load the whole file into memory; you can still write each file line by line. Use csv to load the file (in case you didn't already know about it).

If they are sorted by the group id, you can use the csv module to iterate over the rows in the file and write each one out. You can find information about the module in the Python documentation for csv.
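
A minimal sketch of that idea, assuming the input is the foo.csv used above, is already sorted by the first column, and should produce one <group_id>.txt file per group as in the earlier groupby answer:

import csv
from itertools import groupby
from operator import itemgetter

# Stream the sorted file group by group and write each group with csv.writer;
# rows are consumed lazily, so the whole file never sits in memory at once.
with open("foo.csv", newline="") as src:
    for key, rows in groupby(csv.reader(src), key=itemgetter(0)):
        with open(f"{key}.txt", "w", newline="") as dst:
            csv.writer(dst).writerows(rows)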

How about:

  • Read the input file a line at a time
  • split() each line on , to get the group_id
  • For each new group_id you find, open an output file
    • add each group_id to a set/dict as you find them so you can keep track
  • write the line to the appropriate file
  • Done! (A rough sketch of this approach follows below.)
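
A minimal sketch of those steps, borrowing the huge.csv input name and the FILE<group_id> output naming from the other answers; a dict keeps one open handle per group_id so each output file is opened only once:

out_files = {}  # group_id -> open file handle

with open("huge.csv") as in_file:
    for line in in_file:
        # the group_id is everything before the first comma
        group_id = line.split(",", 1)[0].strip()
        if group_id not in out_files:
            # first time this group_id is seen: open its output file
            out_files[group_id] = open("FILE" + group_id, "w")
        out_files[group_id].write(line)

# close everything we opened
for f in out_files.values():
    f.close()

One caveat: with a very large number of distinct group_id values this can hit the operating system's limit on open files, in which case the files would have to be closed after each write and reopened in append mode instead.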

Here's some food for thought for you:

import csv
from collections import namedtuple

csvfile = namedtuple('csvfile', ('file', 'writer'))

class CSVFileCollections(object):

    def __init__(self, prefix, postfix):
        self.prefix = prefix
        self.postfix = postfix
        self.files = {}

    def __getitem__(self, item):
        if item not in self.files:
            # open one output file per group_id the first time it is seen
            file = open(self.prefix + str(item) + self.postfix, 'w', newline='')
            writer = csv.writer(file, delimiter=',', quotechar="'", quoting=csv.QUOTE_MINIMAL)
            self.files[item] = csvfile(file, writer)
        return self.files[item].writer

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        for f in self.files.values():
            f.file.close()


with open('huge.csv', newline='') as readFile, CSVFileCollections('output', '.csv') as output:
    reader = csv.reader(readFile, delimiter=",", quotechar="'")
    for row in reader:
        writer = output[row[0]]
        writer.writerow(row)

Here's a solution that works with the IDs sorted or unsorted. The only overhead for the unsorted version is opening the destination (group ID) CSV multiple times:

import csv

reader = csv.reader(open("test.csv", newline=""))

prev_id = None
out_file = None
writer = None

for row in reader:
    this_id = row[0]

    if this_id != prev_id:
        if out_file is not None:
            out_file.close()

        fname = f"file_{this_id}.csv"
        out_file = open(fname, "a", newline="")
        writer = csv.writer(out_file)
        prev_id = this_id

    writer.writerow(row)

# close the last output file we had open
if out_file is not None:
    out_file.close()

Here's test input, but now with 1 & 2 interleaved:

1, a1, 0.1
2, b1, 0.1
1, a1, 0.2
2, b1, 0.2
1, a1, 0.4
2, b1, 0.4
1, a1, 0.3
2, b1, 0.3
1, a1, 0.0
2, b1, 0.0
1, a1, 0.9
2, b1, 0.9

When I run it I see:

./main.py
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...

and my output files look like:

1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9

and

2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9

I also created a fake BIG file, 289MB, with 100 ID groups (250_000 rows per ID), and my solution ran in about 12 seconds. For comparison, the accepted answer that uses groupby() runs in about 10 seconds on the big CSV; the high-rated awk script runs in about a minute.
