
How can I parse, sort, and print a 90MB JSON file with 100,000 records to CSV?

Background

I'm trying to complete a code challenge that requires refactoring a simple PHP application which accepts a JSON file of people, sorts them by registration date, and outputs them to a CSV file. The provided program works fine with a small input but intentionally fails with a large one. To complete the challenge, the program must be modified so it can parse and sort a 100,000-record, 90MB file without running out of memory, as it currently does.

In its current state, the program uses file_get_contents(), followed by json_decode(), and then usort() to sort the items. This works fine with the small sample data file, but not with the large one - it runs out of memory.

The input file

The file is in JSON format and contains 100,000 objects. Each object has a registered attribute (example value: 2017-12-25 04:55:33), and that is the field the records in the CSV file should be sorted by, in ascending order.
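A record presumably looks something like the following; only the registered attribute is given in the question, so the name field is an assumption for illustration:

```json
{
    "name": "Example Person",
    "registered": "2017-12-25 04:55:33"
}
```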

My attempted solution

Currently, I'm using the halaxa/json-machine package, which lets me iterate over each object in the file. For example:

$people = \JsonMachine\JsonMachine::fromFile($fileName);
foreach ($people as $person) {
    // do something
}

Reading the whole file into memory as a PHP array is not an option - it takes up too much memory. The only solution I've come up with so far is to iterate over each object in the file, find the person with the earliest registration date, and print that; then iterate over the whole file again to find the person with the next-earliest registration date, print that, and so on.

The big issue with that is the nested loops: a loop that runs 100,000 times containing another loop that runs 100,000 times. That's not a viable solution, and it's as far as I've gotten.
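For reference, the repeated-scan idea above can be sketched as follows, using an in-memory array purely for illustration (in the real challenge, each pass would re-read the file with json-machine instead of looping over an array):

```php
<?php

// Sketch of the quadratic approach described above: one pass per output
// row, each pass scanning every remaining record to find the earliest
// registration date -- O(n^2) comparisons overall.
function sortByRepeatedScans(array $people): array
{
    $done = array_fill(0, count($people), false);
    $out = [];

    for ($i = 0; $i < count($people); $i++) {    // one pass per record
        $best = null;
        foreach ($people as $j => $person) {     // full scan every pass
            if (!$done[$j] && ($best === null
                    || $person['registered'] < $people[$best]['registered'])) {
                $best = $j;
            }
        }
        $done[$best] = true;
        $out[] = $people[$best];
    }

    return $out;
}
```

Plain string comparison is enough here because 'Y-m-d H:i:s' timestamps sort lexicographically in chronological order.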

How can I parse, sort, and print to CSV a JSON file with 100,000 records? Using packages/services is allowed.

I ended up importing the records into MongoDB in chunks and then retrieving them in the correct order to print.

Example import:

$this->collection = (new Client($uri))->collection->people; // database "collection", collection "people"
$this->collection->drop(); // Start from an empty collection

$people = JsonMachine::fromFile($fileName);

$chunk = [];
$chunkSize = 5000;
$personNumber = 0;
foreach ($people as $person) {
    $personNumber += 1;
    $chunk[] = $person;
    if ($personNumber % $chunkSize == 0) { // Chunk is full
        $this->collection->insertMany($chunk);
        $chunk = [];
    }
}
// The last chunk is usually not full, but it still needs to be imported
if (count($chunk) > 0) {
    $this->collection->insertMany($chunk);
}
// Create an index for quicker sorting
$this->collection->createIndex([ 'registered' => 1 ]);

Example retrieve:

$results = $this->collection->find([],
    [
        'sort' => ['registered' => 1],
    ]
);

// For every person...
foreach ($results as $person) {
    // For every attribute...
    foreach ($person as $key => $value) {
        if ($key !== '_id') { // No need to include the new MongoDB ID
            echo some_csv_encode_function($value) . ',';
        }
    }
    echo PHP_EOL;
}
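some_csv_encode_function above is a placeholder; PHP's built-in fputcsv() already handles quoting and escaping of field values. A minimal sketch under that assumption (writeCsv is a hypothetical helper; $results stands for the cursor returned by find() above):

```php
<?php

// Write rows to a stream as CSV using PHP's built-in fputcsv(),
// which quotes and escapes field values containing commas, quotes,
// or newlines.
function writeCsv(iterable $rows, $stream): void
{
    foreach ($rows as $person) {
        $person = (array) $person;  // a MongoDB BSONDocument casts to an
                                    // array; plain arrays pass through
        unset($person['_id']);      // No need to include the new MongoDB ID
        fputcsv($stream, array_values($person));
    }
}
```

For example, writeCsv($results, STDOUT). Unlike the manual echo loop, this also avoids leaving a trailing comma at the end of every row.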
