
PHP: create multiple JSON files of a given size from one array

I want to create multiple JSON files (file1.json, file2.json, etc.) from one array, and each file must have a maximum size of 5 MB.

I have this kind of array:

    array (
        0 => array (
            'category' => '179535',
            'email' => NULL,
            'level' => 1,
            'name' => 'FOO'
        ),
        1 => array (
            'category' => '1795',
            'email' => NULL,
            'level' => 1,
            'name' => 'BARFOO'
        ),
        2 => array (
            'category' => '16985',
            'email' => NULL,
            'level' => 1,
            'name' => 'FOOBAR'
        ),
        ....
        25500 => array (
            'category' => '10055',
            'email' => NULL,
            'level' => 1,
            'name' => 'FOOBARBAR'
        )
    )

If I write it to a file with json_encode($arr), the resulting file will be approximately 85 MB. So how can I split this array to get a maximum of 5 MB per file?

The most performance-friendly option, assuming your data is reasonably symmetric, would be to simply use array_chunk() to cut your array into chunks that, when passed through json_encode(), will be approximately the expected size. Let's look at a sampling from your array:

string(58) "{"category":"1795","email":null,"level":1,"name":"BARFOO"}"

The "name" here seems to be the only field likely to vary more significantly. Averaging it at 12 characters, you'd have a string length of roughly 64 bytes per item, so you could fit 78125 of those into 5 MB. To keep it under the mark, let's make it 75000. Then $chunks = array_chunk($data, 75000) would give you chunks that land around or a bit under the 5 MB mark.
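A minimal sketch of that approach (assuming the 75000-items-per-chunk estimate above holds for your data; tune the count to your actual per-item size):

    $chunks = array_chunk($data, 75000);
    foreach ($chunks as $i => $chunk) {
        // file numbering starts at 1: file1.json, file2.json, ...
        file_put_contents('file' . ($i + 1) . '.json', json_encode($chunk));
    }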

Now, if you want to be more precise, and if the size really matters, we can do:

$size = 0; // size counter
$chunkno = 1; // chunk number
$maxbytes = 50000; // 50000-byte chunks
$chunks = []; // for array chunks

foreach($data as $set) {
    // if over the limit, move on to next chunk
    if ($size > $maxbytes) { 
        $size = 0;
        $chunkno++;
    }
    $size += strlen(json_encode($set)) + 1; // add a comma's length!
    $chunks[$chunkno][] = $set;
}
// unset($data); // in case you have memory concerns

Here we're obviously doing double duty with json_encode, but chunk size will not be impacted by variance in your source data. I ran the test script above for 50000-byte chunks; you'll want 5000000 instead for your use case. The dummy data I generated split into neat 50 kB chunks, each within plus or minus the size of one set, plus the remainder in the last file.

While mulling over this, I also played with the thought of doing strlen(implode(...)) instead, but given the generally great performance of PHP's json_encode, there shouldn't be much of a penalty there, and in exchange you get the exact JSON string size.
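For illustration, such a rough estimate could look like the line below (a sketch only: it undercounts, since quotes, braces, key names and escaping are not included in the sum):

    // hypothetical replacement for the json_encode() line inside the sizing loop
    $size += strlen(implode('', array_map('strval', $set))) + 1;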

In any case, once the chunks are ready, all we need to do is write 'em up:

foreach($chunks as $n => $chunk) {
    $json = json_encode($chunk);
    file_put_contents("tmp/chunk_{$n}.json", $json);
}

... or matching whatever your chunk naming and directory schema may be.

Perhaps there are more clever ways of doing this. That said, as far as I'm aware, nothing in core PHP will do this sort of operation out of the box (even for vanilla arrays), and the above should perform reasonably well. Remember to have enough memory available. :)

PS In calculating the size, we add +1 for each item, standing for the commas in {},{},{}, i.e. the delimiters between the objects. Strictly speaking, you'd also want to add +2 to the grand total, because the file will be [{},{},{}], while we're only counting the length of each array item as a separate JSON object. With other data structures, your compensation mileage may vary.
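A quick sanity check of that arithmetic, using two items from the data above:

    $chunk = [
        ['category' => '1795',  'email' => null, 'level' => 1, 'name' => 'BARFOO'],
        ['category' => '16985', 'email' => null, 'level' => 1, 'name' => 'FOOBAR'],
    ];
    $counted = 0;
    foreach ($chunk as $set) {
        $counted += strlen(json_encode($set)) + 1; // per-item length + one comma
    }
    // n items need only n-1 commas, but "[" and "]" add 2 bytes, hence +1 overall:
    var_dump(strlen(json_encode($chunk)) === $counted + 1); // bool(true)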


Optimization Update: If you choose the "exact size" approach and want to optimize memory usage, you're better off integrating the JSON commit into the chunking loop. (Thanks @NigelRen for the suggestion.) As follows (other initial variables as before):

$chunk = [];
foreach($data as $n => $set) {
    if ($size > $maxbytes) {
        file_put_contents("tmp/chunk_{$chunkno}.json", json_encode($chunk));
        $chunk = [];
        $chunkno++;
        $size = 0;
    }
    $size += strlen(json_encode($set)) + 1;
    $chunk[] = $set;
    //  unset($data[$n]); // in case of memory issues, see notes
}
// don't forget to commit the final partial chunk
if ($chunk) {
    file_put_contents("tmp/chunk_{$chunkno}.json", json_encode($chunk));
}

In case you're curious about the impact: with this approach, memory usage comes to 1.06 MB used, 29.34 MB max. With the separate write routine, 26.29 MB used, 31.8 MB max. Both figures include the unset($data) call, nixing the initial array and freeing up memory. CPU-wise, there's no significant difference between the two options.
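(Figures like these can be gathered with PHP's built-in memory probes; a minimal sketch, not the exact benchmark harness used here:)

    // report current and peak memory usage after the run; 1048576 bytes = 1 MB
    printf(
        "%.2f MB used, %.2f MB max\n",
        memory_get_usage() / 1048576,
        memory_get_peak_usage() / 1048576
    );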

One could also purge members of the $data array each time after adding to $chunk[]; however, at a 5 MB chunk size the memory benefit is negligible. It's the loading/defining of the initial array itself that's expensive, being the major factor in the max memory usage figure. (The test array I used occupied 29.25 MB before any processing began.)

You can get the strlen in bytes and do your calculations from there:

$total_size    = strlen(json_encode($array)) / 1024 / 1024; // total size in MB
$file_count    = ceil($total_size / 5);                     // number of ~5 MB files needed
$chunk_size    = ceil(count($array) / $file_count);         // items per file
$chunked_array = array_chunk($array, $chunk_size);

foreach($chunked_array as $key => $chunk) {
    $i = $key + 1;
    file_put_contents("file{$i}.json", json_encode($chunk));
}
  • Get the total size in bytes of the JSON-encoded array and convert it to MB
  • Divide that total size by 5 MB to get the number of files needed
  • Divide the item count by the number of files to get the chunk size (items per file)
  • Chunk the array, then loop, JSON-encode each chunk and write it to its file

Or, to keep the calculation in bytes, you could do:

$total_size = strlen(json_encode($array));           // total size in bytes
$file_count = ceil($total_size / (5 * 1024 * 1024)); // number of ~5 MB files
$chunk_size = ceil(count($array) / $file_count);     // items per file

Let's assume that each item has the same structure; thus:

    25500 items ~= 85 MB
    85 MB / 5 MB = 17 files
    25500 / 17 = 1500 items per file
    1500 items ~= 5 MB

so the code can be something like this:

foreach(array_chunk($array, 1500) as $i => $arr){
    // save each 1500-item chunk to its own file
    file_put_contents('file' . ($i + 1) . '.json', json_encode($arr));
}

Please try this workaround:

<?php
    $array = array (
        0 => array (
            'category' => '179535',
            'email' => NULL,
            'level' => 1,
            'name' => 'FOO'
        ),
        1 => array (
            'category' => '1795',
            'email' => NULL,
            'level' => 1,
            'name' => 'BARFOO'
        ),
        2 => array (
            'category' => '16985',
            'email' => NULL,
            'level' => 1,
            'name' => 'FOOBAR'
        )
    );

    $len = sizeof($array);
    $fileNameIndex = 1;
    for($i = 0; $i < $len; $i++)
    {
        $fileName = 'file'.$fileNameIndex.'.json';
        $fileExist = file_exists($fileName);
        $fileSize = 0;
        $current = null;
        if($fileExist)
        {
            $fileSize = filesize($fileName);
            $current = json_decode(file_get_contents($fileName), true);
        }
        if($fileExist && $fileSize < 5242880) // current file is still under 5 MB
        {
            WriteToFile($fileNameIndex, $current, $array[$i], $i);
        }
        else if(!$fileExist)
        {
            WriteToFile($fileNameIndex, $current, $array[$i], $i);
        }
        else // current file is full: start a fresh file, without carrying over its contents
        {
            $fileNameIndex++;
            WriteToFile($fileNameIndex, null, $array[$i], $i);
        }
    }

    function WriteToFile($fileNameIndex, $current, $data, $i)
    {
        $fileName = 'file'.$fileNameIndex.'.json';
        echo "$i index array is being written in $fileName. <br/>";
        $fp = fopen($fileName, 'w'); // 'w' truncates; the whole file is rewritten each time
        if(!$current)
        {
            $current = [];
        }
        array_push($current, $data);
        fwrite($fp, json_encode($current));
        fclose($fp);
    }
?>
