
PHP: create multiple JSON files of a given size from one array

I want to create multiple JSON files (file1.json, file2.json, etc.) from one array, and each file must have a maximum size of 5 MB.

I have an array of this kind:

    array (
        0 => array (
            'category' => '179535',
            'email' => NULL,
            'level' => 1,
            'name' => 'FOO'
        ),
        1 => array (
            'category' => '1795',
            'email' => NULL,
            'level' => 1,
            'name' => 'BARFOO'
        ),
        2 => array (
            'category' => '16985',
            'email' => NULL,
            'level' => 1,
            'name' => 'FOOBAR'
        ),
        // ...
        25500 => array (
            'category' => '10055',
            'email' => NULL,
            'level' => 1,
            'name' => 'FOOBARBAR'
        )
    )

If I write it to a single file with json_encode($arr), the resulting file is approximately 85 MB. So how can I split this array so that each file is at most 5 MB?

The most performance-friendly option, assuming your data is reasonably symmetric, would be to simply use array_chunk() to cut your array into chunks that, when json_encode'd, will be approximately the expected size. Let's look at a sample from your array:

string(58) "{"category":"1795","email":null,"level":1,"name":"BARFOO"}"

The "name" here seems to be the only field likely to vary significantly. Averaging it at 12 characters, you get a string length of about 64 bytes per item, so roughly 78125 of them would fit into 5 MB. To stay under the mark, make it 75000: $chunks = array_chunk($data, 75000) would then give you chunks that land around, or a bit under, the 5 MB mark.
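As a quick sketch of that estimation (the sample item is from the question; the 0.96 safety margin is an assumption, tune it to your data's name-length variance):

```php
<?php
// Estimate items per ~5MB file from one encoded sample item, then chunk.
$data = array_fill(0, 1000, [
    'category' => '1795', 'email' => null, 'level' => 1, 'name' => 'BARFOO',
]);

$perItem  = strlen(json_encode($data[0])) + 1;        // 58 bytes + 1 for the comma
$target   = 5 * 1000 * 1000;                          // ~5 MB per file
$perChunk = (int) floor(($target / $perItem) * 0.96); // margin for longer names

$chunks = array_chunk($data, $perChunk);              // here: one chunk of 1000 items
```

With the 58-byte sample item this yields a chunk size in the same ballpark as the 75000 figure above.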

Now, if you want to be more precise, and if the size really matters, we can do:

$size = 0; // size counter
$chunkno = 1; // chunk number
$maxbytes = 50000; // 50000-byte chunks
$chunks = []; // for array chunks

foreach($data as $set) {
    // if over the limit, move on to next chunk
    if ($size > $maxbytes) { 
        $size = 0;
        $chunkno++;
    }
    $size += strlen(json_encode($set)) + 1; // add a comma's length!
    $chunks[$chunkno][] = $set;
}
// unset($data); // in case you have memory concerns

Here we're obviously doing double duty with json_encode, but chunk size will not be impacted by variance in your source data. I ran the test script above with 50000-byte chunks; you'll want 5000000 instead for your use case. The dummy data I generated split into neat 50 KB chunks, max. +/- the size of one set, plus the remainder in the last file.

While mulling this over, I also toyed with the thought of using strlen(implode(...)) instead, but given the generally great performance of PHP's json_encode, there shouldn't be much of a penalty there, in trade for getting the exact JSON string size.
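For reference, a sketch of that implode-based measurement; for a plain list of items it matches the size of the full encode exactly, once the two enclosing brackets are counted:

```php
<?php
// Measure a chunk's exact JSON size without encoding it as one document.
$chunk = [
    ['category' => '1795',  'email' => null, 'level' => 1, 'name' => 'BARFOO'],
    ['category' => '16985', 'email' => null, 'level' => 1, 'name' => 'FOOBAR'],
];
$body  = implode(',', array_map('json_encode', $chunk));
$exact = strlen($body) + 2; // + the enclosing [ and ]

var_dump($exact === strlen(json_encode($chunk))); // bool(true)
```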

In any case, once the chunks are ready, all we need to do is write 'em up:

foreach($chunks as $n => $chunk) {
    $json = json_encode($chunk);
    file_put_contents("tmp/chunk_{$n}.json", $json);
}

... or matching whatever your chunk naming and directory scheme may be.

Perhaps there are more clever ways of doing this. That said, as far as I'm aware, nothing in core PHP does this sort of operation out of the box (even for vanilla arrays), and the above should perform reasonably well. Remember to have enough memory available. :)

P.S. In calculating the size, we add +1 for each item, standing for the comma delimiters in {},{},{}. Strictly speaking, you'd also want to add +2 to the grand total for the enclosing brackets, because the output will be [{},{},{}] while we're only counting the length of each array item as a separate JSON object; and since N items only need N-1 commas, the net correction comes to +1. With other data structures, your compensation mileage may vary.
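That bookkeeping can be sanity-checked: per-item strlen + 1 counts one comma too many, and the brackets add two, so the full document comes out at exactly the running total + 1:

```php
<?php
// Verify the size accounting used in the chunking loop above.
$items = [
    ['category' => '179535', 'email' => null, 'level' => 1, 'name' => 'FOO'],
    ['category' => '1795',   'email' => null, 'level' => 1, 'name' => 'BARFOO'],
    ['category' => '16985',  'email' => null, 'level' => 1, 'name' => 'FOOBAR'],
];

$total = 0;
foreach ($items as $set) {
    $total += strlen(json_encode($set)) + 1; // +1 per item, as in the loop
}

var_dump(strlen(json_encode($items)) === $total + 1); // bool(true)
```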


Optimization update: If you choose the "exact size" approach and want to optimize memory usage, you're better off integrating the JSON write into the chunking loop. (Thanks @NigelRen for the suggestion.) As follows (other initial variables as before):

$chunk = [];
foreach($data as $n => $set) {
    if ($size > $maxbytes) {
        file_put_contents("tmp/chunk_{$chunkno}.json", json_encode($chunk));
        $chunk = [];
        $chunkno++;
        $size = 0;
    }
    $size += strlen(json_encode($set)) + 1;
    $chunk[] = $set;
    //  unset($data[$n]); // in case of memory issues, see notes
}
// don't forget to flush the final, partial chunk
if ($chunk) {
    file_put_contents("tmp/chunk_{$chunkno}.json", json_encode($chunk));
}

In case you're curious about the impact: with this approach, memory usage comes to (used, peak) 1.06 MB / 29.34 MB; with the separate write routine, 26.29 MB / 31.8 MB. Both figures include the unset($data) call, nixing the initial array and freeing up memory. CPU-wise, there's no significant difference between the two options.
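The (used, peak) figures above can be gathered along these lines; the item count and placement of the measurements here are assumptions, not the original test setup, and exact numbers will vary with PHP version, platform, and data:

```php
<?php
// Sketch: measuring the memory impact of a chunking run.
$before = memory_get_usage();

$data = array_fill(0, 100000, [
    'category' => '1795', 'email' => null, 'level' => 1, 'name' => 'BARFOO',
]);
// ... run the chunking/writing loop here ...
unset($data); // nix the initial array, as in the figures above

printf(
    "used: %.2f MB, peak: %.2f MB\n",
    (memory_get_usage() - $before) / 1024 / 1024,
    memory_get_peak_usage() / 1024 / 1024
);
```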

One could also purge members of the $data array after each append to $chunk[], but at a 5 MB chunk size the memory benefit is negligible. It's the loading/defining of the initial array itself that's expensive, being the major factor in the peak memory figure. (The test array I used occupied 29.25 MB before any processing began.)

You can get the strlen in bytes and do your calculations from there:

$total_size  = strlen(json_encode($array)) / 1024 / 1024; // total JSON size in MB
$num_files   = ceil($total_size / 5);                     // number of ~5MB files needed
$chunk_size  = ceil(count($array) / $num_files);          // items per file
$chunked_array = array_chunk($array, $chunk_size);

foreach($chunked_array as $key => $chunk) {
    $i = $key + 1;
    file_put_contents("file{$i}.json", json_encode($chunk));
}
  • Get the total size in bytes of the JSON-encoded array and convert it to MB
  • Divide that total size by 5 MB to get the number of files needed
  • Divide the item count by the number of files to get the chunk size, and chunk the array
  • Loop, JSON-encode each chunk, and write it to a file

Or, to do the calculation entirely in bytes:

$total_size = strlen(json_encode($array));              // bytes
$num_files  = ceil($total_size / (5 * 1024 * 1024));    // 5 MB in bytes
$chunk_size = ceil(count($array) / $num_files);

Let's assume that each item has the same structure, thus:

    1500 items  ≈ 5 MB
    25500 items ≈ 85 MB

    85 MB / 5 MB = 17 files
    25500 / 17   = 1500 items per file

The code can be something like this:

$i = 1;
foreach (array_chunk($array, 1500) as $arr) {
    // save each 1500-item chunk in its own file
    file_put_contents("file{$i}.json", json_encode($arr));
    $i++;
}

Please try this workaround:

<?php
    $array = array (
        0 => array (
            'category' => '179535',
            'email' => NULL,
            'level' => 1,
            'name' => 'FOO'
        ),
        1 => array (
            'category' => '1795',
            'email' => NULL,
            'level' => 1,
            'name' => 'BARFOO'
        ),
        2 => array (
            'category' => '16985',
            'email' => NULL,
            'level' => 1,
            'name' => 'FOOBAR'
        )
    );

    $len = sizeof($array);
    $fileNameIndex = 1;
    for($i=0;$i<$len;$i++)
    {
        $fileName = 'file'.$fileNameIndex.'.json';
        $fileExist = file_exists($fileName);
        $fileSize = 0;
        $current = null;
        if($fileExist)
        {
            $fileSize = filesize($fileName);
            $current = json_decode(file_get_contents($fileName), true);
        }
        if($fileExist && $fileSize < 5242880)
        {
            WriteToFile($fileNameIndex, $current, $array[$i], $i);
        }
        else if(!$fileExist)
        {
            WriteToFile($fileNameIndex, $current, $array[$i], $i);
        }
        else
        {
            $fileNameIndex ++;
            $current = null; // start the new file empty instead of copying the full one
            WriteToFile($fileNameIndex, $current, $array[$i], $i);
        }
    }

    function WriteToFile($fileNameIndex, $current, $data, $i)
    {
        $fileName = 'file'.$fileNameIndex.'.json';
        echo "$i index array is being written in $fileName. <br/>";
        $fp = fopen($fileName, 'w');
        if($current)
        {
            array_push($current, $data);
        }
        else
        {
            $current = [];
            array_push($current, $data);
        }
        fwrite($fp, json_encode($current));
        fclose($fp);
    }
?>
