
Break A Large File Into Many Smaller Files With PHP

I have a 209MB .txt file with about 95,000 lines that is automatically pushed to my server once a week to update some content on my website. The problem is I cannot allocate enough memory to process such a large file, so I want to break the large file into smaller files with 5,000 lines each.

I cannot use file() at all until the file is broken into smaller pieces, so I have been working with SplFileObject. But I have gotten nowhere with it. Here's some pseudocode of what I want to accomplish:

read the file contents

while there are still lines left to be read in the file
    create a new file
    write the next 5000 lines to this file
    close this file

for each file created
    run mysql update queries with the new content

delete all of the files that were created
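Since the question already uses SplFileObject, the pseudocode above can be sketched roughly as follows. This is only a sketch: the function name, arguments, and chunk file naming are illustrative, not from the original post.

```php
<?php
// Sketch of the pseudocode above using SplFileObject. Chunk files are
// named <prefix>1.txt, <prefix>2.txt, ...; returns the number of chunks.
function splitFile(string $source, string $prefix, int $linesPerChunk = 5000): int
{
    $reader = new SplFileObject($source, 'r');
    $chunks = 0;

    while (!$reader->eof()) {
        $chunks++;
        $writer = new SplFileObject($prefix . $chunks . '.txt', 'w');
        for ($i = 0; $i < $linesPerChunk && !$reader->eof(); $i++) {
            $writer->fwrite($reader->fgets());
        }
        // run the MySQL updates against this chunk here, then unlink() it
        // (the last two steps of the pseudocode)
    }

    return $chunks;
}
```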

The file is in CSV format.

EDIT: Here is the solution for reading the file line by line, based on the answers below:

function getLine($number) {
    global $handle, $index;
    $offset = $index[$number];
    fseek($handle, $offset);
    return explode("|",fgets($handle));
}

$handle = @fopen("content.txt", "r");
$index = array(0);  // line 0 starts at byte offset 0

while (false !== ($line = fgets($handle))) {
    $index[] = ftell($handle);
}

print_r(getLine(18437));

fclose($handle);
//MySQL Connection Stuff goes here

$handle = fopen('/path/to/bigfile.txt','r');  //open big file with fopen
$f = 1; //new file number

while(!feof($handle))
{
    $newfile = fopen('/path/to/newfile' . $f . '.txt','w'); //create new file to write to with file number
    for($i = 1; $i <= 5000; $i++) //for 5000 lines
    {
        $import = fgets($handle);
        fwrite($newfile,$import);
        if(feof($handle))
        {break;} //If file ends, break loop
    }
    fclose($newfile);
    //MySQL newfile insertion stuff goes here
    $f++; //Increment newfile number
}
fclose($handle);

This should work: the big file is copied 5,000 lines at a time into output files named newfile1.txt, newfile2.txt, etc. The chunk size can be adjusted via the $i <= 5000 condition in the for loop.

Oh, I see: you want to insert the data from the big file, not store information about the files. In that case, just use fopen/fgets and insert until feof.
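A minimal sketch of that fopen/fgets-and-insert loop, assuming PDO and made-up table and column names. SQLite is used here only so the sketch is self-contained; with MySQL you would construct PDO with a mysql: DSN and credentials instead.

```php
<?php
// Stream the file and insert each line as a row; only the current line
// is ever held in memory. Table/column names (content, a, b) are made up.
function importLines(PDO $db, string $path): int
{
    $db->exec('CREATE TABLE IF NOT EXISTS content (a TEXT, b TEXT)');
    $stmt = $db->prepare('INSERT INTO content (a, b) VALUES (?, ?)');

    $rows = 0;
    $fp = fopen($path, 'r');
    while (false !== ($line = fgets($fp))) {
        $values = explode(',', rtrim($line, "\r\n"));
        $stmt->execute([$values[0], $values[1] ?? null]);
        $rows++;
    }
    fclose($fp);

    return $rows;
}
```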

If your big file is in CSV format, I suspect that you need to process it line by line and don't actually need to break it into smaller files. There should be no need to hold 5,000 or more lines in memory at once! To do that, simply use PHP's "low-level" file functions:

$fp = fopen("path/to/file", "r");

while (false !== ($line = fgets($fp))) {
    // Process $line, e.g split it into values since it is CSV.
    $values = explode(",", $line);

    // Do stuff: Run MySQL updates, ...
}

fclose($fp);

If you need random access, e.g. reading a line by its line number, you could create a "line index" for your file:

$fp = fopen("path/to/file", "r");

$index = array(0);

while (false !== ($line = fgets($fp))) {
    $index[] = ftell($fp);  // get the current byte offset
}

Now $index maps line numbers to byte offsets and you can navigate to a line by using fseek():

function get_line($number)
{
    global $fp, $index;
    $offset = $index[$number];
    fseek($fp, $offset);
    return fgets($fp);
}

$line10 = get_line(10);

// ... Once you are done:
fclose($fp);

Note that I started line counting at 0, unlike text editors.

You can use fgets to read line by line.

You'll need to create a function to put the read contents into a new file. Example:

function load($startLine) {
    // read the original file from the point $startLine
    // put the contents into a new file
}

After this, you can call this function recursively, passing the new start line to it on each reading cycle.
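Rather than literal recursion, the same idea can be expressed with a byte offset that each call returns for the next one. The function and variable names here are illustrative:

```php
<?php
// Read up to $count lines starting at byte $offset, and return the
// lines together with the offset where the next chunk begins.
function readChunk(string $path, int $offset, int $count): array
{
    $fp = fopen($path, 'r');
    fseek($fp, $offset);

    $lines = [];
    for ($i = 0; $i < $count && false !== ($line = fgets($fp)); $i++) {
        $lines[] = $line;
    }
    $next = ftell($fp); // resume point for the next call
    fclose($fp);

    return [$lines, $next];
}
```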

This should do the trick for you. I don't have a very large text file, but I tested with a file that is 1,300 lines long and it split the file into 3 files:

    // Store the line no:
    $i = 0;
    // Store the output file no:
    $file_count = 1;
    // Create a handle for the input file:
    $input_handle = fopen('test.txt', "r") or die("Can't open input file.");
    // Create an output file:
    $output_handle = fopen('test-'.$file_count.'.txt', "w") or die("Can't open output file.");

    // Loop through the file until you get to the end:
    while (!feof($input_handle)) 
    {
        // Read from the file:
        $buffer = fgets($input_handle);
        // Write the read data from the input file to the output file:
        fwrite($output_handle, $buffer);
        // Increment the line no:
        $i++;
        // If on the 5000th line:
        if ($i==5000)
        {
            // Reset the line no:
            $i=0;
            // Close the output file:
            fclose($output_handle);
            // Increment the output file count:
            $file_count++;
            // Create the next output file:
            $output_handle = fopen('test-'.$file_count.'.txt', "w") or die("Can't open output file.");
        }
    }
    // Close the input file:
    fclose($input_handle);
    // Close the output file:
    fclose($output_handle);

The problem you may now find is that the script's execution time is too long when you are dealing with a 200+ MB file.
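If execution time is the limit you hit, raising PHP's limits for this one job is worth considering. The values below are placeholders; tune them to your server:

```php
<?php
set_time_limit(0);               // 0 = remove the execution time limit
ini_set('memory_limit', '256M'); // placeholder value; adjust as needed
```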

If this is running on a Linux server, simply have PHP execute the following on the command line:

split -l 5000 -a 4 test.txt out

Then glob the results for file names, which you can then fopen().
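Calling split from PHP and picking up the pieces could look like this. The helper name is made up, and it assumes coreutils' split is on the PATH:

```php
<?php
// Let the shell's `split` do the chunking, then glob the pieces.
// Returns the list of chunk file paths (prefix + aaaa, aaab, ...).
function splitWithCoreutils(string $path, string $prefix, int $lines = 5000): array
{
    exec(sprintf(
        'split -l %d -a 4 %s %s',
        $lines,
        escapeshellarg($path),
        escapeshellarg($prefix)
    ));

    return glob($prefix . '*');
}
```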


I think your algorithm is awkward; it looks like you're breaking up files for no reason. If you simply fopen the initial data file and read it line by line, you can still perform the MySQL insertion, then just remove the file.
