
How to optimize loops for large CSV files data extraction

I have a question about code optimization. I haven't coded anything besides simple loops in over ten years.

I created the code below, which works fine but is super slow for my needs.

In essence, I have 2 CSV files:

  • a source CSV file that has about 500 000 records, let's say: att1, att2, source_id, att3, att4 (in reality there are about 40 columns)
  • a main CSV file that has about 120 million records, let's say: att1, att2, att3, main_id, att4 (in reality there are about 120 columns)

For each source_id in the source file, my code scans the main file for all the lines where main_id == source_id and writes each of those lines to a new file.

Do you have any suggestions on how I could optimize the code to make it run much, much faster?

<?php

$mf = "main.csv";
$mf_max_line_length = "512";
$mf_id = "main_id";

$sf = "source.csv";
$sf_max_line_length = "884167";
$sf_id = "source_id";


if (($mf_handle = fopen($mf, "r")) !== FALSE)
{
    // Read the first line of the main CSV file
    // and look for the position of main_id
    $mf_data = fgetcsv($mf_handle, $mf_max_line_length, ",");
    $mf_id_pos = array_search ($mf_id, $mf_data);

    // Create a new main CSV file
    if (($nmf_handle = fopen("new_main.csv", "x")) !== FALSE)
    {
        fputcsv($nmf_handle,$mf_data);
    } else {
        // Abort: "break" is only valid inside a loop, so exit instead
        echo "Cannot create file: new_main.csv";
        exit;
    }
}

// Open the source CSV file
if (($sf_handle = fopen($sf, "r")) !== FALSE)
{
    // Read the first line of the source CSV file
    // and look for the position of source_id
    $sf_data = fgetcsv($sf_handle, $sf_max_line_length, ",");
    $sf_id_pos = array_search ($sf_id, $sf_data);

    // Go through the whole source CSV file
    while (($sf_data = fgetcsv($sf_handle, $sf_max_line_length, ",")) !== FALSE)
    {
        // Open the main CSV file
        if (($mf_handle = fopen($mf, "r")) !== FALSE)
        {
            // Go through the whole main CSV file
            while (($mf_data = fgetcsv($mf_handle, $mf_max_line_length, ",")) !== FALSE)
            {
                // If the source_id matches the main_id
                // then we write it into the new_main CSV file
                if ($mf_data[$mf_id_pos] == $sf_data[$sf_id_pos])
                {
                    fputcsv($nmf_handle,$mf_data);
                }
            }
            fclose($mf_handle);
        }
    }
    fclose($sf_handle);
    fclose($nmf_handle);
}

?>

Sounds like a job for MySQL.

First, you'll need to create tables based on all your fields. See here.
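
As a rough sketch, the CREATE TABLE statements for the simplified five-column layout in the question could look like the following (the column types are placeholders, and the real tables would declare all ~40 and ~120 columns):

-- Placeholder definitions matching the simplified column lists above;
-- adjust names and types to the actual CSV headers and data.
CREATE TABLE source_table (
    att1      VARCHAR(255),
    att2      VARCHAR(255),
    source_id VARCHAR(64),
    att3      VARCHAR(255),
    att4      VARCHAR(255)
);

CREATE TABLE main_table (
    att1    VARCHAR(255),
    att2    VARCHAR(255),
    att3    VARCHAR(255),
    main_id VARCHAR(64),
    att4    VARCHAR(255)
);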

Then, you'll load your data. See here.
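
For the load step, LOAD DATA INFILE is the standard MySQL tool; a minimal sketch, assuming the files are readable by the MySQL server, the path below is a placeholder, and each file starts with a header row (as the PHP code above expects):

-- Load the big file; repeat with source.csv into source_table.
LOAD DATA INFILE '/path/to/main.csv'
INTO TABLE main_table
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;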

Finally, you'll create a query like:

SELECT * INTO OUTFILE '/tmp/something.csv' 
    FIELDS TERMINATED BY ',' ENCLOSED BY '"' 
    LINES TERMINATED BY '\n' 
FROM source_table INNER JOIN main_table ON 
    source_table.source_id=main_table.main_id;
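
One note on performance: with roughly 120 million rows in main_table, that join is only fast if the join columns are indexed. Assuming no indexes were declared when the tables were created, something like this (index names are arbitrary) is worth running before the export:

-- Index both join columns so the INNER JOIN can use index lookups
-- instead of scanning main_table once per source row.
CREATE INDEX idx_main_id ON main_table (main_id);
CREATE INDEX idx_source_id ON source_table (source_id);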
