Importing large CSV files in MySQL using Laravel

I have a csv file that can range from 50k to over 100k rows of data.

I'm currently using Laravel with Laravel Forge, MySQL, and the Maatwebsite Laravel Excel package.

This is to be used by an end user rather than by myself, so I have created a simple form in my Blade view as such:

{!! Form::open(
    array(
        'route' => 'import.store', 
        'class' => 'form',
        'id' => 'upload',
        'novalidate' => 'novalidate', 
        'files' => true)) !!}

    <div class="form-group">
        <h3>CSV Product Import</h3>
        {!! Form::file('upload_file', array('class' => 'file')) !!}
    </div>

    <div class="form-group">
        {!! Form::submit('Upload Products', array('class' => 'btn btn-success')) !!}
    </div>
{!! Form::close() !!}

This then stores the file on the server successfully and I'm now able to iterate through the results using something such as a foreach loop.
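For context, the controller side looks roughly like the following. This is a simplified sketch, assuming Laravel 5.x: ImportCsv is a stand-in for the queued job described further down, and the storage path and file name are arbitrary.

use App\Jobs\ImportCsv;
use Illuminate\Http\Request;

// Simplified: move the upload somewhere the queue worker can read it, then queue the import
public function store(Request $request)
{
    $path = $request->file('upload_file')
                    ->move(storage_path('imports'), uniqid() . '.csv')
                    ->getPathname();

    // The heavy lifting happens in a queued job, not in the web request
    $this->dispatch(new ImportCsv($path));

    return redirect()->back()->with('status', 'Import queued');
}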

Now here are the issues I'm facing, in chronological order, along with my fixes/attempts (using a 10k-row test CSV file):

  1. [issue] PHP times out.
  2. [remedy] Changed it to run asynchronously via a job command.
  3. [result] Imports up to 1500 rows.
  4. [issue] Server runs out of memory.
  5. [remedy] Added a swap drive of 1GB.
  6. [result] Imports up to 3000 rows.
  7. [issue] Server runs out of memory.
  8. [remedy] Turned on chunking of results, 250 rows per chunk (a rough sketch of the chunked, queued job follows this list).
  9. [result] Imports up to 5000 rows.
  10. [issue] Server runs out of memory.
  11. [remedy] Removed some transposing/joined-tables logic.
  12. [result] Imports up to 7000 rows.
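Here is a rough sketch of what that chunked, queued job looks like, assuming Laravel 5.x queued jobs and the Laravel Excel 2.x chunk filter; the class and property names are illustrative rather than my exact code.

<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Maatwebsite\Excel\Facades\Excel;

// Illustrative queued import job: only ~250 rows are hydrated at any one time
class ImportCsv implements ShouldQueue
{
    use InteractsWithQueue, Queueable, SerializesModels;

    protected $path;

    public function __construct($path)
    {
        $this->path = $path;
    }

    public function handle()
    {
        // Laravel Excel 2.x chunk filter: the callback receives 250 rows at a time
        Excel::filter('chunk')->load($this->path)->chunk(250, function ($results) {
            foreach ($results as $row) {
                // validate the row and insert into the related tables here
            }
        });
    }
}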

As you can see, the results are marginal and nowhere near 50k; I can barely even make it near 10k.

I've read up and looked into possible suggestions such as:

  • Use a raw query to run LOAD DATA LOCAL INFILE.
  • Split files before importing.
  • Store on the server, then have the server split it into files and have a cron process them.
  • Upgrade my 512MB DO droplet to 1GB as a last resort.

Going with LOAD DATA LOCAL INFILE may not work because my header columns could change per file; that's why I have logic to process/iterate through them.
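For completeness, here is roughly what that route would look like if the column list could be built per file. This is just a sketch: the path, table, and column names are placeholders, and local_infile has to be enabled on both the MySQL server and the PDO connection.

// Illustrative only: building the column list per upload is exactly the part that varies
$columns = '(profile_id, category_id, sku, title)';

$sql = <<<SQL
LOAD DATA LOCAL INFILE '/path/to/upload.csv'
INTO TABLE inventory
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\\n'
IGNORE 1 LINES
$columns
SQL;

// Run it straight through the underlying PDO connection
DB::connection()->getPdo()->exec($sql);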

Splitting files before importing is fine under 10k, but for 50k or more? That would be highly impractical.

Store it on the server and then have the server split it and run the pieces individually without troubling the end user? Possibly, but I'm not even sure how to achieve this in PHP yet; I've only briefly read about it.

Also to note, my queue worker is set to time out after 10,000 seconds, which is also very impractical and bad practice, but it seemed to be the only way to keep it running before memory takes a hit.

Now I could give in and just upgrade the memory to 1GB, but I feel that at best it may get me to 20k rows before it fails again. Something needs to process all these rows quickly and efficiently.

Lastly here is a glimpse of my table structure:

Inventory
+----+------------+-------------+-------+---------+
| id | profile_id | category_id |  sku  |  title  |
+----+------------+-------------+-------+---------+
|  1 |         50 |       51234 | mysku | mytitle |
+----+------------+-------------+-------+---------+

Profile
+----+---------------+
| id |     name      |
+----+---------------+
| 50 | myprofilename |
+----+---------------+

Category
+----+------------+--------+
| id | categoryId |  name  |
+----+------------+--------+
|  1 |      51234 | brakes |
+----+------------+--------+

Specifics
+----+---------------------+------------+-------+
| id | specificsCategoryId | categoryId | name  |
+----+---------------------+------------+-------+
|  1 |                  20 |      57357 | make  |
|  2 |                  20 |      57357 | model |
|  3 |                  20 |      57357 | year  |
+----+---------------------+------------+-------+

SpecificsValues
+----+-------------+-------+--------+
| id | inventoryId | name  | value  |
+----+-------------+-------+--------+
|  1 |           1 | make  | honda  |
|  2 |           1 | model | accord |
|  3 |           1 | year  | 1998   |
+----+-------------+-------+--------+

Full CSV Sample
+----+------------+-------------+-------+---------+-------+--------+------+
| id | profile_id | category_id |  sku  |  title  | make  | model  | year |
+----+------------+-------------+-------+---------+-------+--------+------+
|  1 |         50 |       51234 | mysku | mytitle | honda | accord | 1998 |
+----+------------+-------------+-------+---------+-------+--------+------+

So a quick run-through of my logic workflow, as simply as possible, would be:

  1. Load the file into Maatwebsite/Laravel-Excel and iterate through it in a chunked loop.
  2. Check whether category_id and sku are empty; if so, skip the row and log an error to an array.
  3. Look up category_id and pull all the relevant column fields from the related tables it uses, then insert into the database if nothing is null (a sketch of steps 2 and 3 follows this list).
  4. Generate a custom title with some more logic, using the fields available in the file.
  5. Rinse and repeat.
  6. Lastly, export the errors array to a file and log it into a database for download, so errors can be viewed at the end.
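To make steps 2 and 3 a bit more concrete, here is a simplified sketch of what happens for a single row inside the chunk loop; the table and column names just mirror the samples above, and $errors/$rowNumber are placeholders for my actual bookkeeping.

if (empty($row['category_id']) || empty($row['sku'])) {
    // Step 2: skip incomplete rows and remember why
    $errors[] = "Row {$rowNumber}: missing category_id or sku";
    continue;
}

// Step 3: only insert when the category actually exists
$category = DB::table('category')->where('categoryId', $row['category_id'])->first();

if ($category) {
    DB::table('inventory')->insert([
        'profile_id'  => $row['profile_id'],
        'category_id' => $row['category_id'],
        'sku'         => $row['sku'],
        'title'       => $row['title'],
    ]);
}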

I hope someone can share some insight on possible ways to tackle this, keeping in mind that I'm using Laravel, and that it's not a simple upload: I need to process each line and put it into different related tables, otherwise I would LOAD DATA INFILE it all at once.

Thanks!

You seem to have already figured out the logic for interpreting the CSV lines and converting them to insert queries on the database, so I will focus on the memory exhaustion issue.

When working with large files in PHP, any approach that loads the entire file into memory will either fail, become unbearably slow, or require a lot more RAM than your Droplet has.

So my advice is:

Read the file line by line using fgetcsv

$handle = fopen('file.csv', 'r');

if ($handle) {
    // fgetcsv returns one parsed row per call and false at end of file
    while (($line = fgetcsv($handle)) !== false) {
        // Process this line and save it to the database
    }

    fclose($handle);
}

This way only one row at a time will be loaded into memory. Then, you can process it, save it to the database, and overwrite it with the next one.

Keep a separate file handle for logging

Your server is short on memory, so logging errors to an array may not be a good idea, as all the errors will be kept in it. That can become a problem if your CSV has lots of entries with empty SKUs and category IDs.

Laravel comes out of the box with Monolog, and you can try to adapt it to your needs. However, if it also ends up using too many resources or doesn't fit your needs, a simpler approach may be the solution.

$log = fopen('log.txt', 'w');

// $someCondition stands for whatever makes a row invalid
if ($someCondition) {
    fwrite($log, $text . PHP_EOL);
}

fclose($log);

Then, at the end of the script you can store the log file wherever you want.

Disable Laravel's query log

Laravel keeps all your queries stored in memory, and that's likely to be a problem for your application. Luckily, you can use the disableQueryLog method to free some precious RAM.

DB::connection()->disableQueryLog();

Use raw queries if needed

I think it's unlikely that you will run out of memory again if you follow these tips, but you can always sacrifice some of Laravel's convenience to extract that last drop of performance.

If you know your way around SQL, you can execute raw queries against the database.
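For instance, a plain parameterized insert through the DB facade skips Eloquent model hydration entirely. A minimal sketch, with table and column names borrowed from your samples:

// One bound insert per CSV row; no Eloquent models are created
DB::insert(
    'insert into inventory (profile_id, category_id, sku, title) values (?, ?, ?, ?)',
    [$row['profile_id'], $row['category_id'], $row['sku'], $row['title']]
);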


Edit:

As for the timeout issue, you should be running this code as a queued task, as suggested in the comments, regardless. Inserting that many rows WILL take some time (especially if you have lots of indexes), and the user shouldn't be staring at an unresponsive page for that long.
