Split a huge CSV by columns using Miller

Question

I need to split huge (>1 GB) CSV files, each containing 50K+ columns, on a daily basis.

I've found Miller to be an interesting and performant tool for such a task.

But I'm stuck on Miller's documentation.

How could I split one CSV into N smaller CSV files, where N is a number of rows in my source file?

Answer 1

Try this script:

mlr --csv put -S 'if (NR % 10000 == 0) {$rule=NR} else {$rule = ""}' \
then fill-down -f rule \
then put -S 'if ($rule=="") {$rule="0"}' \
then put -q 'tee > $rule.".csv", $*' input.csv

Make a copy of your CSV in a new folder, and then run this script on it. It will produce a CSV file for every 10000 rows.
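
To see which rows end up in which file, the three steps of the script (mark every 10000th record, fill the mark down, default to "0") can be imitated in a few lines of Python. This is only a sketch of the intended logic, not Miller itself. The function name rule_keys and the small numbers in the example are made up for illustration, and it assumes fill-down carries the last non-empty rule value forward, as the script intends.

```python
from collections import Counter

def rule_keys(n_records, chunk_size=10000):
    """Return the output-file stem ("rule") for each 1-based record number."""
    keys = []
    last = ""  # value carried forward, as fill-down is meant to do
    for nr in range(1, n_records + 1):
        # Step 1: mark only records whose number is a multiple of chunk_size.
        rule = str(nr) if nr % chunk_size == 0 else ""
        # Step 2: fill empty marks from the previous record.
        if rule == "":
            rule = last
        last = rule
        # Step 3: records before the first mark go to "0".
        keys.append(rule if rule != "" else "0")
    return keys

if __name__ == "__main__":
    # Small numbers for readability: 25 records, chunks of 10.
    print(Counter(rule_keys(25, 10)))
```

With 25 records and a chunk size of 10, records 1 to 9 map to "0", records 10 to 19 to "10", and records 20 to 25 to "20". So under this logic the first file (0.csv) holds one row fewer than the chunk size, and each later file is named after the record number of its first row.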

Answer 2

The answer from aborruso adds a new column, rule, to the output CSV files. If you want to avoid this, use emit with mapexcept instead of tee in the last step, like this:

mlr --csv put -S 'if (NR % 10000 == 0) {$rule=NR} else {$rule = ""}' \
then fill-down -f rule \
then put -S 'if ($rule=="") {$rule="0"}' \
then put -q 'emit > $rule.".csv", mapexcept($*, "rule")' input.csv
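
As a rough illustration of what the last step changes: the file name is still taken from the rule field, but the record that gets written no longer contains it. The Python below only mimics that idea with a plain dict. The helper map_except and the field names id and value are hypothetical and are not part of Miller or of the original data.

```python
def map_except(record, *keys):
    """Return a copy of record without the given keys (the input is not modified)."""
    return {k: v for k, v in record.items() if k not in keys}

# A made-up record as it might look after the fill-down step.
record = {"id": "42", "value": "x", "rule": "10000"}

out_name = record["rule"] + ".csv"     # the file name still comes from rule
to_write = map_except(record, "rule")  # but rule itself is not written out
```

Here to_write keeps only id and value, while record is left unchanged, so the rule value stays available for choosing the output file.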

Answer 1: 6, accepted, 2019-04-15 10:24:22

Answer 2: 3, 2021-06-09 14:51:19
