
Memory issues when row-wise appending 2 csv files

I have a larger csv file (about 550 MB) and a smaller csv file (about 5 MB), and I want to combine all the rows into one csv file. They both have the same header (same order, values, and number of columns), and obviously the larger file has more rows. I'm using 32-bit Python (can't change it) and I'm having issues appending the csvs. The top answer and the one after it seem to work here: How do I combine large csv files in python? However, this takes an ungodly amount of time, and I am looking for ways to expedite the process. Also, when I stop running the code from the second answer on the linked question (since it takes so long to run), the first row in the resulting csv is always empty. I guess when you call pd.to_csv(..., mode='a', ...), it appends below the first row of the csv. How do you ensure the first row is populated?

This is much simpler on the Linux command line, and it won't need to load the file into memory.

Use the tail command; -n +2 makes it print from line 2 onward, skipping the header line. Often for me, because of how the files are formatted, I need +2 instead of +1:

tail -n +2 small.csv >> giant.csv

This should do the trick.

If you need to do it in Python, something like append mode might work, but a naive approach will load the file into memory.
