Efficiently join multiple CSV files keeping the header from first file in C#

I have multiple CSV files, each of which can be hundreds of megabytes or more. They all have the same header row at the start of the file and a CRLF at the end of each line. Each file may or may not have a CRLF at the end of the file. The goal is to:

  1. Join a list of files.
  2. Keep the header from the first file.
  3. Output them to a new file.
  4. These files may have thousands of columns and millions of rows.
  5. The files must be processed in the order given, and the order of the rows is significant.

Given the size of the files, this needs to be as fast and memory-efficient as possible.

If the headers are the same, you can just open one write stream, then go through the input files in order, opening a read stream for each and copying its data. The first file is copied in its entirety. For each subsequent file, the first line is skipped. Since a file may not end with a CRLF, write one before appending the next file's rows so that two rows do not run together.
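Here is a minimal sketch of that stream-copy idea in C#. The `CsvJoiner` class, the `Join` method, and the 1 MiB buffer size are my own illustrative choices, not anything from the question. It works on raw bytes, so it assumes an ASCII-compatible encoding such as UTF-8 and a header line with no embedded line breaks inside quoted fields.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration only.
public static class CsvJoiner
{
    private static readonly byte[] CrLf = { (byte)'\r', (byte)'\n' };

    public static void Join(IReadOnlyList<string> inputPaths, string outputPath)
    {
        const int BufferSize = 1 << 20; // 1 MiB, an arbitrary size worth tuning
        var buffer = new byte[BufferSize];
        bool endsWithNewline = true; // nothing written yet, so no separator is needed

        using (var output = new FileStream(outputPath, FileMode.Create, FileAccess.Write,
                                           FileShare.None, BufferSize))
        {
            for (int i = 0; i < inputPaths.Count; i++)
            {
                using (var input = new FileStream(inputPaths[i], FileMode.Open, FileAccess.Read,
                                                  FileShare.Read, BufferSize,
                                                  FileOptions.SequentialScan))
                {
                    bool skippingHeader = i > 0; // keep the header of the first file only
                    int read;
                    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        int start = 0;
                        if (skippingHeader)
                        {
                            int lf = Array.IndexOf(buffer, (byte)'\n', 0, read);
                            if (lf < 0) continue; // header runs on into the next chunk
                            skippingHeader = false;
                            start = lf + 1;
                            if (start == read) continue; // chunk ended exactly at the header
                        }

                        // The previous file had no trailing CRLF: add one before new rows.
                        if (!endsWithNewline) output.Write(CrLf, 0, CrLf.Length);

                        output.Write(buffer, start, read - start);
                        endsWithNewline = buffer[read - 1] == (byte)'\n';
                    }
                }
            }
        }
    }
}
```

Calling `CsvJoiner.Join(new[] { "a.csv", "b.csv" }, "joined.csv")` would process the files in the order given. Memory use stays at one buffer regardless of file size, and no row is ever parsed or decoded, which is where most of the speed should come from.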

That approach would be the fastest, as long as you are 100% sure the columns align and only the first line needs skipping.
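If you are not completely sure, a cheap pre-check is to read only the first line of every file and compare it with the first file's header before copying anything. The sketch below is one way to do that. `CsvHeaderCheck` and `HeadersMatch` are hypothetical names, and it assumes the files use an encoding that `StreamReader` can detect (UTF-8 by default) and that an exact, case-sensitive match is what you want.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration only.
public static class CsvHeaderCheck
{
    // Returns true when every file's first line equals the first file's first line.
    public static bool HeadersMatch(IReadOnlyList<string> inputPaths)
    {
        string expected = null;
        for (int i = 0; i < inputPaths.Count; i++)
        {
            string header;
            using (var reader = new StreamReader(inputPaths[i]))
            {
                // Reads just the first line; the rest of the file is never loaded.
                header = reader.ReadLine();
            }

            if (i == 0)
            {
                expected = header;
            }
            else if (!string.Equals(expected, header, StringComparison.Ordinal))
            {
                return false;
            }
        }
        return true;
    }
}
```

This only confirms that the header text is identical. It does not prove that every data row has the right number of columns.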

This kind of thing would be quite straightforward to do on a Unix-style command line, by the way.
