
Split large csv file into multiple files and keep header in each part

How do I split a large csv file (1 GB) into multiple files (say, one part with 1000 rows, a second part with 10000 rows, a third with 100000, etc.) and preserve the header in each part?

How can I achieve this

h1 h2
a  aa
b  bb
c  cc
.
.
12483720 rows

into

h1 h2
a  aa
b  bb
.
.
.
1000 rows

And

h1 h2
x  xx
y  yy
.
.
.
10000 rows

Here is a small adaptation of the solution from: Split CSV files into smaller files but keeping the headers?

awk -v l=1000  '(NR==1){header=$0;next}
                (NR==2 || n==l) {
                   c=sprintf("%0.5d",c+1)
                   close(file); file=FILENAME; sub(/csv$/,c".csv",file)
                   print header > file
                   if (NR>2) l*=10
                   n=0
                }
                {print $0 > file; n++}' file.csv

This works in the following way:

  • (NR==1){header=$0;next} : if the record/line is the first line, save that line as the header and move on to the next record.
  • (NR==2 || n==l){...} : when the first data line arrives (NR==2), and every time we have written the requested number of records/lines (n==l), we start writing to a new file and perform the following actions:
    • c=sprintf("%0.5d",c+1) : increase the counter by one, and format it as 000xx .
    • close(file) : close the file we just wrote to, so awk does not run out of open file descriptors.
    • file=FILENAME; sub(/csv$/,c".csv",file) : build the new filename from the input filename.
    • print header > file : open the new file and write the header to it.
    • if (NR>2) l*=10 : from the second output file onward, multiply the maximum record count by ten.
    • n=0 : reset the current record count.
  • {print $0 > file; n++} : write the record to the current file and increment the record count.
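As a quick sanity check, the same logic can be run at a small scale (a sketch assuming GNU awk; the chunk size of 2 and the generated file names are just for the demo, and the NR==2 clause bootstraps the first output file before any data row is written):

```shell
# Small csv: a header plus 10 data rows.
{ echo 'h1,h2'; seq 1 10; } > file.csv

# Same splitting logic, with the first chunk capped at 2 rows.
awk -v l=2 '(NR==1){header=$0;next}
            (NR==2 || n==l){
               c=sprintf("%0.5d",c+1)
               close(file); file=FILENAME; sub(/csv$/,c".csv",file)
               print header > file
               if (NR>2) l*=10
               n=0
            }
            {print $0 > file; n++}' file.csv

wc -l file.00001.csv file.00002.csv
# file.00001.csv holds the header plus 2 rows,
# file.00002.csv the header plus the remaining 8 rows
```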

Another awk. First, some test records:

$ seq 1 1234567 > file

Then the awk:

$ awk 'NR==1{n=1000;h=$0}{print > n}NR==n+c{n*=10;c=NR-1;print h>n}' file

Explained:

$ awk '
NR==1 {           # first record:
    n=1000        # set first output file size and
    h=$0          # store the header
}
{
    print > n     # output to file
}
NR==n+c {         # once target NR has been reached. close(n) goes here if needed
    n*=10         # grow target magnitude
    c=NR-1        # set the correction factor. 
    print h > n   # first the head
}' file

Count the records:

$ wc -l 1000*
   1000 1000
  10000 10000
 100000 100000
1000000 1000000
 123571 10000000
1234571 total

Hacky, but this utilizes the split utility, which does most of the heavy lifting for splitting the files. Then, with the split files following a well-defined naming convention, I loop over the files without the header, concatenate the header with each file body into tmp.txt, and move that file back to the original filename.

# Use `split` utility to split the file csv, with 5000 lines per files,
# adding numerical suffixes, and an additional suffix '.split' to help identify
# files.
split -l 5000 -d --additional-suffix=.split repro-driver-table.csv

# This identifies all files that should NOT have headers
# ls -1 *.split | egrep -v -e 'x0+\.split'

# This identifies files that do have headers
# ls -1 *.split | egrep -e 'x0+\.split'

# Walk the files that do not have headers.  For each one, cat the header from
# file with header, with rest of body, output to tmp.txt, then mv tmp.txt to
# original filename.
for f in $(ls -1 *.split | egrep -v -e 'x0+\.split'); do
    cat <(head -1 $(ls -1 *.split | egrep -e 'x0+\.split')) $f > tmp.txt
    mv tmp.txt $f
done
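The same re-headering pass can also be sketched with shell globs instead of parsing ls, assuming the default x00.split, x01.split, … names produced by split -d above (the sample data and file names here are illustrative):

```shell
# Create a small sample csv and split it into 2-line chunks,
# mirroring the approach above (GNU split assumed).
printf 'h1,h2\na,aa\nb,bb\nc,cc\nd,dd\n' > sample.csv
split -l 2 -d --additional-suffix=.split sample.csv

header_file=x00.split          # the first chunk already carries the header
for f in x*.split; do
    [ "$f" = "$header_file" ] && continue
    # Prepend the header to the chunk body, then move it back in place.
    { head -1 "$header_file"; cat "$f"; } > tmp.txt
    mv tmp.txt "$f"
done

head -1 x01.split              # each chunk now starts with the header
```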

Here's a first approach:

#!/bin/bash
head -1 "$1" > header
# Drop the header before splitting so the first chunk doesn't get it twice.
tail -n +2 "$1" | split - y
for f in y*; do
    cp header "h$f"
    cat "$f" >> "h$f"
done
rm -f header
rm -f y*
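A small runnable sketch of the same idea (the tail -n +2 keeps the header line out of the stream being split, so the first chunk does not carry it twice; -l 2 and the sample data just keep the demo tiny):

```shell
# Sample input: a header plus 3 data rows.
printf 'h1,h2\na,aa\nb,bb\nc,cc\n' > in.csv

head -1 in.csv > header
tail -n +2 in.csv | split -l 2 - y   # chunks yaa (a,b) and yab (c)
for f in y*; do
    cp header "h$f"
    cat "$f" >> "h$f"
done
rm -f header y*

wc -l hyaa hyab
# hyaa: header + 2 rows; hyab: header + 1 row
```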

The following bash solution should work nicely:

IFS='' read -r header
for ((curr_file_max_rows=1000; 1; curr_file_max_rows*=10)) {
    curr_file_name="file_with_${curr_file_max_rows}_rows"
    echo "$header" > "$curr_file_name"
    for ((curr_file_row_count=0; curr_file_row_count < curr_file_max_rows; curr_file_row_count++)) {
        IFS='' read -r row || break 2
        echo "$row" >> "$curr_file_name"
    }
}

The first iteration level produces the number of rows we're going to write for each successive file; it generates the file names and writes the header to them. It is an infinite loop because we don't check how many lines the input has, and therefore don't know beforehand how many files we're going to write, so we'll have to break out of this loop to end it.

Inside this loop we iterate a second time, this time over the number of lines we're going to write to the current file. In this loop we try to read a line from the input. If that works, we write it to the current output file; if it doesn't (we've reached the end of the input), we break out of both levels of the loop.
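Since the loops read from standard input, the script is fed the csv by redirection. A self-contained check (the script name split_growing.sh and the 1500-row input are just for illustration):

```shell
# Write the read-loop solution to a script file (name is illustrative).
cat > split_growing.sh <<'EOF'
IFS='' read -r header
for ((curr_file_max_rows=1000; 1; curr_file_max_rows*=10)) {
    curr_file_name="file_with_${curr_file_max_rows}_rows"
    echo "$header" > "$curr_file_name"
    for ((curr_file_row_count=0; curr_file_row_count < curr_file_max_rows; curr_file_row_count++)) {
        IFS='' read -r row || break 2
        echo "$row" >> "$curr_file_name"
    }
}
EOF

# Feed it a small csv: a header plus 1500 rows.
{ echo 'h1,h2'; seq 1 1500; } > big.csv
bash split_growing.sh < big.csv

wc -l file_with_1000_rows file_with_10000_rows
# first file: header + 1000 rows; second file: header + the remaining 500
```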

