
Split large csv file into multiple files based on column(s)

I would like to know of a fast/efficient way in any program (awk/perl/python) to split a csv file (say 10k columns) into multiple small files, each containing 2 columns. I would be doing this on a unix machine.

#contents of large_file.csv
1,2,3,4,5,6,7,8
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z

I now want multiple files like this:

# contents of 1.csv
1,2
a,b
q,w
a,s
z,x

# contents of 2.csv
1,3
a,c
q,e
a,d
z,c

# contents of 3.csv
1,4
a,d
q,r
a,f
z,v

and so on...

I can currently do this with awk on small files (say 30 columns) like this:

awk -F, 'BEGIN{OFS=",";} {for (i=1; i < NF; i++) print $1, $(i+1) > i ".csv"}' large_file.csv

The above takes a very long time with large files, and I was wondering if there is a faster, more efficient way of doing the same.

Thanks in advance.
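Since the question names python as an option, here is a minimal sketch of one way to do it there (`split_columns` is a hypothetical helper name, and this assumes the buffered pairs fit in memory): each output file's rows are collected first and then written in a single operation, so at most one output file is open at a time.

```python
import csv

def split_columns(src_path):
    """Write i.csv pairing column 1 with column i+1, for every other column."""
    buffers = {}  # output file number -> list of [col1, colN] rows
    with open(src_path, newline="") as src:
        for row in csv.reader(src):
            for i in range(1, len(row)):
                buffers.setdefault(i, []).append([row[0], row[i]])
    # Write each output file once, keeping only one handle open at a time.
    for i, rows in buffers.items():
        with open(f"{i}.csv", "w", newline="") as out:
            csv.writer(out).writerows(rows)
```

Usage would simply be `split_columns("large_file.csv")`, producing 1.csv, 2.csv, ... as in the sample above.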

I needed the same functionality and wrote it in bash. Not sure if it will be faster than ravindersingh13's answer, but I hope it will help someone.

Current version: https://github.com/pgrabarczyk/csv-file-splitter

#!/usr/bin/env bash
set -eu

SOURCE_CSV_PATH="${1}"
LINES_PER_FILE="${2}"
DEST_PREFIX_NAME="${3}"
DEBUG="${4:-0}"

split_files() {
  local source_csv_path="${1}"
  local lines_per_file="${2}"
  local dest_prefix_name="${3}"
  local debug="${4}"

  _print_log "source_csv_path: ${source_csv_path}"
  local dest_prefix_path="$(pwd)/output/${dest_prefix_name}"
  _print_log "dest_prefix_path: ${dest_prefix_path}"

  local headline=$(awk "NR==1" "${source_csv_path}")
  local file_no=0
  
  mkdir -p "$(dirname "${dest_prefix_path}")"

  local lines_in_files=$(wc -l "${source_csv_path}" | awk '{print $1}')
  local files_to_create=$(( (lines_in_files - 1 + lines_per_file - 1) / lines_per_file ))
  _print_log "There are ${lines_in_files} lines in the file. I will create ${files_to_create} files of up to ${lines_per_file} lines each (the last file may have fewer)."

  _print_log "Start processing."

  for (( start_line=1; start_line<lines_in_files; )); do
    last_line=$((start_line+lines_per_file))
    file_no=$((file_no+1))
    local file_path="${dest_prefix_path}$(printf "%06d" ${file_no}).csv"

    if [ $debug -eq 1 ]; then
      _print_log "Creating file ${file_path} with lines [${start_line};${last_line}]"
    fi

    echo "${headline}" > "${file_path}"
    awk "NR>${start_line} && NR<=${last_line}" "${source_csv_path}" >> "${file_path}"

    start_line=$last_line
  done

  _print_log "Done."
}

_print_log() {
  local log_message="${1}"
  local date_time=$(date "+%Y-%m-%d %H:%M:%S.%3N")
  printf "%s - %s\n" "${date_time}" "${log_message}" >&2
}

split_files "${SOURCE_CSV_PATH}" "${LINES_PER_FILE}" "${DEST_PREFIX_NAME}" "${DEBUG}"

Execution:

bash csv-file-splitter.sh "sample.csv" 3 "result_" 1
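For comparison, a compact Python sketch of the same row-splitting idea (a hypothetical `split_rows` helper, not part of the linked repository): each output chunk gets the header line followed by up to `lines_per_file` data rows, with zero-padded file numbers as in the bash script.

```python
def split_rows(src_path, lines_per_file, dest_prefix):
    """Split a CSV into chunks of lines_per_file data rows, repeating the header."""
    with open(src_path) as src:
        header = next(src)  # keep the header to prepend to every chunk
        out, file_no, count = None, 0, 0
        for line in src:
            # Start a new output file at the beginning and whenever a chunk fills up.
            if out is None or count == lines_per_file:
                if out:
                    out.close()
                file_no += 1
                out = open(f"{dest_prefix}{file_no:06d}.csv", "w")
                out.write(header)
                count = 0
            out.write(line)
            count += 1
        if out:
            out.close()
```

For example, `split_rows("sample.csv", 3, "result_")` mirrors the invocation shown above.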

With your shown samples, please try the following awk code. Since the original approach opens all the output files at the same time, it may fail with the infamous "too many open files" error. To avoid that, this version gathers all the values into an array, prints them one by one in the END block, and closes each output file as soon as its contents have been written.

awk '
BEGIN{ FS=OFS="," }
{
  for(i=1;i<NF;i++){
    value[i]=(value[i]?value[i] ORS:"") ($1 OFS $(i+1))
  }
}
END{
  for(i=1;i<=NF;i++){
    outFile=i".csv"
    print value[i] > (outFile)
    close(outFile)
  }
}
' large_file.csv

The main hold-up here is in writing so many files.

Here is one approach:

use warnings;
use strict;
use feature 'say';
    
my $file = shift // die "Usage: $0 csv-file\n";

my @lines = do { local @ARGV = $file; <> };
chomp @lines;

my @fhs = map { 
    open my $fh, '>', "f${_}.csv" or die $!; 
    $fh 
} 
1 .. scalar( split /,/, $lines[0] ) - 1;

foreach my $line (@lines) { 
    my ($first, @cols) = split /,/, $line; 
    say {$fhs[$_]} join(',', $first, $cols[$_]) 
        for 0..$#cols;
}

I didn't time this against any other approaches. Assembling the data for each file first and then dumping it into each file in one operation may help, but first let us know how large the original CSV file is.

Opening so many output files at once (for the @fhs filehandles) may pose problems. If that is the case, then the simplest way is to first assemble all the data, then open and write one file at a time:

use warnings;
use strict;
use feature 'say';

my $file = shift // die "Usage: $0 csv-file\n";

open my $fh, '<', $file or die "Can't open $file: $!";

my @data;
while (<$fh>) {
    chomp;
    my ($first, @cols) = split /,/;
    push @{$data[$_]}, join(',', $first, $cols[$_]) 
        for 0..$#cols;
}

for my $i (0..$#data) {
    open my $fh, '>', $i+1 . '.csv' or die $!;
    say $fh $_ for @{$data[$i]};
}

This depends on whether the entire original CSV file, plus a bit more, can be held in memory.
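If it cannot, one middle ground (a sketch of my own, not code from the answers above) is to process the columns in batches, re-reading the input once per batch, so that only `batch` output files are open at a time and only one row is ever held in memory:

```python
import csv

def split_in_batches(src_path, total_cols, batch=100):
    """Pair column 1 with each other column, writing i.csv for column i+1.

    Only `batch` output files are open at once; the input is re-read
    once per batch, trading extra reads for bounded memory and handles.
    """
    for start in range(1, total_cols, batch):
        cols = range(start, min(start + batch, total_cols))
        outs = {i: open(f"{i}.csv", "w", newline="") for i in cols}
        writers = {i: csv.writer(f) for i, f in outs.items()}
        with open(src_path, newline="") as src:
            for row in csv.reader(src):
                for i in cols:
                    writers[i].writerow([row[0], row[i]])
        for f in outs.values():
            f.close()
```

With the 8-column sample above, `split_in_batches("large_file.csv", 8, batch=3)` would read the file three times and produce 1.csv through 7.csv.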

I tried a solution using the Text::CSV module.

#! /usr/bin/env perl

use warnings;
use strict;
use utf8;
use open qw<:std :encoding(utf-8)>;
use autodie;
use feature qw<say>;
use Text::CSV;

my %hsh = ();

my $csv = Text::CSV->new({ sep_char => ',' });

print "Enter filename: ";
chomp(my $filename = <STDIN>);

open (my $ifile, '<', $filename);

while (<$ifile>) {
    chomp;
    if ($csv->parse($_)) {
        my @fields = $csv->fields();
        my $first  = shift @fields;
        while (my ($i, $v) = each @fields) {
            push @{$hsh{($i + 1) . ".csv"}}, "$first,$v";
        }
    } else {
        die "Line could not be parsed: $_\n";
    }
}

close($ifile);

while (my ($k, $v) = each %hsh) {
    open(my $ofile, '>', $k);
    say {$ofile} $_ for @$v;
    close($ofile);
}

exit(0);
