Split a large file into n files, keeping the first 7 columns plus the next 3 columns, up to the nth column group

Question

I have a huge data frame with the following column names:

A,B,C,D,F,G,H,GT_a,N_a_,E_a,GT_b,N_b_,E_b,GT_c,N_c_,E_c,...,GT_n,N_n_,E_n

Using unix/bash or python, I want to generate n separate files with the following columns:

A,B,C,D,F,G,H,GT_a,N_a_,E_a

A,B,C,D,F,G,H,GT_b,N_b_,E_b

A,B,C,D,F,G,H,GT_c,N_c_,E_c

....

A,B,C,D,F,G,H,GT_n,N_n_,E_n

The files should be named: a.txt, b.txt, c.txt, ..., n.txt

Answer 1

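import pandas as pd
import numpy as np

# Demo frame: 7 fixed columns plus 5 groups of 3 columns (22 columns in total).
c = "A,B,C,D,F,G,H,GT_a,N_a_,E_a,GT_b,N_b_,E_b,GT_c,N_c_,E_c,GT_d,N_d_,E_d,GT_e,N_e_,E_e".split(',')
df = pd.DataFrame(np.full((30, 22), c), columns=c)

cols = list(df.columns)
default = cols[:7]  # columns shared by every output file
# One row per group: (GT_x, N_x_, E_x)
var = pd.DataFrame(np.array(cols[7:]).reshape(-1, 3))

def dump(row):
    group = default + list(row)
    magic = group[-1].split('_', 1)[1]  # "E_a" -> "a", used as the file name
    # index=False keeps the row index out of the output file
    df[group].to_csv(magic + '.txt', index=False)

var.apply(dump, axis=1)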

Answer 2

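This should write out different files, each with a different header. You will have to change COL_NAMES_TO_WRITE to whatever you want.

It uses only the standard library, so no pandas. It will not write out more than 26 different files, but the file name generator can be changed to allow more.

If I interpret the question correctly, you want to split it into 14 files (a..n).

Copy the code below into a file named splitter.py and then run: python3.8 splitter.py -fn largefile.txt -n 14

largefile.txt is the large file you need to split.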

import argparse
import csv
import string

# Header template for each output file; adjust it to match your real column names.
COL_NAMES_TO_WRITE = "A,B,C,D,F,G,H,GT_{letter},N_{letter},E_{letter}"


def output_file_generator(num):
    """Yield the output file names a.txt, b.txt, ... (at most 26)."""
    if num > 26:
        raise ValueError(f"Can only write out 26 different files, not {num}")
    for prefix in string.ascii_lowercase[:num]:
        yield f"{prefix}.txt"


def col_name_generator(num):
    """Yield the list of column names that belongs in each output file."""
    for col_suffix in string.ascii_lowercase[:num]:
        yield COL_NAMES_TO_WRITE.format(letter=col_suffix).split(',')


def main(filename, num_files=4):
    """Split a file into multiple files, one per group of columns.

    Args:
        filename (str): large file that needs to be split into multiple files
        num_files (int): number of files to split filename into
    """
    output_fps = []
    try:
        with open(filename, 'r', newline='') as large_file_fp:
            reader = csv.DictReader(large_file_fp)
            writers = []
            for name, col_names in zip(output_file_generator(num_files),
                                       col_name_generator(num_files)):
                output_fp = open(name, 'w', newline='')
                output_fps.append(output_fp)
                # extrasaction='ignore' drops the columns that belong to other files
                writer = csv.DictWriter(output_fp, fieldnames=col_names,
                                        extrasaction='ignore')
                writer.writeheader()
                writers.append(writer)
            # every row goes to every file, restricted to that file's columns
            for line in reader:
                for writer in writers:
                    writer.writerow(line)
    finally:
        for output_fp in output_fps:
            output_fp.close()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-fn", "--filename", required=True, help="filename of large file to be split")
    parser.add_argument("-n", "--num_files", type=int, default=4, help="number of separate files to split large_file into")
    args = parser.parse_args()
    main(args.filename, args.num_files)
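
The answer notes that the file name generator can be changed to allow more than 26 files. One way to do that might look like the sketch below, which continues with aa.txt, ab.txt, ... after z.txt. The helper name output_names is hypothetical and not part of the original answer, and the column suffixes in the header template would have to follow the same naming scheme.

```python
import itertools
import string


def output_names(num, suffix=".txt"):
    """Yield num file names: a.txt ... z.txt, then aa.txt, ab.txt, ...

    Hypothetical helper, shown only as an illustration.
    """
    produced = 0
    for length in itertools.count(1):
        # all lowercase strings of this length, in alphabetical order
        for letters in itertools.product(string.ascii_lowercase, repeat=length):
            if produced == num:
                return
            produced += 1
            yield "".join(letters) + suffix


names = list(output_names(28))
print(names[0], names[25], names[26], names[27])  # a.txt z.txt aa.txt ab.txt
```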

Answer 3

Here are a couple of solutions using bash tools.

1. bash

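Use cut inside a bash loop. This spawns n processes and parses the file n times.

Update for the case where the _ids in the column names are not just a sequence of letters but arbitrary string ids, repeating every 3 columns after the first 7 columns: we first have to read the header of the file and extract them. A quick solution, for example, is to use awk and print the id from columns 8, 11, and so on into a bash array.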

#!/bin/bash
first=7
#ids=( {a..n} )
ids=( $( head -1 "$1" | awk -F"_" -v RS="," -v f="$first" 'NR>f && (NR+1)%3==0{print $2}' ) )

for i in "${!ids[@]}"; do
    cols="1-$first,$((first+1+3*i)),$((first+2+3*i)),$((first+3+3*i))"
    cut -d, -f"$cols" "$1" > "${ids[i]}.txt"
done

Usage: bash test.sh file
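
For comparison, the two steps of the script (pulling the ids out of the header, then building the field list passed to cut) could be sketched in Python roughly as follows. The function names ids_from_header and cut_spec are hypothetical. The sketch assumes each group starts with a column named GT_<id> and takes everything after the first underscore as the id, which differs slightly from the awk command when an id itself contains an underscore.

```python
import csv


def ids_from_header(path, n_fixed=7, group_size=3):
    """Return the id of each column group, read from the first line of path.

    Assumption: the first column of every group is named GT_<id>.
    """
    with open(path, newline="") as fp:
        header = next(csv.reader(fp))
    # columns 8, 11, 14, ... (1-based) are the first column of each group
    firsts = header[n_fixed::group_size]
    return [name.split("_", 1)[1] for name in firsts]


def cut_spec(i, n_fixed=7, group_size=3):
    """Return the `cut -f` field list for the i-th group (0-based)."""
    start = n_fixed + 1 + group_size * i
    fields = ",".join(str(start + k) for k in range(group_size))
    return f"1-{n_fixed},{fields}"


print(cut_spec(0))  # 1-7,8,9,10
print(cut_spec(1))  # 1-7,11,12,13
```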

2. awk

Or you can use awk. Here I only customize the number of outputs, but the rest could be done as in the first solution.

BEGIN { FS=OFS=","; times=14 } 
{ 
  for (i=1;i<=times;i++) {
    print $1,$2,$3,$4,$5,$6,$7,$(5+3*i),$(6+3*i),$(7+3*i) > sprintf("%c.txt",i+96)
  }
}

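Usage: awk -f test.awk file

This solution should be fast because it parses the file only once. But it should not be used as-is for a large number of output files, since it could throw a "too many open files" error. For a range of letters it should be fine.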
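
If the number of output files might exceed the open-file limit, one possible workaround is to write the outputs in batches, trading extra passes over the input for a bounded number of open files. Below is a minimal Python sketch of that idea, not a tuned implementation. The function name split_columns and the parameters max_open and out_dir are hypothetical, and the ids are assumed to be known already.

```python
import csv
import os
from contextlib import ExitStack


def split_columns(path, ids, n_fixed=7, group_size=3, max_open=100, out_dir="."):
    """Write <id>.txt for each id: the fixed columns plus that id's group.

    At most max_open output files are open at once; the input file is
    re-read once per batch of ids.
    """
    for start in range(0, len(ids), max_open):
        batch = ids[start:start + max_open]
        with ExitStack() as stack:
            reader = csv.reader(stack.enter_context(open(path, newline="")))
            writers = [
                csv.writer(
                    stack.enter_context(
                        open(os.path.join(out_dir, f"{name}.txt"), "w", newline="")
                    ),
                    lineterminator="\n",
                )
                for name in batch
            ]
            for row in reader:
                # offset is the group's position among all ids, not within the batch
                for offset, writer in enumerate(writers, start=start):
                    first = n_fixed + group_size * offset
                    writer.writerow(row[:n_fixed] + row[first:first + group_size])
```

With max_open at least as large as the number of ids, this reads the input once, like the awk version; smaller values cost one extra pass per batch.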

Answer 1
1 2020-08-09 18:54:27

Answer 2
1 2020-08-09 21:37:04

Answer 3
1 accepted 2020-08-10 05:42:17
