Split one file into multiple files based on a delimiter

Question

I have one file with -| as the delimiter after each section. I need to create a separate file for each section using Unix tools.

Example input file

wertretr
ewretrtret
1212132323
000232
-|
ereteertetet
232434234
erewesdfsfsfs
0234342343
-|
jdhg3875jdfsgfd
sjdhfdbfjds
347674657435
-|

Expected result in file 1

wertretr
ewretrtret
1212132323
000232
-|

Expected result in file 2

ereteertetet
232434234
erewesdfsfsfs
0234342343
-|

Expected result in file 3

jdhg3875jdfsgfd
sjdhfdbfjds
347674657435
-|

Answer 1

A one-liner, no programming (apart from the regular expression and so on).

csplit --digits=2  --quiet --prefix=outfile infile "/-|/+1" "{*}"

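Tested on: csplit (GNU coreutils) 8.30

Notes about usage on Apple Mac

"For OS X users, note that the version of csplit that comes with the OS doesn't work. You'll want the version in coreutils (installable via Homebrew), which is called gcsplit." - @丹尼爾

"Just to add, you can get the version for OS X to work (at least with High Sierra). You just need to tweak the args a bit: csplit -k -f=outfile infile "/-\\|/+1" "{3}". The feature that doesn't seem to work is "{*}"; I had to be specific about the number of separators, and needed to add -k to avoid it deleting all the output files when it can't find a final separator. Also, if you want --digits, you need to use -n instead." - @Pebbl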
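Since the comment above reports having to state the number of separators explicitly, it can help to count them before building the command. The following is a minimal Python sketch, assuming the delimiter sits alone on its line; `count_delimiter_lines` is a hypothetical helper name, and how the count maps onto the "{N}" argument should be checked against your own csplit version.

```python
def count_delimiter_lines(lines, delimiter="-|"):
    """Count the lines that consist solely of the delimiter."""
    return sum(1 for line in lines if line.rstrip("\n") == delimiter)


# Sample data shaped like the input file in the question (an assumption).
sample = "wertretr\n-|\nereteertetet\n-|\njdhg3875jdfsgfd\n-|\n"
print(count_delimiter_lines(sample.splitlines(keepends=True)))  # prints 3
```

An open file object can be passed in place of the list, since iterating over it yields one line at a time.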

Answer 2

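awk '{f="file" NR; printf "%s-|\n", $0 > f}' RS='-\\|\n'  input-file

Explanation (edited):

RS is the record separator, and this solution uses a GNU awk extension that allows it to be more than one character. Here RS matches the -| delimiter together with the newline that follows it, so each record is one section without its delimiter line. NR is the record number.

The printf statement prints a record followed by "-|" and a newline into a file whose name contains the record number.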
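The record-separator idea can also be sketched in Python: treat the delimiter plus its newline as the separator, split the whole text on it, and number the pieces the way NR does. This is only an illustration under the assumption that the file fits in memory; `split_records` and `write_records` are hypothetical names, and like the awk version it also splits on a -| that ends a longer line.

```python
from pathlib import Path


def split_records(text, separator="-|\n"):
    """Split text into records, each one ending with the separator."""
    pieces = text.split(separator)
    # Every piece except the last was followed by a separator in the input.
    records = [piece + separator for piece in pieces[:-1]]
    if pieces[-1]:
        records.append(pieces[-1])  # trailing data with no closing separator
    return records


def write_records(text, out_dir, prefix="file"):
    """Write record N to <out_dir>/<prefix>N, numbering from 1 like awk's NR."""
    paths = []
    for number, record in enumerate(split_records(text), start=1):
        path = Path(out_dir) / f"{prefix}{number}"
        path.write_text(record)
        paths.append(path)
    return paths
```

For example, `write_records(Path("input-file").read_text(), ".")` would create file1, file2 and so on in the current directory.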

Answer 3

Debian has csplit, but I don't know whether that is common to all/most/other distributions. If not, it shouldn't be too hard to track down the source and compile it...

Answer 4

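I solved a slightly different problem, where the file contains a line with the name of the file that the following text should go into. This perl code does the trick for me: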

#!/path/to/perl -w

# Not used below; commented out so the script also runs on UNIX systems
# use Win32::Clipboard;

# Get command line flags

#print ($#ARGV, "\n");
# Show the usage text when no arguments are given
if (@ARGV == 0) {
    print STDERR "usage: ncsplit.pl [--mff=SEPARATOR] filename.txt [...] \n\nNote that no space is allowed between the '--' and the related parameter.\n\nThe mff is found on a line followed by a filename.  All of the contents of filename.txt are written to that file until another mff is found.\n";
    exit;
}

# GetOptions removes the options it recognizes from @ARGV

use Getopt::Long;
my $mff = "";
# 'mff=s' makes --mff take a string value instead of acting as a boolean flag
GetOptions('mff=s' => \$mff);

# set a default $mff variable
if ($mff eq "") {$mff = "-#-"};
print ("using file switch=", $mff, "\n\n");

while($_ = shift @ARGV) {
    if(-f "$_") {
        push @filelist, $_;
    }
}

# Could be more than one file name on the command line, 
# but this version throws away the subsequent ones.

$readfile = $filelist[0];

open SOURCEFILE, "<$readfile" or die "File not found...\n\n";
#print SOURCEFILE;

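while (<SOURCEFILE>) {
    # A line of the form "<mff> <name>" starts a new output file
    if (/^$mff (.*)$/) {
        $outname = $1;
        open OUTFILE, ">$outname" or die "Cannot open $outname\n";
        print "opened $outname\n";
    }
    # Any other line goes to the current output file, once one is open
    elsif (defined $outname) {
        print OUTFILE $_;
    }
}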

Answer 5

The following command works for me. Hope it helps.

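awk 'BEGIN{file = 0; filename = "output_" file ".txt"}
    # A delimiter line closes the current file; the next line starts a new one
    /^-\|$/ {print $0 > filename; close(filename); file ++; filename = "output_" file ".txt"; next}
    {print $0 > filename}' input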

Answer 6

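You can also use awk. I'm not very familiar with awk, but the following seemed to work for me. It generated part1.txt, part2.txt, part3.txt, and part4.txt. Note that the last partn.txt file it generates is empty. I'm not sure how to fix that, but I'm sure it could be done with a little tweaking. Does anyone have suggestions?

awk_pattern file: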

BEGIN{ fn = "part1.txt"; n = 1 }
{
   print > fn
   if (substr($0,1,2) == "-|") {
       close (fn)
       n++
       fn = "part" n ".txt"
   }
}

bash command:

awk -f awk_pattern input.file
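One possible tweak for the empty last file, sketched here in Python rather than awk: collect each part first and write it out only if it holds something other than blank lines. This assumes the empty file comes from blank line(s) after the final delimiter, and that the input fits in memory; `split_parts` and `write_parts` are hypothetical names.

```python
import os


def split_parts(lines, delimiter="-|"):
    """Group lines into parts; a line starting with the delimiter ends a part."""
    parts, current = [], []
    for line in lines:
        current.append(line)
        if line.startswith(delimiter):
            parts.append(current)
            current = []
    # Keep leftover lines only if they hold something other than whitespace,
    # so a trailing blank line does not become an empty final part.
    if any(line.strip() for line in current):
        parts.append(current)
    return parts


def write_parts(lines, out_dir="."):
    """Write part N to <out_dir>/partN.txt and return the file names."""
    names = []
    for n, part in enumerate(split_parts(lines), start=1):
        name = f"part{n}.txt"
        with open(os.path.join(out_dir, name), "w") as out:
            out.writelines(part)
        names.append(name)
    return names
```

Calling `write_parts(open("input.file"))` would then produce only as many partN.txt files as there are non-empty sections.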

Answer 7

Here is a Python 3 script that splits a file into multiple files, taking each output filename from the delimiter line. Example input file:

# Ignored

######## FILTER BEGIN foo.conf
This goes in foo.conf.
######## FILTER END

# Ignored

######## FILTER BEGIN bar.conf
This goes in bar.conf.
######## FILTER END

Here is the script:

#!/usr/bin/env python3

import os
import argparse

# global settings
start_delimiter = '######## FILTER BEGIN'
end_delimiter = '######## FILTER END'

# parse command line arguments
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--input-file", required=True, help="input filename")
parser.add_argument("-o", "--output-dir", required=True, help="output directory")

args = parser.parse_args()

# read the input file
with open(args.input_file, 'r') as input_file:
    input_data = input_file.read()

# iterate through the input data by line
input_lines = input_data.splitlines()
while input_lines:
    # discard lines until the next start delimiter
    while input_lines and not input_lines[0].startswith(start_delimiter):
        input_lines.pop(0)

    # corner case: no delimiter found and no more lines left
    if not input_lines:
        break

    # extract the output filename from the start delimiter
    output_filename = input_lines.pop(0).replace(start_delimiter, "").strip()
    output_path = os.path.join(args.output_dir, output_filename)

    # open the output file
    print("extracting file: {0}".format(output_path))
    with open(output_path, 'w') as output_file:
        # while we have lines left and they don't match the end delimiter
        while input_lines and not input_lines[0].startswith(end_delimiter):
            output_file.write("{0}\n".format(input_lines.pop(0)))

        # remove end delimiter if present
        if input_lines:
            input_lines.pop(0)

And finally, here is how you run it:

$ python3 script.py -i input-file.txt -o ./output-folder/

Answer 8

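Use csplit if you have it.

If you don't, but you have Python... don't use Perl.

Lazy reading of the file

Your file may be too large to hold in memory all at once, so reading it line by line may be preferable. Assume the input file is named "samplein":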

$ python3 -c "from itertools import count
with open('samplein') as file:
    for i in count():
        firstline = next(file, None)
        if firstline is None:
            break
        with open(f'out{i}', 'w') as out:
            out.write(firstline)
            for line in file:
                out.write(line)
                if line == '-|\n':
                    break"

Answer 9

cat file | ( I=0; : > file0; while IFS= read -r line; do printf '%s\n' "$line" >> "file$I"; if [ "$line" = '-|' ]; then I=$((I+1)); : > "file$I"; fi; done )

And the formatted version:

#!/bin/bash
cat file | (
  I=0
  # start with an empty first output file
  : > file0
  # IFS= and -r keep leading whitespace and backslashes intact
  while IFS= read -r line
  do
    printf '%s\n' "$line" >> "file$I"
    if [ "$line" = '-|' ]
    then
      I=$((I+1))
      : > "file$I"
    fi
  done
)

Answer 10

Here is some perl code that will do it:

#!/usr/bin/perl
open(FI,"file.txt") or die "Input file not found";
$cur=0;
open(FO,">res.$cur.txt") or die "Cannot open output file $cur";
while(<FI>)
{
    print FO $_;
    if(/^-\|/)
    {
        close(FO);
        $cur++;
        open(FO,">res.$cur.txt") or die "Cannot open output file $cur";
    }
}
close(FO);

Answer 11

Here is the context-split program I wrote for this kind of problem: http://stromberg.dnsalias.org/~strombrg/context-split.html

$ ./context-split -h
usage:
./context-split [-s separator] [-n name] [-z length]
        -s specifies what regex should separate output files
        -n specifies how output files are named (default: numeric
        -z specifies how long numbered filenames (if any) should be
        -i include line containing separator in output files
        operations are always performed on stdin
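To illustrate the options listed above, here is a rough Python sketch of splitting on a separator regex, optionally keeping the separator line, and building zero-padded numeric names. It is not the actual context-split implementation: `context_split` and `numbered_name` are hypothetical names, and placing the separator line at the end of the chunk it closes is an assumption chosen to match the expected output in the question.

```python
import re


def context_split(lines, separator, include_separator=False):
    """Split lines into chunks at every line that matches the separator regex."""
    pattern = re.compile(separator)
    chunks, current = [], []
    for line in lines:
        if pattern.search(line):
            if include_separator:
                current.append(line)  # assumption: separator closes its chunk
            chunks.append(current)
            current = []
        else:
            current.append(line)
    if current:
        chunks.append(current)  # lines after the last separator
    return chunks


def numbered_name(index, length=3):
    """Zero-padded numeric file name, e.g. 7 -> '007' when length is 3."""
    return str(index).zfill(length)
```

With the question's input, `context_split(lines, r"^-\|$", include_separator=True)` would yield three chunks, each ending with its -| line.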

11 answers

Answer      Score  Posted
Answer 1    103    2012-07-03 16:07:14
Answer 2    42     2012-07-03 16:04:39
Answer 3    7      2012-07-03 15:42:42
Answer 4    5      2012-12-01 00:27:02
Answer 5    4      2017-02-07 19:40:56
Answer 6    2      2012-07-03 16:00:01
Answer 7    2      2017-02-19 19:33:57
Answer 8    2      2017-10-24 20:10:56
Answer 9    1      2012-07-03 15:49:01
Answer 10   0      2012-07-03 16:00:50
Answer 11   0      2012-07-03 17:17:59
