Split text file into parts based on a pattern taken from the text file

I have many text files of fixed-width data, e.g.:

$ head model-q-060.txt 
% x                      y                        
15.0                     0.0                      
15.026087                -1.0                     
15.052174                -2.0                     
15.07826                 -3.0                     
15.104348                -4.0                     
15.130435                -5.0                     
15.156522                -6.0                     
15.182609                -6.9999995               
15.208695                -8.0  

The data comprise 3 or 4 runs of a simulation, all stored in the one text file, with no separator between runs. In other words, there is no empty line or anything. For example, if there were only 3 'records' per run, 3 runs would look like this:

$ head model-q-060.txt 
% x                      y                        
15.0                     0.0                      
15.026087                -1.0                     
15.052174                -2.0                     
15.0                     0.0                      
15.038486                -1.0                     
15.066712                -2.0                     
15.0                     0.0                      
15.041089                -1.0                     
15.087612                -2.0                     

It's a COMSOL Multiphysics output file, for those interested. Visually you can tell where the data for a new run begin, because the first x-value is repeated (in fact the entire second line might be the same for every run). So I need to first open the file and read this x-value, save it, then use it as a pattern to match with awk or csplit. I am struggling to work this out!

csplit will do the job:

$ csplit -z -f 'temp' -b '%02d.txt' model-q-060.txt '/^15\.0\s/' '{*}'

but I have to know the pattern to split on. This question is similar, but each of my text files might have a different pattern to match: Split files based on file content and pattern matching.

Ben.

Here's a simple awk script that will do what you want:

BEGIN { fn=0 }
NR==1 { next }
NR==2 { delim=$1 }
$1 == delim {
    f=sprintf("test%02d.txt",fn++);
    print "Creating " f
}

{ print $0 > f }

  1. initialize the output file number
  2. ignore the first line (the header)
  3. extract the delimiter from the first field of the second line
  4. for every input line whose first token matches the delimiter, set up the next output file name
  5. for all remaining lines, write to the current output file
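
To try the script, it can be saved to a file and run with awk -f. The following is only a sketch: the names split_runs.awk and sample.txt are placeholders chosen for illustration (they do not appear in the original post), and the input is the three-run example from the question.

```shell
# Work in a scratch directory so no existing files are overwritten
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Save the awk script from above (split_runs.awk is a hypothetical name)
cat > split_runs.awk <<'EOF'
BEGIN { fn=0 }
NR==1 { next }
NR==2 { delim=$1 }
$1 == delim {
    f=sprintf("test%02d.txt",fn++);
    print "Creating " f
}

{ print $0 > f }
EOF

# Sample input: one header line plus three runs of three records each
cat > sample.txt <<'EOF'
% x                      y
15.0                     0.0
15.026087                -1.0
15.052174                -2.0
15.0                     0.0
15.038486                -1.0
15.066712                -2.0
15.0                     0.0
15.041089                -1.0
15.087612                -2.0
EOF

# Run the script, then show how many lines landed in each output file
awk -f split_runs.awk sample.txt
wc -l test*.txt
```

Note that the header line is skipped, so it is not copied into any of the test*.txt files.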

This should do the job. Test it somewhere you don't have a lot of temp*.txt files, because the script deletes them first: :)

rm -f temp*.txt

cat > f1.txt <<EOF
% x                      y                        
15.0                     0.0                      
15.026087                -1.0                     
15.052174                -2.0                     
15.0                     0.0                      
15.038486                -1.0                     
15.066712                -2.0                     
15.0                     0.0                      
15.041089                -1.0                     
15.087612                -2.0    
EOF

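# Take the first x-value from line 2 and escape its dots for use in a regex
first=$(awk 'NR==2{print $1}' f1.txt | sed 's/\./\\./g')
echo "--- Splitting by: $first"

# Start a new output file at every line that begins with that value
csplit -z -f temp -b %02d.txt f1.txt "/^${first}\\s/" '{*}'

for i in temp*.txt; do
  echo "---- $i"
  cat "$i"
done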

The output of the above is:

--- Splitting by: 15\.0
51
153
153
136
---- temp00.txt
% x                      y                        
---- temp01.txt
15.0                     0.0                      
15.026087                -1.0                     
15.052174                -2.0                     
---- temp02.txt
15.0                     0.0                      
15.038486                -1.0                     
15.066712                -2.0                     
---- temp03.txt
15.0                     0.0                      
15.041089                -1.0                     
15.087612                -2.0    

Of course, you will run into trouble if the first x-value (15.0 in the above example) also appears at the start of a line inside a run, since that line would be treated as the start of a new run. Solving that would be a tad harder - exercise left for the reader...
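
One way to make a false split less likely is to compare the entire line instead of only the x-value. This assumes the whole first data line is identical at the start of every run, which the question only says might be the case. The sketch below is not a general solution: it still splits wrongly if that exact line recurs inside a run. The file names sample.txt and run%02d.txt are placeholders.

```shell
# Work in a scratch directory so no existing files are overwritten
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample where the first x-value (15.0) recurs inside the second run
cat > sample.txt <<'EOF'
% x  y
15.0  0.0
15.026087  -1.0
15.0  0.0
15.0  -1.0
15.066712  -2.0
EOF

awk '
NR == 1 { next }                                    # skip the header line
NR == 2 { start = $0 }                              # remember the whole first data line
$0 == start { out = sprintf("run%02d.txt", n++) }   # identical line: a new run begins
{ print > out }                                     # write to the current run file
' sample.txt

wc -l run*.txt
```

Matching on the first field alone would cut this sample into three files; matching on the whole line yields two, because the line "15.0  -1.0" differs from the remembered start line.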

If the number of lines per run is constant, you could use this:

# number_of_runs must be set to the number of runs in the file (3 or 4 here)
grep -P '^\d' your_file.txt | \
   split --lines=$(( ( $(wc -l < your_file.txt) - 1 ) / number_of_runs ))
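
As a worked example of the arithmetic, assume the file has exactly one header line and every run has the same number of records. A 10-line file holding 3 runs then gives (10 - 1) / 3 = 3 lines per piece. The sketch below relies on split's default output names (xaa, xab, ...). The file name sample.txt and the variables total and per_run are placeholders, and grep '^[0-9]' stands in for grep -P '^\d' for portability.

```shell
# Work in a scratch directory so no existing files are overwritten
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input: one header line plus three runs of three records each
cat > sample.txt <<'EOF'
% x  y
15.0  0.0
15.026087  -1.0
15.052174  -2.0
15.0  0.0
15.038486  -1.0
15.066712  -2.0
15.0  0.0
15.041089  -1.0
15.087612  -2.0
EOF

number_of_runs=3
total=$(wc -l < sample.txt)                    # 10 lines including the header
per_run=$(( (total - 1) / number_of_runs ))    # (10 - 1) / 3 = 3

# Drop the header (the only line not starting with a digit), then split
grep '^[0-9]' sample.txt | split -l "$per_run"
ls x??
```

If the line count minus the header is not an exact multiple of the number of runs, the integer division rounds down and split produces an extra, shorter piece, which is a sign that the constant-length assumption does not hold.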
