Break down text file in bash

I have a text file in the following format:

variableStep chrom=chr1 span=10
10161   1
10171   1
10181   2
10191   2
10201   2
10211   2
10221   2
10231   2
10241   2
10251   1
variableStep chrom=chr10 span=10
70711   1
70721   2
70731   2
70741   2
70751   2
70761   2
70771   2
70781   2
70791   1
71161   1
71171   1
71181   1
variableStep chrom=chr11 span=10
104731  1
104741  1
104751  1
104761  1
104771  1
104781  1
104791  1
104801  1
128711  1
128721  1
128731  1

I need a way to break this down into several files named, for example, "chr1.txt", "chr10.txt" and "chr11.txt". How would I go about doing this?

I know about the following approach:

cat file.txt | \
while IFS=$'\t' read  -r -a rowArray; do
    echo -e "${rowArray[0]}\t${rowArray[1]}\t${rowArray[2]}"
done > $file.mod.txt

That reads line by line and then saves line by line. However, I need something a little more elaborate that spans rows: "chr1.txt" would include everything from row 10161 1 to row 10251 1, "chr10.txt" would include everything from row 70711 1 to row 71181 1, and so on. It also has to read the actual chr# from each header line and use it as the file name.

Any help is really appreciated.

awk -F'[ =]' '
  $1 == "variableStep" {file = $3 ".txt"; next}
  file != "" {print > file}' < input.txt

This worked for me:

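curfile=""
# Read the whole file into an array, one element per line (needs bash 4+).
mapfile -t content < file.txt
for ((idx = 0; idx < ${#content[@]}; idx++)); do
    # Header lines carry the chromosome name right after "chrom=".
    if [[ ${content[idx]} =~ chrom=([^[:space:]]+) ]]; then
        curfile="${BASH_REMATCH[1]}.txt"
        rm -f -- "$curfile"    # start each output file empty
    elif [[ -n $curfile ]]; then
        # Data line: append it to the file for the current chromosome.
        printf '%s\n' "${content[idx]}" >> "$curfile"
    fi
done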

Awk is appropriate for this problem domain because the text file is already (more or less) organized into columns. Here's what I would use:

awk 'NF == 3 && index($2, "=") { filename = substr($2, index($2, "=") + 1) }
     NF == 2 && filename { print $0 > (filename ".txt") }' < input.txt

Explanation:

Think of the lines starting with variableStep as "three columns" and the other lines as "two columns". The above script says: "Parse the text file line by line; if a line has three columns and the second column contains an '=' character, assign all of the characters in the second column that occur after the '=' character to a variable called filename. If a line has two columns and the filename variable has been assigned, write the entire line to the file whose name is constructed by concatenating the string in the filename variable with '.txt'."
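To see the column counts that the script relies on, the following sketch prints NF for one header line and one data line taken from the sample input. It assumes awk's default whitespace field splitting, which is what the script above uses; the nf_report variable name is made up for illustration.

```shell
# Show the field count awk sees for a header line and for a data line.
nf_report=$(printf '%s\n' 'variableStep chrom=chr1 span=10' '10161   1' |
    awk '{ print NF ": " $0 }')
printf '%s\n' "$nf_report"
```

The header line is reported with 3 fields and the data line with 2, which is the distinction the two rules test for.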

Notes:

  • NF is a built-in variable in Awk that represents the "number of fields", where a "field" (in this case) can be thought of as a column of data.
  • $0 and $2 are built-in variables that represent the entire line and the second column of data, respectively. ($1 represents the first column, $3 represents the third column, etc.)
  • substr and index are built-in functions described here: http://www.gnu.org/software/gawk/manual/gawk.html#String-Functions
  • The redirection operator (>) acts differently in Awk than it does in a shell script; subsequent writes to the same file are appended.
  • String concatenation is performed simply by writing expressions next to each other. The parentheses ensure the concatenation happens before the file gets written to.

More details can be found here: http://www.gnu.org/software/gawk/manual/gawk.html#Two-Rules
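One caveat about the redirection behaviour noted above: awk keeps each output file open until the program exits or close() is called, and some awk implementations limit how many files can be open at once. The sketch below is a variant of the script that closes the previous file whenever a new header appears. It assumes each chromosome occurs in a single contiguous block, because reopening a closed file with > would truncate it; the sample.txt name, the workdir variable and the shortened input are made up for illustration.

```shell
# Work in a scratch directory with a shortened copy of the sample input.
workdir=$(mktemp -d) && cd "$workdir"
cat > sample.txt <<'EOF'
variableStep chrom=chr1 span=10
10161   1
10171   1
variableStep chrom=chr10 span=10
70711   1
EOF

awk 'NF == 3 && index($2, "=") {
         if (filename) close(filename ".txt")   # release the previous output file
         filename = substr($2, index($2, "=") + 1)
     }
     NF == 2 && filename { print $0 > (filename ".txt") }' < sample.txt

cat chr1.txt     # the two chr1 rows
cat chr10.txt    # the single chr10 row
```

With only a handful of chromosomes the original script is fine as it stands; closing files matters mainly when the input names many different chromosomes.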

I used sed to split the file.

Code:

Kaizen ~/so_test $ cat zsplit.sh

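cntr=1
prev=1
# Line numbers of every "variableStep" header (1, 12 and 25 for the sample input).
for curr in $(grep -n variableStep ztmpfile2.txt | cut -d: -f1)
do
    # Write lines prev..curr-1 to the next chunk. On the first pass prev equals
    # curr, so zchap1.txt receives only the first header line.
    sed -n "${prev},$(( curr - 1 ))p" ztmpfile2.txt > "zchap$cntr.txt"
    #echo "displaying : : zchap$cntr.txt "
    #cat zchap$cntr.txt
    prev=$curr; cntr=$(( cntr + 1 ))
done

# The last chunk runs from the final header to the end of the file.
sed -n "${prev},\$p" ztmpfile2.txt > "zchap$cntr.txt"
#echo "displaying : : zchap$cntr.txt "
#cat zchap$cntr.txt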

Output:

Kaizen ~/so_test $  ./zsplit.sh
+ ./zsplit.sh
zchap1.txt :: 1 :: 1
displaying : : zchap1.txt
variableStep chrom=chr1 span=10
zchap2.txt :: 1 :: 12
displaying : : zchap2.txt
variableStep chrom=chr1 span=10
10161   1
10171   1
10181   2
10191   2
10201   2
10211   2
10221   2
10231   2
10241   2
10251   1
zchap3.txt :: 12 :: 25
displaying : : zchap3.txt
 variableStep chrom=chr10 span=10
70711   1
70721   2
70731   2
70741   2
70751   2
70761   2
70771   2
70781   2
70791   1
71161   1
71171   1
71181   1
displaying : : zchap4.txt
variableStep chrom=chr11 span=10
104731  1
104741  1
104751  1
104761  1
104771  1
104781  1
104791  1
104801  1
128711  1
128721  1
128731  1

If you want, you can remove the header line (for example, variableStep chrom=chr11 span=10) from the resulting zchap* files with sed: sed -i '/variableStep/d' zchap*
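Because the question asks for files named after the chromosome, the chunks could also be renamed from their header lines before those lines are dropped. This is a minimal sketch, assuming every zchap file starts with its variableStep header as in the output above; the workdir, chunk and chrom names and the shortened stand-in files are made up for illustration.

```shell
workdir=$(mktemp -d) && cd "$workdir"
# Shortened stand-ins for two of the files produced by zsplit.sh.
printf '%s\n' 'variableStep chrom=chr1 span=10' '10161   1' '10171   1' > zchap2.txt
printf '%s\n' 'variableStep chrom=chr10 span=10' '70711   1' > zchap3.txt

for chunk in zchap*.txt; do
    # Pull the name after "chrom=" out of the chunk's first line.
    chrom=$(sed -n '1s/.*chrom=\([^ ]*\).*/\1/p' "$chunk")
    [ -n "$chrom" ] || continue    # skip chunks that do not start with a header
    # Copy everything except header lines into the chromosome-named file.
    sed '/variableStep/d' "$chunk" > "$chrom.txt"
done
```

Here the header is filtered out while copying, so the separate sed -i step is not needed for the renamed files.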

Does this help?
