Break down text file in bash
I have a text file in the following format:
variableStep chrom=chr1 span=10
10161 1
10171 1
10181 2
10191 2
10201 2
10211 2
10221 2
10231 2
10241 2
10251 1
variableStep chrom=chr10 span=10
70711 1
70721 2
70731 2
70741 2
70751 2
70761 2
70771 2
70781 2
70791 1
71161 1
71171 1
71181 1
variableStep chrom=chr11 span=10
104731 1
104741 1
104751 1
104761 1
104771 1
104781 1
104791 1
104801 1
128711 1
128721 1
128731 1
I need a way to break this down into several files named, for example, "chr1.txt", "chr10.txt" and "chr11.txt". How would I go about doing this?
I thought about the following approach:
while IFS=$'\t' read -r -a rowArray; do
    # print the first three tab-separated fields of each line
    printf '%s\t%s\t%s\n' "${rowArray[0]}" "${rowArray[1]}" "${rowArray[2]}"
done < file.txt > "$file.mod.txt"
That reads line by line and then saves line by line. However, I need something a little more elaborate that spans rows: "chr1.txt" would include everything from the row 10161 1 to row 10251 1, "chr10.txt" would include everything from the row 70711 1 to row 71181 1, and so on. I also have to read the actual chr# from each header line and use it as the file name.
Any help is really appreciated.
awk -F'[ =]' '
$1 == "variableStep" {file = $3 ".txt"; next}
file != "" {print > file}' < input.txt
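A quick way to try the command above is sketched below. The shortened sample data and the scratch directory are assumptions made for illustration; the awk program is the same one, only with comments added.

```shell
#!/usr/bin/env bash
# Sketch: run the awk splitter on a tiny sample inside a scratch directory.
set -euo pipefail

workdir=$(mktemp -d)    # hypothetical scratch location
cd "$workdir"

# Sample input in the question's format (shortened for illustration).
cat > input.txt <<'EOF'
variableStep chrom=chr1 span=10
10161 1
10171 1
variableStep chrom=chr10 span=10
70711 1
EOF

# -F'[ =]' splits on spaces and '=', so a header line becomes
#   $1=variableStep  $2=chrom  $3=chr1  $4=span  $5=10
awk -F'[ =]' '
    # header line: remember the target file name and skip the line itself
    $1 == "variableStep" { file = $3 ".txt"; next }
    # data line: write it to the current file (awk keeps the file open,
    # so later lines for the same chromosome are added after it)
    file != ""           { print > file }
' < input.txt

cat chr1.txt     # the two chr1 rows
cat chr10.txt    # the single chr10 row
```

Because awk truncates each output file the first time it opens it, re-running the command replaces the previous results instead of appending to them.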
This worked for me:
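curfile=""
mapfile -t content < file.txt    # one array element per line (mapfile needs bash 4+)
for ((idx = 0; idx < ${#content[@]}; idx++)); do
    # header line: capture the chromosome name that follows "chrom="
    if [[ ${content[idx]} =~ chrom=([^[:space:]]+) ]]; then
        curfile="${BASH_REMATCH[1]}.txt"
        rm -f -- "$curfile"      # start each output file from scratch
    elif [[ -n $curfile ]]; then
        printf '%s\n' "${content[idx]}" >> "$curfile"
    fi
done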
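The same idea can also be written as a streaming loop, closer to the while/read attempt in the question, so the whole file never has to be held in an array. This is only a sketch: the function name split_by_chrom, the sample data and the scratch directory are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: split a wiggle-style file by chromosome, reading one line at a time.
set -euo pipefail

# split_by_chrom is a hypothetical helper name, not part of the answer above.
split_by_chrom() {
    local infile=$1 curfile="" line
    # "|| [[ -n $line ]]" also handles a last line without a trailing newline
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ chrom=([^[:space:]]+) ]]; then
            curfile="${BASH_REMATCH[1]}.txt"
            : > "$curfile"                        # create or truncate the output file
        elif [[ -n $curfile ]]; then
            printf '%s\n' "$line" >> "$curfile"   # data line: append to the current file
        fi
    done < "$infile"
}

workdir=$(mktemp -d)    # hypothetical scratch location
cd "$workdir"
printf '%s\n' \
    'variableStep chrom=chr1 span=10' '10161 1' '10171 1' \
    'variableStep chrom=chr10 span=10' '70711 1' > file.txt

split_by_chrom file.txt
cat chr1.txt chr10.txt
```

For large inputs the awk answers will usually be much faster than a bash loop, since the shell reopens the output file for every appended line.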
Awk is appropriate for this problem domain because the text file is already (more or less) organized into columns. Here's what I would use:
awk 'NF == 3 && index($2, "=") { filename = substr($2, index($2, "=") + 1) }
NF == 2 && filename { print $0 > (filename ".txt") }' < input.txt
Explanation:
Think of the lines starting with variableStep as "three columns" and the other lines as "two columns". The above script says: "Parse the text file line by line. If a line has three columns and the second column contains an '=' character, assign all of the characters in the second column that occur after the '=' to a variable called filename. If a line has two columns and the filename variable has been assigned, write the entire line to the file whose name is built by concatenating the string in filename with '.txt'."
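To make the "three columns" versus "two columns" distinction concrete, the sketch below prints the field count awk sees for each line instead of writing files. The three sample lines are taken from the question; the header/data labels are only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: show how awk's NF (number of fields) separates header and data lines.
set -euo pipefail

sample='variableStep chrom=chr1 span=10
10161 1
10171 1'

report=$(printf '%s\n' "$sample" | awk '
    # header: three fields, and the second one contains "="
    NF == 3 && index($2, "=") { print NF, "header ->", substr($2, index($2, "=") + 1); next }
    # data: two fields
    NF == 2                   { print NF, "data ->", $0 }
')
printf '%s\n' "$report"
# prints:
#   3 header -> chr1
#   2 data -> 10161 1
#   2 data -> 10171 1
```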
Notes:
More details can be found here: http://www.gnu.org/software/gawk/manual/gawk.html#Two-Rules
I used sed to split the file at each variableStep line.
Code:
Kaizen ~/so_test $ cat zsplit.sh
cntr=1;
prev=1;
for curr in `cat ztmpfile2.txt | nl | grep variableStep | tr -s " " | cut -d" " -f2 | sed -n 's/variableStep//p'`
do
sed -n "$prev,$(( ${curr} - 1))p" ztmpfile2.txt > zchap$cntr.txt ;
#echo "displaying : : zchap$cntr.txt " ;
#cat zchap$cntr.txt ;
prev=$curr; cntr=$(( $cntr + 1 ));
done
sed -n "$prev,$ p" ztmpfile2.txt > zchap$cntr.txt ;
#echo "displaying : : zchap$cntr.txt " ;
#cat zchap$cntr.txt ;
Output:
Kaizen ~/so_test $ ./zsplit.sh
+ ./zsplit.sh
zchap1.txt :: 1 :: 1
displaying : : zchap1.txt
variableStep chrom=chr1 span=10
zchap2.txt :: 1 :: 12
displaying : : zchap2.txt
variableStep chrom=chr1 span=10
10161 1
10171 1
10181 2
10191 2
10201 2
10211 2
10221 2
10231 2
10241 2
10251 1
zchap3.txt :: 12 :: 25
displaying : : zchap3.txt
variableStep chrom=chr10 span=10
70711 1
70721 2
70731 2
70741 2
70751 2
70761 2
70771 2
70781 2
70791 1
71161 1
71171 1
71181 1
displaying : : zchap4.txt
variableStep chrom=chr11 span=10
104731 1
104741 1
104751 1
104761 1
104771 1
104781 1
104791 1
104801 1
128711 1
128721 1
128731 1
From the resulting zchap* files, if you want, you can remove the header lines (for example, variableStep chrom=chr11 span=10) using sed: sed -i '/variableStep/d' zchap*
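The chunks produced this way are named zchap1.txt, zchap2.txt and so on, whereas the question asks for chr1.txt, chr10.txt, etc. One possible follow-up step, sketched below, reads the header of each chunk and copies its data lines into a file named after the chromosome. The function name rename_chunks and the sample chunks are assumptions for illustration, and the sketch assumes it runs before the header lines are deleted.

```shell
#!/usr/bin/env bash
# Sketch: turn zchapN.txt chunks into files named after their chromosome.
set -euo pipefail

# rename_chunks is a hypothetical helper name.
rename_chunks() {
    local f header="" chrom
    for f in zchap*.txt; do
        [[ -e $f ]] || continue                 # no chunks present: nothing to do
        IFS= read -r header < "$f" || true      # first line of the chunk (may be empty)
        if [[ $header =~ chrom=([^[:space:]]+) ]]; then
            chrom=${BASH_REMATCH[1]}
            # copy everything except the header lines; appending lets several
            # chunks for the same chromosome end up in one file
            sed '/^variableStep/d' "$f" >> "$chrom.txt"
        fi
    done
}

workdir=$(mktemp -d)    # hypothetical scratch location
cd "$workdir"
# Sample chunks shaped like the output above (shortened).
printf '%s\n' 'variableStep chrom=chr1 span=10' > zchap1.txt
printf '%s\n' 'variableStep chrom=chr1 span=10' '10161 1' '10171 1' > zchap2.txt
printf '%s\n' 'variableStep chrom=chr10 span=10' '70711 1' > zchap3.txt

rename_chunks
cat chr1.txt chr10.txt
```

Since the helper appends, any chr*.txt files left over from an earlier run should be removed before running it again.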
Does this help?