grep/sed/awk: parsing a text file to print multiple rows after a pattern matches and convert them to one row
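I want to use grep/awk/sed to parse a text file that contains several kinds of descriptions for many genes. I want each line to represent one gene description.
Now I want to extract the automated and concise descriptions into a single txt file, with each line representing the description of a single gene.
Download the file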
wget https://downloads.wormbase.org/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz
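The download is gzip-compressed, while the commands in this post read the uncompressed .txt, so a decompression step is presumably needed in between. A minimal sketch, assuming gunzip is installed and the .gz file is in the current directory; decompress_descriptions is a helper name introduced here for illustration:

```shell
# Assumption: the .gz file fetched by wget is in the current directory.
decompress_descriptions() {
    # $1 = path to the .gz file; writes the uncompressed copy next to it
    # and leaves the compressed original in place
    gunzip -c "$1" > "${1%.gz}"
}

if [ -f c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz ]; then
    decompress_descriptions c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz
fi
```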
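I have been able to extract the text I need into separate text files using the code below. However, I cannot get the output text onto a single line per description.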
awk '/Concise description:/{flag=1} flag; /Automated description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Automated description" > WB283_concise.txt
#do this for the next section automated description
awk '/Automated description:/{flag=1} flag; /Gene class description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Gene class description" > WB283_automated.txt
#I can also use sed
sed -ne '/Concise description:/,$ p' WB283_concise.txt > concise.txt
Can anyone help?
Current text structure for 1 gene description
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan
and dauer development, and likely functions as the sole adaptor subunit for the
AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates
insulin-like signaling, it is not absolutely required for insulin-like signaling
under most conditions.
Desired text structure for 1 gene description
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan and dauer development, and likely functions as the sole adaptor subunit for the AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates insulin-like signaling, it is not absolutely required for insulin-like signaling under most conditions.
Thank you, Jose.
A few small changes to OP's current awk code:
awk '
/Concise description:/ { flag=1; pfx="" }
/Automated description/ { flag=0; print "" } # close out current printf line of output
flag { printf "%s%s",pfx,$0; pfx=" " } # assuming appended lines are separated by a single space
' file
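NOTE: I'm not sure I understand OP's current use of grep -v, since we don't have a set of sample input that demonstrates the need for grep -v ...?
For the small sample provided, this generates: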
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan and dauer development, and likely functions as the sole adaptor subunit for the AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates insulin-like signaling, it is not absolutely required for insulin-like signaling under most conditions.
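The snippet above can be tried on a small hand-made file before running it on the full download. A minimal sketch, assuming a hypothetical sample.txt: the first line comes from the question, the rest is shortened placeholder text rather than real WormBase data, and join_concise is a name introduced here for illustration.

```shell
# Hypothetical sample: one Concise block (two lines) followed by an Automated block.
cat > sample.txt <<'EOF'
Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit (shortened for the example).
Automated description: placeholder text, not real WormBase data.
EOF

join_concise() {
    # $1 = input file; prints each Concise block as a single line
    awk '
    /Concise description:/  { flag=1; pfx="" }                  # start of a block
    /Automated description/ { flag=0; print "" }                # end the joined line
    flag                    { printf "%s%s", pfx, $0; pfx=" " } # join with one space
    ' "$1"
}

join_concise sample.txt
```

Piping the result through wc -l is a quick way to confirm that the two input lines of the block came out as one.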
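Assuming that the only blocks of interest are the Concise and Automated text blocks, and that all input is routed to one of the two output files, we can merge OP's current 2x awk scripts into one, e.g.: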
awk '
function close_line() { if (outfile) print "" > outfile } # close out prior printf line of output?
/Concise description:/ { close_line()
outfile="WB283_concise.txt"
pfx=""
}
/Automated description:/ { close_line()
outfile="WB283_automated.txt"
pfx=""
}
/Gene class description/ { close_line()
outfile=""
}
outfile { printf "%s%s", pfx, $0 > outfile
pfx=" "
}
END { close_line() }
' file
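One way to sanity-check the two output files is to compare the number of block headers in the input with the number of lines in each output file. A rough sketch, assuming every block starts with its header text and that this text never appears inside a description; check_one_line_per_gene is a hypothetical helper, not part of the answer:

```shell
check_one_line_per_gene() {
    # $1 = header pattern, $2 = input file, $3 = output file
    # Succeeds when the output has exactly one line per header found in the input.
    chk_expected=$(grep -c "$1" "$2")
    chk_actual=$(wc -l < "$3" | tr -d ' ')
    [ "$chk_expected" -eq "$chk_actual" ]
}

# Example usage (file names as in the script above):
#   check_one_line_per_gene 'Concise description:' file WB283_concise.txt && echo "concise ok"
#   check_one_line_per_gene 'Automated description:' file WB283_automated.txt && echo "automated ok"
```

A mismatch would suggest that some block was split across lines or that a header was missed.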
May I suggest a slightly modified solution (not exactly what was asked for, but it may contain useful ideas):
awk '
/WBGene/ { printf("\n%s: ", $2) }
/Concise description/ { flag = 1; $1=$2="" }
/=/ { flag = 0 }
/^.* description/ { flag = 0 }
flag { printf " %s", $0 }
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt
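The idea is to filter out the string "Concise description", since that is what we are looking for in any case. The gene's name is printed in the first column because many "concise descriptions" do not include the name.
The output format is a single line per gene, starting with its name (+ colon), followed by the "pure" concise description.
By the way: if you want to create a second output with the "automated description" on each line, change the second awk line from /Concise description/ to /Automated description/.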
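Instead of editing the pattern by hand, the description type could be passed in as an awk variable. This is a variation on the idea above rather than the answer's own code. It assumes each gene starts with a line beginning with WBGene whose second field is the gene name, that each block ends at the next "... description:" header or at a line starting with "=", and that no continuation line itself looks like such a header. The sample file and the name describe are hypothetical.

```shell
# Hypothetical sample in the assumed layout (placeholder text, not real WormBase data).
cat > sample_genes.txt <<'EOF'
WBGene_EXAMPLE_1 geneA SEQ1
Concise description: first line
second line
Automated description: auto text
Gene class description: class text
=
WBGene_EXAMPLE_2 geneB SEQ2
Concise description: other
Automated description: auto two
=
EOF

describe() {
    # $1 = description type (e.g. Concise or Automated), $2 = input file
    awk -v kind="$1" '
    /^WBGene/ { if (started) print ""      # finish the previous gene line
                printf "%s:", $2           # gene name + colon
                started=1; flag=0; next }
    /^=/      { flag=0; next }             # record separator
    /^[A-Za-z ]+ description:/ {
                flag = (index($0, kind " description:") == 1)
                if (!flag) next            # some other description type
                sub(/^[^:]*: */, "")       # drop the header text itself
              }
    flag      { printf " %s", $0 }
    END       { if (started) print "" }    # terminate the last line
    ' "$2"
}

describe Concise sample_genes.txt
describe Automated sample_genes.txt
```

Each call prints one line per gene, so redirecting the two calls to separate files gives one file per description type.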