grep/sed/awk: parse a text file to print multiple lines after a pattern match and convert them to a single line

Question

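I want to use grep/awk/sed to parse a text file that contains several kinds of description for many genes. I would like each line of the result to represent one gene description.

Specifically, I want to extract the automated and concise descriptions into separate txt files, one file per description type, with each line holding the description of a single gene.

Download the file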

wget https://downloads.wormbase.org/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz

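With the code below I have been able to extract the text I need into separate text files. However, I cannot get each description in the output onto a single line.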

awk '/Concise description:/{flag=1} flag; /Automated description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Automated description" > WB283_concise.txt

#do this for the next section automated description

awk '/Automated description:/{flag=1} flag; /Gene class description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Gene class description" > WB283_automated.txt

#I can also use sed
sed -ne '/Concise description:/,$ p' WB283_concise.txt > concise.txt

Can anyone help?

Current text structure for one gene description

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan 
and dauer development, and likely functions as the sole adaptor subunit for the 
AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates 
insulin-like signaling, it is not absolutely required for insulin-like signaling 
under most conditions.

Desired text structure for one gene description

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan and dauer development, and likely functions as the sole adaptor subunit for the AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates insulin-like signaling, it is not absolutely required for insulin-like signaling under most conditions.

Thank you, Jose.

Answer 1

A few small changes to OP's current awk code:

awk '
/Concise description:/  { flag=1; pfx="" }
/Automated description/ { flag=0; print "" }                # close out current printf line of output
flag                    { printf "%s%s",pfx,$0; pfx=" " }   # assuming appended lines are separated by a single space
' file

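NOTE: I'm not sure I understand OP's current use of grep -v, since we don't have a set of sample input that demonstrates the need for grep -v ...?

For the small sample provided, this generates: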

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide  3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan  and dauer development, and likely functions as the sole adaptor subunit for the  AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates  insulin-like signaling, it is not absolutely required for insulin-like signaling  under most conditions.
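The output above contains double spaces wherever a sample input line ends in a trailing space. A minimal sketch of one way to avoid that, assuming it is acceptable to trim leading and trailing blanks from each line before joining (join_concise is a hypothetical helper name, and the "Automated description: (placeholder)" line in the demo is made-up input):

```shell
# Sketch: join each "Concise description" block into one line,
# trimming blanks at both ends of every input line first.
# join_concise is a hypothetical helper; it reads from stdin.
join_concise() {
  awk '
    /Concise description:/  { flag = 1; pfx = "" }
    /Automated description/ { if (flag) print ""; flag = 0 }   # end the joined line
    flag {
      line = $0
      gsub(/^[ \t]+|[ \t]+$/, "", line)    # drop leading/trailing blanks
      if (line != "") {                    # skip empty lines inside a block
        printf "%s%s", pfx, line
        pfx = " "
      }
    }
    END { if (flag) print "" }             # terminate a block left open at EOF
  '
}

# Demo on a short excerpt of the sample (the first line ends in a space)
printf '%s\n' \
  'Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide ' \
  '3-kinase (PI3K) p50/p55 adaptor/regulatory subunit.' \
  'Automated description: (placeholder)' | join_concise
```

For the real file the function could be called as join_concise < file > WB283_concise.txt.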

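Assumptions:

OP needs to parse the input file twice (once for each of the two different blocks of text)
the two different blocks of text do not overlap
the input file may contain multiple Concise or Automated blocks of text, and all of them are routed to one of the two output files

We can merge OP's current two awk scripts into one, e.g.: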

awk '
function close_line()    { if (outfile) print "" > outfile }      # close out prior printf line of output?

/Concise description:/   { close_line()
                           outfile="WB283_concise.txt"
                           pfx=""
                         }
/Automated description:/ { close_line()
                           outfile="WB283_automated.txt"
                           pfx=""
                         }
/Gene class description/ { close_line()
                           outfile=""
                         }
outfile                  { printf "%s%s", pfx, $0 > outfile
                           pfx=" "
                         }
END                      { close_line() }
' file
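One way to sanity-check the result is to compare the number of header lines in the input with the number of lines in the corresponding output file. This is only a sketch: check_one_line_per_block is a hypothetical helper, and it assumes the header text never appears inside a description body.

```shell
# check_one_line_per_block is a hypothetical helper:
#   $1 = original input file
#   $2 = header pattern, e.g. 'Concise description:'
#   $3 = joined output file
check_one_line_per_block() {
  blocks=$(grep -c -- "$2" "$1")            # lines containing the header
  lines=$(awk 'END { print NR }' "$3")      # lines in the joined output
  if [ "$blocks" -eq "$lines" ]; then
    echo "OK: $lines lines for $blocks '$2' blocks"
  else
    echo "MISMATCH: $lines lines for $blocks '$2' blocks" >&2
    return 1
  fi
}
```

Example call: check_one_line_per_block file 'Concise description:' WB283_concise.txt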

Answer 2

May I suggest a slightly modified solution (not exactly what was asked for, but it may contain useful ideas):

awk '
/WBGene/              { printf("\n%s: ", $2) }
/Concise description/ { flag = 1; $1=$2="" }
/=/                   { flag = 0 }
/^.* description/     { flag = 0 }
flag                  { printf " %s", $0 }
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt

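The idea is to filter out the string "Concise description", since that is what we are looking for in any case. The gene name is printed in the first column, because many "concise descriptions" do not include the name.

The output format is a single line per gene, starting with its name (plus a colon), followed by the "pure" concise description.

By the way: if you want to create a second output with the "Automated description" on each line, change the second awk line from /Concise description/ to /Automated description/.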
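Instead of editing the pattern by hand, the section name could be passed in as a variable. The following is a sketch of a variant, not the script above verbatim: describe_section is a hypothetical wrapper, it assumes every section header has the form "<words> description:" at the start of a line, that the gene name is the second field of the WBGene line, and that records are separated by a line starting with "=". It also normalizes whitespace, which the original does not.

```shell
# describe_section is a hypothetical wrapper; input is read from stdin.
#   $1 = section name, e.g. Concise or Automated
describe_section() {
  awk -v section="$1" '
    /WBGene/ {                                  # new gene record
      printf "%s%s:", nl, $2                    # assumed: gene name is field 2
      nl = "\n"; flag = 0
    }
    /^[A-Za-z ]+ description:/ {                # any section header
      flag = ($0 ~ ("^" section " description:"))
      sub(/^[A-Za-z ]+ description:[ \t]*/, "") # strip the header text
    }
    /^=/       { flag = 0 }                     # assumed record separator
    flag && NF { $1 = $1; printf " %s", $0 }    # $1=$1 squeezes whitespace
    END        { if (nl) print "" }             # terminate the last line
  '
}
```

Usage would then be, for example, describe_section Automated < c_elegans.PRJNA13758.WS283.functional_descriptions.txt > WB283_automated.txt. Anchoring the header and separator patterns to the start of the line is a design choice that avoids stopping early when a description happens to contain "=" or the word "description".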
