
grep/sed/awk: parse a text file to print multiple rows after a pattern matches and convert them to one row

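I would like to use grep/awk/sed to parse a text file that contains various descriptions for multiple genes. I want each row to represent one gene description.

Now I would like to extract the automated and concise descriptions into individual txt files, with each row representing the description of a single gene.

Download the file: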

wget https://downloads.wormbase.org/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS283.functional_descriptions.txt.gz

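I have been able to extract the desired text into separate text files using the code below. However, I cannot get each description in the output onto a single row.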

awk '/Concise description:/{flag=1} flag; /Automated description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Automated description" > WB283_concise.txt

# do the same for the next section, the automated description

awk '/Automated description:/{flag=1} flag; /Gene class description/{flag=0}' c_elegans.PRJNA13758.WS283.functional_descriptions.txt | grep -v "Gene class description" > WB283_automated.txt

#I can also use sed
sed -ne '/Concise description:/,$ p' WB283_concise.txt > concise.txt


Can anyone help?

Current text structure for 1 gene description:

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 
3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan 
and dauer development, and likely functions as the sole adaptor subunit for the 
AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates 
insulin-like signaling, it is not absolutely required for insulin-like signaling 
under most conditions. 

Desired text structure for 1 gene description:

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide 3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan and dauer development, and likely functions as the sole adaptor subunit for the AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates insulin-like signaling, it is not absolutely required for insulin-like signaling under most conditions. 

Thank you, Jose.

A few small changes to OP's current awk code:

awk '
/Concise description:/  { flag=1; pfx="" }
/Automated description/ { flag=0; print "" }                # close out current printf line of output
flag                    { printf "%s%s",pfx,$0; pfx=" " }   # assuming appended lines are separated by a single space
' file

Note: I am not sure I understand OP's current use of grep -v, since we do not have a set of sample input that demonstrates the need for grep -v ...?
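One possible explanation for the grep -v: in OP's one-liner the bare `flag;` rule runs before the rule that resets the flag, so the terminating "Automated description" line is still printed and then has to be filtered out. A minimal sketch, assuming that is the only reason for the filter: testing the end pattern first makes the extra grep unnecessary. The three-line sample is made up for illustration.

```shell
# Hypothetical sample: one concise line, its continuation, then the next section.
sample='Concise description: first part
second part
Automated description: something else'

# Reset the flag BEFORE the print rule, so the end-marker line is never printed.
concise=$(printf '%s\n' "$sample" |
  awk '/Automated description/ {flag=0} /Concise description:/ {flag=1} flag')

printf '%s\n' "$concise"
```

The modified script above orders its rules the same way, which is why it does not need grep -v either.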

For the small sample provided, the modified awk code generates:

Concise description: aap-1 encodes the C. elegans ortholog of the phosphoinositide  3-kinase (PI3K) p50/p55 adaptor/regulatory subunit; AAP-1 negatively regulates lifespan  and dauer development, and likely functions as the sole adaptor subunit for the  AGE-1/p110 PI3K catalytic subunit to which it binds in vitro; although AAP-1 potentiates  insulin-like signaling, it is not absolutely required for insulin-like signaling  under most conditions.
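The doubled spaces in this output come from the trailing space at the end of each wrapped input line plus the single-space prefix. A possible refinement, assuming trailing whitespace carries no meaning in this file, is to strip it before appending. `join_concise` is a hypothetical helper name and the sample input is made up:

```shell
# join_concise is a hypothetical helper; it reads the descriptions file on stdin.
join_concise() {
  awk '
  /Concise description:/  { flag=1; pfx="" }
  /Automated description/ { if (flag) print ""; flag=0 }   # end the joined line only if one is open
  flag                    { sub(/[ \t]+$/, "")              # drop trailing blanks from the input line
                            printf "%s%s", pfx, $0; pfx=" " }
  END                     { if (flag) print "" }            # terminate the last line if no end marker followed
  '
}

# Made-up sample with trailing spaces, mimicking the wrapped layout of the real file.
printf 'Concise description: aap-1 encodes \nthe ortholog \nAutomated description: x\n' | join_concise
```

For the real file this would be called as `join_concise < c_elegans.PRJNA13758.WS283.functional_descriptions.txt > WB283_concise.txt`.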

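Assumptions:

  • OP needs to parse the input file twice (for 2 different blocks of text)
  • the 2 different blocks of text do not overlap
  • there may be multiple Concise/Automated blocks of text in the input file, with all output being routed to one of two output files

We can merge OP's current 2x awk scripts into one, e.g.: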

awk '
function close_line()    { if (outfile) print "" > outfile }      # close out prior printf line of output?

/Concise description:/   { close_line()
                           outfile="WB283_concise.txt"
                           pfx=""
                         }
/Automated description:/ { close_line()
                           outfile="WB283_automated.txt"
                           pfx=""
                         }
/Gene class description/ { close_line()
                           outfile=""
                         }
outfile                  { printf "%s%s", pfx, $0 > outfile
                           pfx=" "
                         }
END                      { close_line() }
' file
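One way to sanity-check the merged script, sketched here on a made-up two-gene input rather than the real WormBase file: each output file should end up with exactly one line per matching header in the input. The awk body is the merged script in compact form; the temporary directory and the sample records are assumptions for illustration.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Made-up input: two records, each with a concise and an automated block.
cat > sample.txt <<'EOF'
Concise description: gene one does
something useful.
Automated description: Predicted to enable
binding activity.
Gene class description: ONE
Concise description: gene two.
Automated description: Located in
nucleus.
Gene class description: TWO
EOF

awk '
function close_line()    { if (outfile) print "" > outfile }
/Concise description:/   { close_line(); outfile="WB283_concise.txt";   pfx="" }
/Automated description:/ { close_line(); outfile="WB283_automated.txt"; pfx="" }
/Gene class description/ { close_line(); outfile="" }
outfile                  { printf "%s%s", pfx, $0 > outfile; pfx=" " }
END                      { close_line() }
' sample.txt

# Compare the number of headers in the input with the number of lines written.
headers=$(grep -c 'Concise description:' sample.txt)
lines=$(wc -l < WB283_concise.txt | tr -d ' ')
echo "concise headers: $headers, output lines: $lines"
```

If the two numbers differ on the real file, some block was probably opened or closed by a pattern the script does not anticipate.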

May I suggest a slightly modified solution (not exactly what was asked for, but it may contain useful ideas):

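awk '
/WBGene/              { printf("\n%s: ", $2) }    # gene ID line: start a new output line with the gene name
/Concise description/ { flag = 1; $1=$2="" }      # start of the wanted block; blank the two header words (rebuilds $0)
/=/                   { flag = 0 }                # a line containing = (used here as the record separator) ends the block
/^.* description/     { flag = 0 }                # any other "... description" header ends the block
flag                  { printf " %s", $0 }        # append the current line to the open output line
END                   { printf "\n" }             # terminate the last output line
' c_elegans.PRJNA13758.WS283.functional_descriptions.txt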

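The idea is to filter out the string "Concise description", since that is what we are looking for in any case. The name of the gene is printed in the first column, because many "concise descriptions" do not include the name.

The output format is a single line per gene, starting with its name (+ colon), followed by the "pure" concise description.

By the way: if you want to create a second output with the "Automated description" on each line, change the pattern on the second awk line from /Concise description/ to /Automated description/.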
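Instead of editing the script for each section, the section name could be passed in as a variable. This is only a sketch under the same assumptions as the script above (gene ID lines contain WBGene with the gene name in the second field, and a line starting with = ends a record); `extract_section` and `section` are hypothetical names, and the sample record is made up.

```shell
extract_section() {   # usage: extract_section "Concise description" < descriptions.txt
  awk -v section="$1" '
  /WBGene/                    { if (started) print ""; started=1   # finish the previous gene line
                                printf "%s:", $2; flag=0; next }
  index($0, section ":") == 1 { flag=1                              # header of the wanted section
                                $0 = substr($0, length(section) + 2) }
  /^=/ || /^[A-Za-z ]+ description:/ { flag=0 }                     # separator or another section header
  flag                        { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "")
                                if ($0 != "") printf " %s", $0 }
  END                         { if (started) print "" }
  '
}

# Made-up record in the assumed layout.
printf 'WBGene00000001\taap-1\tY110A7A.10\nConcise description: aap-1 encodes \nthe ortholog.\nAutomated description: Enables binding.\n=\n' |
  extract_section "Automated description"
```

Unlike the script above, this variant strips the section header with substr() rather than blanking fields, so the original spacing inside the description text is left alone.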

