Shift values in column depending on next value and fill empty entries
I have a data wrangling problem I am not sure how to solve. I have a dataframe in which the values in one column are shifted up relative to the rows they belong to, and that column is not completely filled. I need to shift the values down and fill X rows with each one, where X depends on how much data there is in the other columns.
EDIT: I have changed how I display the data. Before, I had pasted it as a Markdown table, which misled people. I am sorry for that.
The data I am dealing with looks like this:
code IdGene Type COGgene PosLeft postRight Strand Function
1
1075082 CDS ROG0189 93 710 + NA
8
1075089 CDS COG0226 5632 6741 + [P] ABC-type phosphate transport system, periplasmic component
1075103 CDS NA 6796 7869 + NA
9
1075105 CDS NA 8075 8923 + NA
1075096 CDS ROG0189 8983 10149 + NA
1071820 CDS NA 10181 10723 + NA
10
1071880 CDS COG0642 10893 13316 + [T] Signal transduction histidine kinase
1072052 CDS COG2204 13288 14586 + [T] Response regulator containing CheY-like receiver, AAA-type
12
1075092 CDS NA 15525 16472 + NA
13
1075087 CDS NA 16655 17371 + NA
1074837 CDS NA 17383 17703 + NA
1071956 CDS NA 17710 18168 + NA
14
1071684 CDS NA 18251 18919 - NA
15
1075519 CDS ROG5478 19044 19334 + NA
27
1075067 CDS ROG8331 35989 36417 + NA
1075056 CDS COG2244 36478 38019 + [R] Membrane protein involved in the export
1075546 CDS COG1035 38016 39218 + [C] Coenzyme F420-reducing hydrogenase, beta subunit
1074004 CDS ROG1263 39215 40375 + NA
1075083 CDS COG1701 40406 40582 + [S] Uncharacterized protein conserved in archaea
1075068 CDS COG0463 40593 41537 + [M] Glycosyltransferases involved in cell wall biogenesis
1075064 CDS ROG2632 41534 42700 + NA
1075066 CDS COG0463 42724 43656 + [M] Glycosyltransferases involved in cell wall biogenesis
1075069 CDS COG1215 43671 44066 + [M] Glycosyltransferases, probably involved in cell wall
And I need to transform it into this:
code IdGene Type COGgene PosLeft postRight Strand Function
1 1075082 CDS ROG0189 93 710 + NA
8 1075089 CDS COG0226 5632 6741 + [P] ABC-type phosphate transport system, periplasmic component
8 1075103 CDS NA 6796 7869 + NA
9 1075105 CDS NA 8075 8923 + NA
9 1075096 CDS ROG0189 8983 10149 + NA
9 1071820 CDS NA 10181 10723 + NA
10 1071880 CDS COG0642 10893 13316 + [T] Signal transduction histidine kinase
10 1072052 CDS COG2204 13288 14586 + [T] Response regulator containing CheY-like receiver, AAA-type
12 1075092 CDS NA 15525 16472 + NA
13 1075087 CDS NA 16655 17371 + NA
13 1074837 CDS NA 17383 17703 + NA
13 1071956 CDS NA 17710 18168 + NA
14 1071684 CDS NA 18251 18919 - NA
15 1075519 CDS ROG5478 19044 19334 + NA
27 1075067 CDS ROG8331 35989 36417 + NA
27 1075056 CDS COG2244 36478 38019 + [R] Membrane protein involved in the export
27 1075546 CDS COG1035 38016 39218 + [C] Coenzyme F420-reducing hydrogenase, beta subunit
27 1074004 CDS ROG1263 39215 40375 + NA
27 1075083 CDS COG1701 40406 40582 + [S] Uncharacterized protein conserved in archaea
27 1075068 CDS COG0463 40593 41537 + [M] Glycosyltransferases involved in cell wall biogenesis
27 1075064 CDS ROG2632 41534 42700 + NA
27 1075066 CDS COG0463 42724 43656 + [M] Glycosyltransferases involved in cell wall biogenesis
27 1075069 CDS COG1215 43671 44066 + [M] Glycosyltransferases, probably involved in cell wall
Any ideas or pointers on how to solve this would be great. Ideally in R, but awk or other tools are fine too.
If you are OK with the output formatting (the column spacing changes, because the fields are rejoined with tabs), you could try the following awk, assuming you are reading the data from Input_file.
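awk '
BEGIN{
  OFS="\t"              # rebuilt lines are joined with tabs
}
FNR==1 || FNR==2{       # first two lines: print unchanged
  print
  next
}
$2~/[0-9]+/{            # second field contains a digit: remember it, print nothing
  value=$2
  next
}
{
  $2=value" | "         # otherwise put the remembered value, plus a pipe separator, in field 2
}
1                       # print the (modified) line
' Input_file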
This one-liner gives the expected result (note that \s is a GNU awk extension):
awk -F '|' '1*$2{id=$2;next}NR<3||sub(/\s+/,id)' input
If the file f contains your input data:
$ awk -F '|' '1*$2{id=$2;next}NR<3||sub(/\s+/,id)' f
| code | IdGene | Type | COGgene | PosLeft | postRight | Strand | Function |
|------|---------|------|---------|---------|-----------|--------|----------|
| 1 | 1075082 | CDS | ROG0189 | 93 | 710 | + | NA |
| 2 | 1075099 | CDS | NA | 783 | 1778 | + | NA |
| 3 | 1073305 | CDS | NA | 1872 | 2648 | + | NA |
| 4 | 1075537 | CDS | NA | 2783 | 3451 | + | NA |
| 4 | 1074931 | CDS | COG0186 | 3460 | 3996 | + | KO |
| 5 | 1075097 | CDS | NA | 4088 | 4534 | + | NA |
| 5 | 1074010 | CDS | NA | 4457 | 4849 | - | NA |
| 5 | 1075093 | CDS | ROG5695 | 4958 | 5503 | + | NA |
| 5 | 1075089 | CDS | COG0226 | 5632 | 6741 | + | KO |
| 5 | 1075103 | CDS | NA | 6796 | 7869 | + | NA |
| 5 | 1075105 | CDS | NA | 8075 | 8923 | + | NA |
| 5 | 1075096 | CDS | ROG0189 | 8983 | 10149 | + | NA |
| 5 | 1071820 | CDS | NA | 10181 | 10723 | + | NA |
This one-liner works for the new input and keeps the output format:
awk 'NF<2{id=$1;next}NR==1||sub("\\s{"length(id)"}",id)' file
Test again with the input data in f:
$ awk 'NF<2{id=$1;next}NR==1||sub("\\s{"length(id)"}",id)' f
code IdGene Type COGgene PosLeft postRight Strand Function
1 1075082 CDS ROG0189 93 710 + NA
8 1075089 CDS COG0226 5632 6741 + [P] ABC-type phosphate transport system, periplasmic component
8 1075103 CDS NA 6796 7869 + NA
9 1075105 CDS NA 8075 8923 + NA
9 1075096 CDS ROG0189 8983 10149 + NA
9 1071820 CDS NA 10181 10723 + NA
10 1071880 CDS COG0642 10893 13316 + [T] Signal transduction histidine kinase
10 1072052 CDS COG2204 13288 14586 + [T] Response regulator containing CheY-like receiver, AAA-type
12 1075092 CDS NA 15525 16472 + NA
13 1075087 CDS NA 16655 17371 + NA
13 1074837 CDS NA 17383 17703 + NA
13 1071956 CDS NA 17710 18168 + NA
14 1071684 CDS NA 18251 18919 - NA
15 1075519 CDS ROG5478 19044 19334 + NA
27 1075067 CDS ROG8331 35989 36417 + NA
27 1075056 CDS COG2244 36478 38019 + [R] Membrane protein involved in the export
27 1075546 CDS COG1035 38016 39218 + [C] Coenzyme F420-reducing hydrogenase, beta subunit
27 1074004 CDS ROG1263 39215 40375 + NA
27 1075083 CDS COG1701 40406 40582 + [S] Uncharacterized protein conserved in archaea
27 1075068 CDS COG0463 40593 41537 + [M] Glycosyltransferases involved in cell wall biogenesis
27 1075064 CDS ROG2632 41534 42700 + NA
27 1075066 CDS COG0463 42724 43656 + [M] Glycosyltransferases involved in cell wall biogenesis
27 1075069 CDS COG1215 43671 44066 + [M] Glycosyltransferases, probably involved in cell
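For anyone without GNU awk, here is a minimal sketch of the same idea that avoids \s and interval expressions. It assumes, as the one-liner above does, that a line holding a single field is a code and that every other non-empty line is a data row belonging to the most recent code. The function name fill_codes is made up for illustration. Unlike the one-liner, it does not try to preserve the column alignment: it strips any leading indentation and joins the code to the row with a single space.

```shell
# Hypothetical helper: prefix each data row with the most recent code line.
fill_codes() {
  awk '
    NR == 1 { print; next }        # header line: pass through unchanged
    NF == 0 { next }               # skip blank lines
    NF == 1 { code = $1; next }    # a line with one field is a code: remember it
    {
      sub(/^[ \t]+/, "")           # drop leading indentation, if any
      print code " " $0            # emit the row with its code in front
    }
  ' "$@"
}

# Small demo on a few rows in the layout shown in the question.
printf '%s\n' \
  'code IdGene Type COGgene' \
  '1' \
  '    1075082 CDS ROG0189' \
  '8' \
  '    1075089 CDS COG0226' \
  '    1075103 CDS NA' | fill_codes
```

Called as fill_codes f, it reads the file instead of standard input. If the aligned layout matters, the output can be piped through column -t, although that would also split the free-text Function column on spaces.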