I have a data wrangling problem I am not sure how to solve. I have a dataframe in which the values in one column are shifted up relative to the other columns, and that column is not completely filled. I need to shift its values down and repeat each one for as many rows as there are matching data rows in the other columns.
EDIT: I have changed how I display the data. Before, I had pasted it as a markdown table, which misled people. I am sorry for that. The data I am dealing with looks like this:
code IdGene Type COGgene PosLeft postRight Strand Function
1
1075082 CDS ROG0189 93 710 + NA
8
1075089 CDS COG0226 5632 6741 + [P] ABC-type phosphate transport system, periplasmic component
1075103 CDS NA 6796 7869 + NA
9
1075105 CDS NA 8075 8923 + NA
1075096 CDS ROG0189 8983 10149 + NA
1071820 CDS NA 10181 10723 + NA
10
1071880 CDS COG0642 10893 13316 + [T] Signal transduction histidine kinase
1072052 CDS COG2204 13288 14586 + [T] Response regulator containing CheY-like receiver, AAA-type
12
1075092 CDS NA 15525 16472 + NA
13
1075087 CDS NA 16655 17371 + NA
1074837 CDS NA 17383 17703 + NA
1071956 CDS NA 17710 18168 + NA
14
1071684 CDS NA 18251 18919 - NA
15
1075519 CDS ROG5478 19044 19334 + NA
27
1075067 CDS ROG8331 35989 36417 + NA
1075056 CDS COG2244 36478 38019 + [R] Membrane protein involved in the export
1075546 CDS COG1035 38016 39218 + [C] Coenzyme F420-reducing hydrogenase, beta subunit
1074004 CDS ROG1263 39215 40375 + NA
1075083 CDS COG1701 40406 40582 + [S] Uncharacterized protein conserved in archaea
1075068 CDS COG0463 40593 41537 + [M] Glycosyltransferases involved in cell wall biogenesis
1075064 CDS ROG2632 41534 42700 + NA
1075066 CDS COG0463 42724 43656 + [M] Glycosyltransferases involved in cell wall biogenesis
1075069 CDS COG1215 43671 44066 + [M] Glycosyltransferases, probably involved in cell wall
And I need to transform it into this:
code IdGene Type COGgene PosLeft postRight Strand Function
1 1075082 CDS ROG0189 93 710 + NA
8 1075089 CDS COG0226 5632 6741 + [P] ABC-type phosphate transport system, periplasmic component
8 1075103 CDS NA 6796 7869 + NA
9 1075105 CDS NA 8075 8923 + NA
9 1075096 CDS ROG0189 8983 10149 + NA
9 1071820 CDS NA 10181 10723 + NA
10 1071880 CDS COG0642 10893 13316 + [T] Signal transduction histidine kinase
10 1072052 CDS COG2204 13288 14586 + [T] Response regulator containing CheY-like receiver, AAA-type
12 1075092 CDS NA 15525 16472 + NA
13 1075087 CDS NA 16655 17371 + NA
13 1074837 CDS NA 17383 17703 + NA
13 1071956 CDS NA 17710 18168 + NA
14 1071684 CDS NA 18251 18919 - NA
15 1075519 CDS ROG5478 19044 19334 + NA
27 1075067 CDS ROG8331 35989 36417 + NA
27 1075056 CDS COG2244 36478 38019 + [R] Membrane protein involved in the export
27 1075546 CDS COG1035 38016 39218 + [C] Coenzyme F420-reducing hydrogenase, beta subunit
27 1074004 CDS ROG1263 39215 40375 + NA
27 1075083 CDS COG1701 40406 40582 + [S] Uncharacterized protein conserved in archaea
27 1075068 CDS COG0463 40593 41537 + [M] Glycosyltransferases involved in cell wall biogenesis
27 1075064 CDS ROG2632 41534 42700 + NA
27 1075066 CDS COG0463 42724 43656 + [M] Glycosyltransferases involved in cell wall biogenesis
27 1075069 CDS COG1215 43671 44066 + [M] Glycosyltransferases, probably involved in cell wall
Any ideas or pointers on how to solve this would be great. Ideally in R, but awk or other tools are fine too.
If you are OK with the output formatting changing (the spacing between columns), you could try the following in awk, assuming you are reading the data from Input_file.
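awk '
BEGIN{
  OFS="\t"            # rebuilt lines are joined with tabs
}
FNR==1 || FNR==2{     # print the first two lines unchanged
  print
  next
}
$2~/[0-9]+/{          # field 2 contains a digit: remember it and skip the line
  value=$2
  next
}
{
  $2=value" | "       # otherwise overwrite field 2 with the remembered value
}
1                     # print the (possibly modified) line
' Input_file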
This oneliner gives the expected result:
awk -F '|' '1*$2{id=$2;next}NR<3||sub(/\s+/,id)' input
If the file f contains your input data:
$ awk -F '|' '1*$2{id=$2;next}NR<3||sub(/\s+/,id)' f
| code | IdGene | Type | COGgene | PosLeft | postRight | Strand | Function |
|------|---------|------|---------|---------|-----------|--------|----------|
| 1 | 1075082 | CDS | ROG0189 | 93 | 710 | + | NA |
| 2 | 1075099 | CDS | NA | 783 | 1778 | + | NA |
| 3 | 1073305 | CDS | NA | 1872 | 2648 | + | NA |
| 4 | 1075537 | CDS | NA | 2783 | 3451 | + | NA |
| 4 | 1074931 | CDS | COG0186 | 3460 | 3996 | + | KO |
| 5 | 1075097 | CDS | NA | 4088 | 4534 | + | NA |
| 5 | 1074010 | CDS | NA | 4457 | 4849 | - | NA |
| 5 | 1075093 | CDS | ROG5695 | 4958 | 5503 | + | NA |
| 5 | 1075089 | CDS | COG0226 | 5632 | 6741 | + | KO |
| 5 | 1075103 | CDS | NA | 6796 | 7869 | + | NA |
| 5 | 1075105 | CDS | NA | 8075 | 8923 | + | NA |
| 5 | 1075096 | CDS | ROG0189 | 8983 | 10149 | + | NA |
| 5 | 1071820 | CDS | NA | 10181 | 10723 | + | NA |
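The `1*$2` pattern in the one-liner above relies on awk's string-to-number coercion: multiplying by 1 turns a field such as " 4 " into 4 (true), while a blank field, header text, or a run of dashes becomes 0 (false). Here is a minimal sketch of that test on its own, assuming pipe-delimited rows like those in the earlier markdown version of the question. The sample rows and the `classify` function name are made up for illustration:

```shell
# Hypothetical helper: label each pipe-delimited line by whether its
# second field coerces to a nonzero number.
classify() {
  awk -F '|' '{ print (1 * $2 ? "code" : "other") }' "$@"
}

# Made-up rows: a code-only row, a data row with a blank code cell,
# and a markdown separator row.
printf '%s\n' '| 4 |  |' '|   | 1075537 |' '|------|------|' | classify
```

Note that a code of 0 would also count as false, so this test only suits data whose codes are nonzero.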
This one-liner will work for the new input and keep the output format:
awk 'NF<2{id=$1;next}NR==1||sub("\\s{"length(id)"}",id)' file
Test again with the input data in f:
$ awk 'NF<2{id=$1;next}NR==1||sub("\\s{"length(id)"}",id)' f
code IdGene Type COGgene PosLeft postRight Strand Function
1 1075082 CDS ROG0189 93 710 + NA
8 1075089 CDS COG0226 5632 6741 + [P] ABC-type phosphate transport system, periplasmic component
8 1075103 CDS NA 6796 7869 + NA
9 1075105 CDS NA 8075 8923 + NA
9 1075096 CDS ROG0189 8983 10149 + NA
9 1071820 CDS NA 10181 10723 + NA
10 1071880 CDS COG0642 10893 13316 + [T] Signal transduction histidine kinase
10 1072052 CDS COG2204 13288 14586 + [T] Response regulator containing CheY-like receiver, AAA-type
12 1075092 CDS NA 15525 16472 + NA
13 1075087 CDS NA 16655 17371 + NA
13 1074837 CDS NA 17383 17703 + NA
13 1071956 CDS NA 17710 18168 + NA
14 1071684 CDS NA 18251 18919 - NA
15 1075519 CDS ROG5478 19044 19334 + NA
27 1075067 CDS ROG8331 35989 36417 + NA
27 1075056 CDS COG2244 36478 38019 + [R] Membrane protein involved in the export
27 1075546 CDS COG1035 38016 39218 + [C] Coenzyme F420-reducing hydrogenase, beta subunit
27 1074004 CDS ROG1263 39215 40375 + NA
27 1075083 CDS COG1701 40406 40582 + [S] Uncharacterized protein conserved in archaea
27 1075068 CDS COG0463 40593 41537 + [M] Glycosyltransferases involved in cell wall biogenesis
27 1075064 CDS ROG2632 41534 42700 + NA
27 1075066 CDS COG0463 42724 43656 + [M] Glycosyltransferases involved in cell wall biogenesis
27 1075069 CDS COG1215 43671 44066 + [M] Glycosyltransferases, probably involved in cell wall
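The one-liner above uses `\s` and a `{n}` interval expression, which not every awk supports. Below is a rough, more portable sketch of the same idea, spelled out step by step. It assumes that the first line is the header, that a line with exactly one field is a code, and that every other line is a data row (with or without leading blanks). Unlike the one-liner, it does not preserve the original column alignment. The function name `fill_codes` is made up for illustration:

```shell
# Hypothetical helper: carry each code-only line down onto the data rows
# that follow it. Reads the named files, or stdin if none are given.
fill_codes() {
  awk '
    NR == 1 { print; next }        # header: pass through unchanged
    NF == 1 { code = $1; next }    # lone field: remember it as the current code
    {
      sub(/^[ \t]+/, "")           # drop any leading blanks on the data row
      print code, $0               # prepend the remembered code
    }
  ' "$@"
}

# Usage on a few rows taken from the question:
printf '%s\n' \
  'code IdGene Type COGgene PosLeft postRight Strand Function' \
  '8' \
  '1075089 CDS COG0226 5632 6741 + [P] ABC-type phosphate transport system, periplasmic component' \
  '1075103 CDS NA 6796 7869 + NA' | fill_codes
```

A data row that happens to contain only one field would be mistaken for a code, so check the input for such rows first.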