Shift values in column depending on next value and fill empty entries

Question

I have a data wrangling problem I am not sure how to solve. I have a dataframe in which the values in one of the columns are shifted up, and that column is only partially filled. I need to shift the values down and fill in the following rows, where the number of rows to fill depends on how much data there is in the other columns.

EDIT: I have changed how I display the data. Before, I had pasted it as a markdown table, which misled people. I am sorry for that. The data I am dealing with looks like this:

code    IdGene  Type    COGgene PosLeft postRight   Strand  Function
1
    1075082 CDS ROG0189 93  710 +   NA
8
    1075089 CDS COG0226 5632    6741    +   [P] ABC-type phosphate transport system, periplasmic component
    1075103 CDS NA  6796    7869    +   NA
9
    1075105 CDS NA  8075    8923    +   NA
    1075096 CDS ROG0189 8983    10149   +   NA
    1071820 CDS NA  10181   10723   +   NA
10
    1071880 CDS COG0642 10893   13316   +   [T] Signal transduction histidine kinase
    1072052 CDS COG2204 13288   14586   +   [T] Response regulator containing CheY-like receiver, AAA-type
12
    1075092 CDS NA  15525   16472   +   NA
13
    1075087 CDS NA  16655   17371   +   NA
    1074837 CDS NA  17383   17703   +   NA
    1071956 CDS NA  17710   18168   +   NA
14
    1071684 CDS NA  18251   18919   -   NA
15
    1075519 CDS ROG5478 19044   19334   +   NA
27
    1075067 CDS ROG8331 35989   36417   +   NA
    1075056 CDS COG2244 36478   38019   +   [R] Membrane protein involved in the export
    1075546 CDS COG1035 38016   39218   +   [C] Coenzyme F420-reducing hydrogenase, beta subunit
    1074004 CDS ROG1263 39215   40375   +   NA
    1075083 CDS COG1701 40406   40582   +   [S] Uncharacterized protein conserved in archaea
    1075068 CDS COG0463 40593   41537   +   [M] Glycosyltransferases involved in cell wall biogenesis
    1075064 CDS ROG2632 41534   42700   +   NA
    1075066 CDS COG0463 42724   43656   +   [M] Glycosyltransferases involved in cell wall biogenesis
    1075069 CDS COG1215 43671   44066   +   [M] Glycosyltransferases, probably involved in cell wall

And I need to transform it into this:

code    IdGene  Type    COGgene PosLeft postRight   Strand  Function
1   1075082 CDS ROG0189 93  710 +   NA
8   1075089 CDS COG0226 5632    6741    +   [P] ABC-type phosphate transport system, periplasmic component
8   1075103 CDS NA  6796    7869    +   NA
9   1075105 CDS NA  8075    8923    +   NA
9   1075096 CDS ROG0189 8983    10149   +   NA
9   1071820 CDS NA  10181   10723   +   NA
10  1071880 CDS COG0642 10893   13316   +   [T] Signal transduction histidine kinase
10  1072052 CDS COG2204 13288   14586   +   [T] Response regulator containing CheY-like receiver, AAA-type
12  1075092 CDS NA  15525   16472   +   NA
13  1075087 CDS NA  16655   17371   +   NA
13  1074837 CDS NA  17383   17703   +   NA
13  1071956 CDS NA  17710   18168   +   NA
14  1071684 CDS NA  18251   18919   -   NA
15  1075519 CDS ROG5478 19044   19334   +   NA
27  1075067 CDS ROG8331 35989   36417   +   NA
27  1075056 CDS COG2244 36478   38019   +   [R] Membrane protein involved in the export
27  1075546 CDS COG1035 38016   39218   +   [C] Coenzyme F420-reducing hydrogenase, beta subunit
27  1074004 CDS ROG1263 39215   40375   +   NA
27  1075083 CDS COG1701 40406   40582   +   [S] Uncharacterized protein conserved in archaea
27  1075068 CDS COG0463 40593   41537   +   [M] Glycosyltransferases involved in cell wall biogenesis
27  1075064 CDS ROG2632 41534   42700   +   NA
27  1075066 CDS COG0463 42724   43656   +   [M] Glycosyltransferases involved in cell wall biogenesis
27  1075069 CDS COG1215 43671   44066   +   [M] Glycosyltransferases, probably involved in cell wall

Any ideas or pointers on how to solve this would be great. Ideally in R, but awk or other tools are fine too.

Answer 1

If you are OK with the formatting of the output (that is, the spacing between columns), you could try the following awk, assuming you are reading the data from Input_file.

awk '
BEGIN{
  OFS="\t"
}
FNR==1 || FNR==2{
  print
  next
}
$2~/[0-9]+/{
  value=$2
  next
}
{
  $2=value"    | "
}
1
'  Input_file
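
For the reformatted input shown in the question, where each code sits alone on its own line, the same "remember the last code, then fill it in" idea can be sketched as follows. This is only a sketch under some assumptions: the function name fill_code is made up for illustration, the first line is taken to be the header, every data row is assumed to have at least two fields, and the code is joined to the row with a tab rather than the original spacing.

```shell
# fill_code (hypothetical name): read the "code on its own line" layout
# from stdin and print one row per gene with the code filled in.
fill_code() {
  awk '
    NR == 1 { print; next }        # header line passes through unchanged
    NF == 0 { next }               # ignore blank lines
    NF == 1 { code = $1; next }    # a lone field is the code for the rows below it
    {
      sub(/^[ \t]+/, "")           # drop the indentation left by the empty code column
      print code "\t" $0           # prepend the remembered code, tab-separated
    }
  '
}
```

It could then be called as fill_code < Input_file. Only POSIX awk features are used, so it should not depend on a particular awk implementation.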

Answer 2

This one-liner gives the expected result:

awk -F '|' '1*$2{id=$2;next}NR<3||sub(/\s+/,id)' input

If the file f contains your input data:

$ awk -F '|' '1*$2{id=$2;next}NR<3||sub(/\s+/,id)' f
| code | IdGene  | Type | COGgene | PosLeft | postRight | Strand | Function |
|------|---------|------|---------|---------|-----------|--------|----------|
| 1    | 1075082 | CDS  | ROG0189 | 93      | 710       | +      | NA       |
| 2    | 1075099 | CDS  | NA      | 783     | 1778      | +      | NA       |
| 3    | 1073305 | CDS  | NA      | 1872    | 2648      | +      | NA       |
| 4    | 1075537 | CDS  | NA      | 2783    | 3451      | +      | NA       |
| 4    | 1074931 | CDS  | COG0186 | 3460    | 3996      | +      | KO       |
| 5    | 1075097 | CDS  | NA      | 4088    | 4534      | +      | NA       |
| 5    | 1074010 | CDS  | NA      | 4457    | 4849      | -      | NA       |
| 5    | 1075093 | CDS  | ROG5695 | 4958    | 5503      | +      | NA       |
| 5    | 1075089 | CDS  | COG0226 | 5632    | 6741      | +      | KO       |
| 5    | 1075103 | CDS  | NA      | 6796    | 7869      | +      | NA       |
| 5    | 1075105 | CDS  | NA      | 8075    | 8923      | +      | NA       |
| 5    | 1075096 | CDS  | ROG0189 | 8983    | 10149     | +      | NA       |
| 5    | 1071820 | CDS  | NA      | 10181   | 10723     | +      | NA       |

Update for the input change:

This one-liner will work for the new input and keep the output format:

awk  'NF<2{id=$1;next}NR==1||sub("\\s{"length(id)"}",id)' file

Testing again with the input data in f:

$ awk  'NF<2{id=$1;next}NR==1||sub("\\s{"length(id)"}",id)' f
code    IdGene  Type    COGgene PosLeft postRight   Strand  Function
1   1075082 CDS ROG0189 93  710 +   NA
8   1075089 CDS COG0226 5632    6741    +   [P] ABC-type phosphate transport system, periplasmic component
8   1075103 CDS NA  6796    7869    +   NA
9   1075105 CDS NA  8075    8923    +   NA
9   1075096 CDS ROG0189 8983    10149   +   NA
9   1071820 CDS NA  10181   10723   +   NA
10  1071880 CDS COG0642 10893   13316   +   [T] Signal transduction histidine kinase
10  1072052 CDS COG2204 13288   14586   +   [T] Response regulator containing CheY-like receiver, AAA-type
12  1075092 CDS NA  15525   16472   +   NA
13  1075087 CDS NA  16655   17371   +   NA
13  1074837 CDS NA  17383   17703   +   NA
13  1071956 CDS NA  17710   18168   +   NA
14  1071684 CDS NA  18251   18919   -   NA
15  1075519 CDS ROG5478 19044   19334   +   NA
27  1075067 CDS ROG8331 35989   36417   +   NA
27  1075056 CDS COG2244 36478   38019   +   [R] Membrane protein involved in the export
27  1075546 CDS COG1035 38016   39218   +   [C] Coenzyme F420-reducing hydrogenase, beta subunit
27  1074004 CDS ROG1263 39215   40375   +   NA
27  1075083 CDS COG1701 40406   40582   +   [S] Uncharacterized protein conserved in archaea
27  1075068 CDS COG0463 40593   41537   +   [M] Glycosyltransferases involved in cell wall biogenesis
27  1075064 CDS ROG2632 41534   42700   +   NA
27  1075066 CDS COG0463 42724   43656   +   [M] Glycosyltransferases involved in cell wall biogenesis
27  1075069 CDS COG1215 43671   44066   +   [M] Glycosyltransferases, probably involved in cell wall
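
To make the logic of the updated one-liner easier to follow, here is an expanded sketch of the same idea. It differs from the one-liner in one respect: instead of a dynamic regex with \s (a GNU awk extension) and an interval expression, it overwrites the first length(id) characters of each data row using substr. That assumes the indentation of every data row is at least as wide as the code, which holds for the sample data. The function name fill_code_aligned is made up for illustration.

```shell
# fill_code_aligned (hypothetical name): fill the code into each data row
# while keeping the original column alignment.
fill_code_aligned() {
  awk '
    NR == 1 { print; next }        # header line passes through unchanged
    NF < 2  { id = $1; next }      # a line with a single field holds the code
    {
      # Overwrite the first length(id) characters of the indentation with
      # the code, so the remaining spacing (and the alignment) is preserved.
      print id substr($0, length(id) + 1)
    }
  '
}
```

Usage would be fill_code_aligned < f. With a row indented by four spaces and id set to 10, the first two spaces are replaced and the other two are kept, which is how the one-liner keeps the output format.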
