
Shift values in column depending on next value and fill empty entries

I have a data wrangling problem that I am not sure how to solve. In my dataframe, the values in one column are shifted up onto rows of their own, and that column is not completely filled. I need to shift those values down and fill in the rows that follow; how many rows to fill depends on how much data there is in the other columns.

EDIT: I have changed how I display the data. I had originally pasted it as a markdown table, which misled people. I am sorry for that. The data I am dealing with looks like this:

code    IdGene  Type    COGgene PosLeft postRight   Strand  Function
1
    1075082 CDS ROG0189 93  710 +   NA
8
    1075089 CDS COG0226 5632    6741    +   [P] ABC-type phosphate transport system, periplasmic component
    1075103 CDS NA  6796    7869    +   NA
9
    1075105 CDS NA  8075    8923    +   NA
    1075096 CDS ROG0189 8983    10149   +   NA
    1071820 CDS NA  10181   10723   +   NA
10
    1071880 CDS COG0642 10893   13316   +   [T] Signal transduction histidine kinase
    1072052 CDS COG2204 13288   14586   +   [T] Response regulator containing CheY-like receiver, AAA-type
12
    1075092 CDS NA  15525   16472   +   NA
13
    1075087 CDS NA  16655   17371   +   NA
    1074837 CDS NA  17383   17703   +   NA
    1071956 CDS NA  17710   18168   +   NA
14
    1071684 CDS NA  18251   18919   -   NA
15
    1075519 CDS ROG5478 19044   19334   +   NA
27
    1075067 CDS ROG8331 35989   36417   +   NA
    1075056 CDS COG2244 36478   38019   +   [R] Membrane protein involved in the export
    1075546 CDS COG1035 38016   39218   +   [C] Coenzyme F420-reducing hydrogenase, beta subunit
    1074004 CDS ROG1263 39215   40375   +   NA
    1075083 CDS COG1701 40406   40582   +   [S] Uncharacterized protein conserved in archaea
    1075068 CDS COG0463 40593   41537   +   [M] Glycosyltransferases involved in cell wall biogenesis
    1075064 CDS ROG2632 41534   42700   +   NA
    1075066 CDS COG0463 42724   43656   +   [M] Glycosyltransferases involved in cell wall biogenesis
    1075069 CDS COG1215 43671   44066   +   [M] Glycosyltransferases, probably involved in cell wall

And I need to transform it into this:

code    IdGene  Type    COGgene PosLeft postRight   Strand  Function
1   1075082 CDS ROG0189 93  710 +   NA
8   1075089 CDS COG0226 5632    6741    +   [P] ABC-type phosphate transport system, periplasmic component
8   1075103 CDS NA  6796    7869    +   NA
9   1075105 CDS NA  8075    8923    +   NA
9   1075096 CDS ROG0189 8983    10149   +   NA
9   1071820 CDS NA  10181   10723   +   NA
10  1071880 CDS COG0642 10893   13316   +   [T] Signal transduction histidine kinase
10  1072052 CDS COG2204 13288   14586   +   [T] Response regulator containing CheY-like receiver, AAA-type
12  1075092 CDS NA  15525   16472   +   NA
13  1075087 CDS NA  16655   17371   +   NA
13  1074837 CDS NA  17383   17703   +   NA
13  1071956 CDS NA  17710   18168   +   NA
14  1071684 CDS NA  18251   18919   -   NA
15  1075519 CDS ROG5478 19044   19334   +   NA
27  1075067 CDS ROG8331 35989   36417   +   NA
27  1075056 CDS COG2244 36478   38019   +   [R] Membrane protein involved in the export
27  1075546 CDS COG1035 38016   39218   +   [C] Coenzyme F420-reducing hydrogenase, beta subunit
27  1074004 CDS ROG1263 39215   40375   +   NA
27  1075083 CDS COG1701 40406   40582   +   [S] Uncharacterized protein conserved in archaea
27  1075068 CDS COG0463 40593   41537   +   [M] Glycosyltransferases involved in cell wall biogenesis
27  1075064 CDS ROG2632 41534   42700   +   NA
27  1075066 CDS COG0463 42724   43656   +   [M] Glycosyltransferases involved in cell wall biogenesis
27  1075069 CDS COG1215 43671   44066   +   [M] Glycosyltransferases, probably involved in cell wall

Any ideas or pointers on how to solve this would be great. Ideally in R, but awk or other tools are fine too.

If you are OK with the output formatting (the spacing between columns), you could try the following awk, assuming you are reading the data from Input_file.

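awk '
BEGIN{
  OFS="\t"              # join output fields with tabs when a line is rebuilt
}
FNR==1 || FNR==2{       # print the first two lines unchanged
  print
  next
}
$2~/[0-9]+/{            # second field contains a digit: remember it and skip the line
  value=$2
  next
}
{
  $2=value"    | "      # otherwise put the remembered value into the second field
}
1                       # print the line
'  Input_file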

This one-liner gives the expected result:

awk -F '|' '1*$2{id=$2;next}NR<3||sub(/\s+/,id)' input

If the file f contains your input data:

$ awk -F '|' '1*$2{id=$2;next}NR<3||sub(/\s+/,id)' f
| code | IdGene  | Type | COGgene | PosLeft | postRight | Strand | Function |
|------|---------|------|---------|---------|-----------|--------|----------|
| 1    | 1075082 | CDS  | ROG0189 | 93      | 710       | +      | NA       |
| 2    | 1075099 | CDS  | NA      | 783     | 1778      | +      | NA       |
| 3    | 1073305 | CDS  | NA      | 1872    | 2648      | +      | NA       |
| 4    | 1075537 | CDS  | NA      | 2783    | 3451      | +      | NA       |
| 4    | 1074931 | CDS  | COG0186 | 3460    | 3996      | +      | KO       |
| 5    | 1075097 | CDS  | NA      | 4088    | 4534      | +      | NA       |
| 5    | 1074010 | CDS  | NA      | 4457    | 4849      | -      | NA       |
| 5    | 1075093 | CDS  | ROG5695 | 4958    | 5503      | +      | NA       |
| 5    | 1075089 | CDS  | COG0226 | 5632    | 6741      | +      | KO       |
| 5    | 1075103 | CDS  | NA      | 6796    | 7869      | +      | NA       |
| 5    | 1075105 | CDS  | NA      | 8075    | 8923      | +      | NA       |
| 5    | 1075096 | CDS  | ROG0189 | 8983    | 10149     | +      | NA       |
| 5    | 1071820 | CDS  | NA      | 10181   | 10723     | +      | NA       |

Update for the changed input:

This one-liner works for the new input and keeps the output format:

awk  'NF<2{id=$1;next}NR==1||sub("\\s{"length(id)"}",id)' file

Testing again with the input data in f:

$ awk  'NF<2{id=$1;next}NR==1||sub("\\s{"length(id)"}",id)' f
code    IdGene  Type    COGgene PosLeft postRight   Strand  Function
1   1075082 CDS ROG0189 93  710 +   NA
8   1075089 CDS COG0226 5632    6741    +   [P] ABC-type phosphate transport system, periplasmic component
8   1075103 CDS NA  6796    7869    +   NA
9   1075105 CDS NA  8075    8923    +   NA
9   1075096 CDS ROG0189 8983    10149   +   NA
9   1071820 CDS NA  10181   10723   +   NA
10  1071880 CDS COG0642 10893   13316   +   [T] Signal transduction histidine kinase
10  1072052 CDS COG2204 13288   14586   +   [T] Response regulator containing CheY-like receiver, AAA-type
12  1075092 CDS NA  15525   16472   +   NA
13  1075087 CDS NA  16655   17371   +   NA
13  1074837 CDS NA  17383   17703   +   NA
13  1071956 CDS NA  17710   18168   +   NA
14  1071684 CDS NA  18251   18919   -   NA
15  1075519 CDS ROG5478 19044   19334   +   NA
27  1075067 CDS ROG8331 35989   36417   +   NA
27  1075056 CDS COG2244 36478   38019   +   [R] Membrane protein involved in the export
27  1075546 CDS COG1035 38016   39218   +   [C] Coenzyme F420-reducing hydrogenase, beta subunit
27  1074004 CDS ROG1263 39215   40375   +   NA
27  1075083 CDS COG1701 40406   40582   +   [S] Uncharacterized protein conserved in archaea
27  1075068 CDS COG0463 40593   41537   +   [M] Glycosyltransferases involved in cell wall biogenesis
27  1075064 CDS ROG2632 41534   42700   +   NA
27  1075066 CDS COG0463 42724   43656   +   [M] Glycosyltransferases involved in cell wall biogenesis
27  1075069 CDS COG1215 43671   44066   +   [M] Glycosyltransferases, probably involved in cell
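The one-liner builds a regex from `\s` and an interval expression, and `\s` is a GNU awk extension. A minimal sketch of the same idea without a regex is below: it overwrites the first `length(id)` characters of each data row with the code. It assumes every data row starts with at least that many space characters (not a tab); `genes.txt` and `fill_codes` are hypothetical names, and the sample is a shortened copy of the question's layout.

```shell
# Hypothetical sample: shortened rows in the layout shown in the question
printf '%s\n' \
  'code    IdGene  Type    COGgene' \
  '1' \
  '    1075082 CDS ROG0189' \
  '27' \
  '    1075067 CDS ROG8331' \
  '    1075056 CDS COG2244' > genes.txt

fill_codes() {
  awk '
    NR == 1 { print; next }       # header line passes through unchanged
    NF == 1 { id = $1; next }     # a line with a single field is a code: remember it, print nothing
    { print id substr($0, length(id) + 1) }   # overwrite the leading blanks with the code
  ' "$1"
}

fill_codes genes.txt
```

Since only the leading characters are replaced, the column alignment of the remaining text is left as it was.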

