
Shift values in column depending on next value and fill empty entries

I have a data wrangling problem I am not sure how to solve. I have a dataframe in which the values in one column are shifted up and the column is not completely filled. I need to shift those values down and fill X rows with each one, where X depends on how much data there is in the other columns.

EDIT: I have changed how I displayed the data. Before, I had pasted it as a markdown table, which misled people. I am sorry for that. The data I am dealing with looks like this:

code    IdGene  Type    COGgene PosLeft postRight   Strand  Function
1
    1075082 CDS ROG0189 93  710 +   NA
8
    1075089 CDS COG0226 5632    6741    +   [P] ABC-type phosphate transport system, periplasmic component
    1075103 CDS NA  6796    7869    +   NA
9
    1075105 CDS NA  8075    8923    +   NA
    1075096 CDS ROG0189 8983    10149   +   NA
    1071820 CDS NA  10181   10723   +   NA
10
    1071880 CDS COG0642 10893   13316   +   [T] Signal transduction histidine kinase
    1072052 CDS COG2204 13288   14586   +   [T] Response regulator containing CheY-like receiver, AAA-type
12
    1075092 CDS NA  15525   16472   +   NA
13
    1075087 CDS NA  16655   17371   +   NA
    1074837 CDS NA  17383   17703   +   NA
    1071956 CDS NA  17710   18168   +   NA
14
    1071684 CDS NA  18251   18919   -   NA
15
    1075519 CDS ROG5478 19044   19334   +   NA
27
    1075067 CDS ROG8331 35989   36417   +   NA
    1075056 CDS COG2244 36478   38019   +   [R] Membrane protein involved in the export
    1075546 CDS COG1035 38016   39218   +   [C] Coenzyme F420-reducing hydrogenase, beta subunit
    1074004 CDS ROG1263 39215   40375   +   NA
    1075083 CDS COG1701 40406   40582   +   [S] Uncharacterized protein conserved in archaea
    1075068 CDS COG0463 40593   41537   +   [M] Glycosyltransferases involved in cell wall biogenesis
    1075064 CDS ROG2632 41534   42700   +   NA
    1075066 CDS COG0463 42724   43656   +   [M] Glycosyltransferases involved in cell wall biogenesis
    1075069 CDS COG1215 43671   44066   +   [M] Glycosyltransferases, probably involved in cell wall

And I need to transform it into this:

code    IdGene  Type    COGgene PosLeft postRight   Strand  Function
1   1075082 CDS ROG0189 93  710 +   NA
8   1075089 CDS COG0226 5632    6741    +   [P] ABC-type phosphate transport system, periplasmic component
8   1075103 CDS NA  6796    7869    +   NA
9   1075105 CDS NA  8075    8923    +   NA
9   1075096 CDS ROG0189 8983    10149   +   NA
9   1071820 CDS NA  10181   10723   +   NA
10  1071880 CDS COG0642 10893   13316   +   [T] Signal transduction histidine kinase
10  1072052 CDS COG2204 13288   14586   +   [T] Response regulator containing CheY-like receiver, AAA-type
12  1075092 CDS NA  15525   16472   +   NA
13  1075087 CDS NA  16655   17371   +   NA
13  1074837 CDS NA  17383   17703   +   NA
13  1071956 CDS NA  17710   18168   +   NA
14  1071684 CDS NA  18251   18919   -   NA
15  1075519 CDS ROG5478 19044   19334   +   NA
27  1075067 CDS ROG8331 35989   36417   +   NA
27  1075056 CDS COG2244 36478   38019   +   [R] Membrane protein involved in the export
27  1075546 CDS COG1035 38016   39218   +   [C] Coenzyme F420-reducing hydrogenase, beta subunit
27  1074004 CDS ROG1263 39215   40375   +   NA
27  1075083 CDS COG1701 40406   40582   +   [S] Uncharacterized protein conserved in archaea
27  1075068 CDS COG0463 40593   41537   +   [M] Glycosyltransferases involved in cell wall biogenesis
27  1075064 CDS ROG2632 41534   42700   +   NA
27  1075066 CDS COG0463 42724   43656   +   [M] Glycosyltransferases involved in cell wall biogenesis
27  1075069 CDS COG1215 43671   44066   +   [M] Glycosyltransferases, probably involved in cell wall

Any ideas or pointers on how to solve this would be great. Ideally in R, but awk or other tools are fine too.

If you are OK with the output formatting changing (the column spacing), you could try the following in awk, assuming you are reading the data from Input_file.

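awk '
BEGIN{
  OFS="\t"           # rebuilt lines are joined with tabs
}
FNR==1 || FNR==2{    # print the first two lines unchanged
  print
  next
}
$2~/[0-9]+/{         # second field contains a digit: remember it, skip the line
  value=$2
  next
}
{
  $2=value"    | "   # otherwise overwrite the second field with the remembered value
}
1                    # print the (possibly modified) line
'  Input_file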

This one-liner gives the expected result:

awk -F '|' '1*$2{id=$2;next}NR<3||sub(/\s+/,id)' input

If the file f contains your input data:

$ awk -F '|' '1*$2{id=$2;next}NR<3||sub(/\s+/,id)' f
| code | IdGene  | Type | COGgene | PosLeft | postRight | Strand | Function |
|------|---------|------|---------|---------|-----------|--------|----------|
| 1    | 1075082 | CDS  | ROG0189 | 93      | 710       | +      | NA       |
| 2    | 1075099 | CDS  | NA      | 783     | 1778      | +      | NA       |
| 3    | 1073305 | CDS  | NA      | 1872    | 2648      | +      | NA       |
| 4    | 1075537 | CDS  | NA      | 2783    | 3451      | +      | NA       |
| 4    | 1074931 | CDS  | COG0186 | 3460    | 3996      | +      | KO       |
| 5    | 1075097 | CDS  | NA      | 4088    | 4534      | +      | NA       |
| 5    | 1074010 | CDS  | NA      | 4457    | 4849      | -      | NA       |
| 5    | 1075093 | CDS  | ROG5695 | 4958    | 5503      | +      | NA       |
| 5    | 1075089 | CDS  | COG0226 | 5632    | 6741      | +      | KO       |
| 5    | 1075103 | CDS  | NA      | 6796    | 7869      | +      | NA       |
| 5    | 1075105 | CDS  | NA      | 8075    | 8923      | +      | NA       |
| 5    | 1075096 | CDS  | ROG0189 | 8983    | 10149     | +      | NA       |
| 5    | 1071820 | CDS  | NA      | 10181   | 10723     | +      | NA       |
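The one-liner is dense, so here is a spelled-out sketch of the same logic, one rule per line. It assumes the earlier pipe-delimited layout, in which a code sits on a line of its own and the data rows that follow have a blank first cell; that layout is inferred from the command and is not shown above. The name fill_table is hypothetical, and `[ \t]+` stands in for `\s+`, which is a GNU awk extension.

```shell
# Sketch only: fill_table is a hypothetical wrapper around the one-liner's logic.
# Assumed input: pipe-delimited table, codes on their own lines, data rows
# starting with an empty (blank-padded) first cell.
fill_table() {
  awk -F '|' '
    # First cell is a non-zero number: remember the whole cell
    # (padding included) and drop the line.
    1 * $2 { id = $2; next }
    # Header and separator rows pass through unchanged.
    NR < 3 { print; next }
    # Otherwise replace the first run of blanks (the empty first cell)
    # with the remembered cell; sub() returns 1 on success, so the
    # pattern is true and the modified row is printed.
    sub(/[ \t]+/, id)
  ' "$@"
}
```

It can be called as `fill_table f`. Because the remembered cell keeps its padding, the column widths of the table survive, provided the empty cells are as wide as the code cells.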

Update for the input change:

This one-liner works for the new input and keeps the output format:

awk  'NF<2{id=$1;next}NR==1||sub("\\s{"length(id)"}",id)' file

Test again with the input data in f:

$ awk  'NF<2{id=$1;next}NR==1||sub("\\s{"length(id)"}",id)' f
code    IdGene  Type    COGgene PosLeft postRight   Strand  Function
1   1075082 CDS ROG0189 93  710 +   NA
8   1075089 CDS COG0226 5632    6741    +   [P] ABC-type phosphate transport system, periplasmic component
8   1075103 CDS NA  6796    7869    +   NA
9   1075105 CDS NA  8075    8923    +   NA
9   1075096 CDS ROG0189 8983    10149   +   NA
9   1071820 CDS NA  10181   10723   +   NA
10  1071880 CDS COG0642 10893   13316   +   [T] Signal transduction histidine kinase
10  1072052 CDS COG2204 13288   14586   +   [T] Response regulator containing CheY-like receiver, AAA-type
12  1075092 CDS NA  15525   16472   +   NA
13  1075087 CDS NA  16655   17371   +   NA
13  1074837 CDS NA  17383   17703   +   NA
13  1071956 CDS NA  17710   18168   +   NA
14  1071684 CDS NA  18251   18919   -   NA
15  1075519 CDS ROG5478 19044   19334   +   NA
27  1075067 CDS ROG8331 35989   36417   +   NA
27  1075056 CDS COG2244 36478   38019   +   [R] Membrane protein involved in the export
27  1075546 CDS COG1035 38016   39218   +   [C] Coenzyme F420-reducing hydrogenase, beta subunit
27  1074004 CDS ROG1263 39215   40375   +   NA
27  1075083 CDS COG1701 40406   40582   +   [S] Uncharacterized protein conserved in archaea
27  1075068 CDS COG0463 40593   41537   +   [M] Glycosyltransferases involved in cell wall biogenesis
27  1075064 CDS ROG2632 41534   42700   +   NA
27  1075066 CDS COG0463 42724   43656   +   [M] Glycosyltransferases involved in cell wall biogenesis
27  1075069 CDS COG1215 43671   44066   +   [M] Glycosyltransferases, probably involved in cell wall
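If GNU awk is not available (`\s` is a gawk extension, and not every awk accepts `{n}` repetition counts), the same carry-forward idea can be sketched with portable constructs only. This is a variant, not the command above: it drops the original indentation and joins the code to the row with a tab instead of preserving the column layout. The function name fill_codes is hypothetical, and the sketch assumes that a line with exactly one field is always a code and that every data row has at least two fields.

```shell
# Sketch only: fill_codes is a hypothetical helper, not part of the answer above.
# Assumptions: line 1 is the header; a line with a single field is a code;
# every data row has two or more fields and is indented with blanks or tabs.
fill_codes() {
  awk '
    BEGIN   { OFS = "\t" }            # separator placed between code and row
    NR == 1 { print; next }           # header passes through unchanged
    NF == 1 { code = $1; next }       # code line: remember it, print nothing
    NF > 1  {                         # data row
      sub(/^[ \t]+/, "")              # strip the leading indentation
      print code, $0                  # prepend the most recent code
    }
  ' "$@"
}
```

It reads the named files, or standard input when none are given, for example `fill_codes f > filled.tsv` (the output name is only an illustration). Blank lines are dropped because they match none of the rules.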
