Creating columns from unorganized data based on substrings

I faced the following problem with my thesis data. I have a data frame with horizontally unorganized string cells after the first column, "id". I want to organize the strings within each row so that all strings beginning with the same first 4 characters end up in the same column.

Since there is a limited number of relevant categories (fewer than 20), I could do this manually, first for "Arra", then for "Comm", and so on. I tried this with grepl but could not get it to return the original string of the cell; I got only TRUE/FALSE. I would appreciate your help a lot!
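For what it's worth, TRUE/FALSE is what grepl is designed to return: it reports whether each element matches, not the matching text. A minimal base R sketch of two ways to get the strings themselves; the vector `cells` is made up here and stands in for one row of the data:

```r
# Hypothetical vector standing in for the cells of one row
cells <- c("Commitment 100", "Lead Mgmt 15", "Arranger 50", "")

# grepl() only reports whether each element matches the pattern
is_arra <- grepl("^Arra", cells)

# Option 1: use that logical vector as an index into the original strings
arra_by_index <- cells[is_arra]

# Option 2: ask grep() to return the matching values directly
arra_by_grep <- grep("^Arra", cells, value = TRUE)

print(is_arra)
print(arra_by_index)
```

Either option works per category, but repeating it for every stub gets tedious, which is why reshaping the whole data frame at once is usually more convenient.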

My current data looks like this (I left NA cells empty):

id  col2              col3              col4           col5
3   Commitment 100    Lead Mgmt 15      Arranger 50
8   Arrangement 20    Front-end 80
16  Lead mgmt 40      Commitmnt 20
17
20  Arranger 50

And this is what it should look like:

id  Arra           Comm            Fron         Lead
3   Arranger 50    Commitment 100               Lead Mgmt 15
8   Arrangement 20                 Front-end 80
16                 Commitmnt 20                 Lead mgmt 40
17
20  Arranger 50

Here's one possible approach:

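library(data.table)
# Melt to long form, drop empty cells, tag each value with its 4-character
# stub, then cast back to wide form with one column per stub.
# id 17 has no non-empty cells, so it drops out of the result.
dcast(melt(as.data.table(mydf), "id", na.rm = TRUE)[value != ""][
  , ind := substr(value, 1, 4)], id ~ ind, value.var = "value", fill = "")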
#    id           Arra           Comm         Fron         Lead
# 1:  3    Arranger 50 Commitment 100              Lead Mgmt 15
# 2:  8 Arrangement 20                Front-end 80             
# 3: 16                  Commitmnt 20              Lead mgmt 40
# 4: 20    Arranger 50   
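
For readers without data.table, the steps packed into the one-liner above (melt to long form, drop empty cells, take the 4-character stub, cast back to wide) can be sketched in base R. This is only an illustration, not part of the original answer; the names `value_cols`, `long`, and `wide` are made up here, and the sample data is rebuilt inline so the block runs on its own.

```r
# Sample data with the same shape as in the question (col5 is all NA)
mydf <- data.frame(
  id   = c(3L, 8L, 16L, 17L, 20L),
  col2 = c("Commitment 100", "Arrangement 20", "Lead mgmt 40", "", "Arranger 50"),
  col3 = c("Lead Mgmt 15", "Front-end 80", "Commitmnt 20", "", ""),
  col4 = c("Arranger 50", "", "", "", ""),
  col5 = NA_character_,
  stringsAsFactors = FALSE
)

value_cols <- setdiff(names(mydf), "id")

# Long form: one row per (id, cell), stacking the value columns
long <- data.frame(
  id    = rep(mydf$id, times = length(value_cols)),
  value = unlist(mydf[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop empty and missing cells, then derive the 4-character stub
long <- long[!is.na(long$value) & long$value != "", ]
long$ind <- substr(long$value, 1, 4)

# Wide form: one column per stub (reshape() names them "value.<stub>")
wide <- reshape(long, idvar = "id", timevar = "ind", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))

# Sort rows by id and stub columns alphabetically
wide <- wide[order(wide$id), c("id", sort(setdiff(names(wide), "id")))]
rownames(wide) <- NULL
print(wide)
```

Unlike the data.table call with fill = "", this sketch leaves missing cells as NA. As in the data.table result, id 17 is dropped because it has no non-empty cells.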

And, with similar logic, in the "tidyverse":

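library(tidyverse)
# Replace NA with "" so every value column is character
mydf[is.na(mydf)] <- ""
mydf %>%
  gather(var, val, starts_with("col")) %>%   # long form: one row per cell
  filter(val != "") %>%                      # drop empty cells
  mutate(ind = substr(val, 1, 4)) %>%        # 4-character stub
  select(-var) %>%                           # original column name no longer needed
  spread(ind, val)                           # wide form: one column per stub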
#   id           Arra           Comm         Fron         Lead
# 1  3    Arranger 50 Commitment 100         <NA> Lead Mgmt 15
# 2  8 Arrangement 20           <NA> Front-end 80         <NA>
# 3 16           <NA>   Commitmnt 20         <NA> Lead mgmt 40
# 4 20    Arranger 50           <NA>         <NA>         <NA>
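
In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same pipeline with the newer functions, assuming tidyr 1.1.0 or later (for the names_sort argument) and with the sample data rebuilt inline so the block is self-contained; this variant is not from the original answer:

```r
library(dplyr)
library(tidyr)

# Sample data with the same shape as in the question (col5 is all NA)
mydf <- data.frame(
  id   = c(3L, 8L, 16L, 17L, 20L),
  col2 = c("Commitment 100", "Arrangement 20", "Lead mgmt 40", "", "Arranger 50"),
  col3 = c("Lead Mgmt 15", "Front-end 80", "Commitmnt 20", "", ""),
  col4 = c("Arranger 50", "", "", "", ""),
  col5 = NA_character_,
  stringsAsFactors = FALSE
)

wide <- mydf %>%
  # Long form: one row per cell
  pivot_longer(starts_with("col"), names_to = "var", values_to = "val") %>%
  # Drop missing and empty cells
  filter(!is.na(val), val != "") %>%
  # 4-character stub that decides the target column
  mutate(ind = substr(val, 1, 4)) %>%
  select(-var) %>%
  # Wide form: one column per stub, columns in alphabetical order
  pivot_wider(names_from = ind, values_from = val, names_sort = TRUE)

print(wide)
```

Without names_sort = TRUE, pivot_wider() orders the new columns by first appearance rather than alphabetically.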

Sample data:

mydf <- structure(list(id = c(3L, 8L, 16L, 17L, 20L), col2 = c("Commitment 100", 
    "Arrangement 20", "Lead mgmt 40", "", "Arranger 50"), col3 = c("Lead Mgmt 15", 
    "Front-end 80", "Commitmnt 20", "", ""), col4 = c("Arranger 50", 
    "", "", "", ""), col5 = c(NA, NA, NA, NA, NA)), .Names = c("id", 
    "col2", "col3", "col4", "col5"), row.names = c(NA, 5L), class = "data.frame")

If there are duplicated stubs in your original data, for example, if "col5" in row 1 had another "commitment" value:

mydf$col5[1] <- "Commitment 99"

you can try something like this:

dcast(melt(as.data.table(mydf), "id", na.rm = TRUE)[value != ""][
  , ind := substr(value, 1, 4)], 
  id ~ ind + rowid(id, ind), value.var = "value", fill = "")
#    id         Arra_1         Comm_1        Comm_2       Fron_1       Lead_1
# 1:  3    Arranger 50 Commitment 100 Commitment 99              Lead Mgmt 15
# 2:  8 Arrangement 20                              Front-end 80             
# 3: 16                  Commitmnt 20                            Lead mgmt 40
# 4: 20    Arranger 50                                                       

or this:

# Keep only the first value for each id/stub combination
dcast(melt(as.data.table(mydf), "id", na.rm = TRUE)[value != ""][
  , ind := substr(value, 1, 4)], 
  id ~ ind, value.var = "value", fun.aggregate = function(x) x[1], fill = "")
#    id           Arra           Comm         Fron         Lead
# 1:  3    Arranger 50 Commitment 100              Lead Mgmt 15
# 2:  8 Arrangement 20                Front-end 80             
# 3: 16                  Commitmnt 20              Lead mgmt 40
# 4: 20    Arranger 50                                         

depending on your desired output.
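
To decide which of the two variants you need, it may help to check first whether any id actually has a repeated stub. A minimal base R sketch; the helper names (`cells`, `ids`, `long`, `dup_pairs`) are made up for illustration, and the sample data is rebuilt inline with the extra "Commitment 99" value in col5:

```r
# Sample data including the duplicated "Comm" stub for id 3
mydf <- data.frame(
  id   = c(3L, 8L, 16L, 17L, 20L),
  col2 = c("Commitment 100", "Arrangement 20", "Lead mgmt 40", "", "Arranger 50"),
  col3 = c("Lead Mgmt 15", "Front-end 80", "Commitmnt 20", "", ""),
  col4 = c("Arranger 50", "", "", "", ""),
  col5 = c("Commitment 99", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

value_cols <- setdiff(names(mydf), "id")

# Stack all cells and keep track of which id each one belongs to
cells <- unlist(mydf[value_cols], use.names = FALSE)
ids   <- rep(mydf$id, times = length(value_cols))
keep  <- !is.na(cells) & cells != ""

# One row per non-empty cell: its id and its 4-character stub
long <- data.frame(
  id  = ids[keep],
  ind = substr(cells[keep], 1, 4),
  stringsAsFactors = FALSE
)

# id/stub pairs that occur more than once
dup_pairs <- unique(long[duplicated(long), ])
print(dup_pairs)
```

If `dup_pairs` has no rows, the plain `id ~ ind` cast is enough; otherwise you have to choose between numbering the duplicates and keeping only the first one.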
