
R - split alphanumeric char observations with column for each letter factor with value of numeric for each observation

I am not quite sure how to best word the title for what I want to do.

I have a data frame that looks like this:

 ID = c(1, 2, 3, 4, 5, 6, 7)
 observation = c("a2", NA, "b3", "c5", NA, "b", "a3")
 df <- data.frame(cbind(ID, observation))

 df

  ID observation
1  1          a2
2  2        <NA>
3  3          b3
4  4          c5
5  5        <NA>
6  6           b
7  7          a3

My desired output is a data frame that splits observations into numbers and letters, with a new column for each unique letter, where each row contains the number associated with that letter in the observation.

The desired output should look like this:

desired_df <- data.frame(cbind(ID, a = c(2, NA, 0, 0, 0 , 0, 3), 
                                   b = c(0, NA, 3, 0, 0, 0, 0),
                                   c = c(0, NA, 0, 5, 0, 0, 0)))
desired_df

  ID  a  b  c
1  1  2  0  0
2  2 NA NA NA
3  3  0  3  0
4  4  0  0  5
5  5  0  0  0
6  6  0  0  0
7  7  3  0  0

I've tried approaching this by splitting observations into letters and numbers with a regular expression and saving the result into a new column:

library(stringr)
char <- unlist(str_replace_all(observation, "[[:digit:]]", ""))
num <- unlist(str_extract(observation, "[[:digit:]]"))
df_new <- cbind(ID, char, num)
df_new

  ID char  num
1  1    a    2
2  2 <NA> <NA>
3  3    b    3
4  4    c    5
5  5 <NA> <NA>
6  6    b <NA>
7  7    a    3

Then I tried converting char to a factor and expanding it into binary indicator columns, based on the answer to this SO question:

df_new1 <- data.frame(cbind(df_new, sapply(levels(as.factor(char)),
                                           function(x) as.integer(x == char))))

  ID char  num  a  b  c
1  1    a    2  1  0  0
2  2 <NA> <NA> NA NA NA
3  3    b    3  0  1  0
4  4    c    5  0  0  1
5  5 <NA> <NA> NA NA NA
6  6    b <NA>  0  1  0
7  7    a    3  1  0  0

I then tried to replace each 1 with the corresponding value in df_new1$num for that row, based on the answer to this SO question:

df_new2 <- data.frame(with(df_new1, ifelse(df_new1 == 1, df_new1$num, 0)))

df_new2
  ID char num  a  b  c
1  1    0   0  1  0  0
2  0   NA  NA NA NA NA
3  0    0   0  0  2  0
4  0    0   0  0  0  3
5  0   NA  NA NA NA NA
6  0    0  NA  0 NA  0
7  0    0   0  2  0  0

This outputs the wrong result. I've been struggling to figure this out. I am OK with all non-1 values being replaced with 0 as long as the values in columns a, b, c are correct.

I'm not sure whether splitting letters and numbers into separate columns, and then replacing the binary indicators for the letters, is even the best approach to my original problem. I am open to any approach that works.

My real data frame is generated by a script that extracts patterns from .txt files, where the alphanumeric observations vary from file to file. I need something that will work for any unique letters that get assigned to the char column.

I appreciate any advice or help on figuring this out as I am a novice to R. I'm still getting familiar with SO etiquette and would appreciate any comments on how to improve the question and/or reproducible example.
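A likely reason the last step shows 1, 2, 3 where 2, 3, 5 were expected: if num is stored as a factor (older R versions converted strings to factors by default in data.frame), ifelse returns the factor's internal integer codes rather than the digits. A minimal sketch of that effect, assuming num is a factor; the vector below is a hypothetical stand-in for df_new1$num, not the asker's actual column:

```r
# Hypothetical stand-in for df_new1$num, stored as a factor
# with levels "2", "3", "5".
num <- factor(c("2", "3", "5", "3"))

# ifelse() drops the factor class and returns the internal level codes.
codes <- ifelse(c(TRUE, TRUE, TRUE, TRUE), num, 0)

# Converting through character first recovers the digits as written.
values <- as.numeric(as.character(num))

print(codes)   # 1 2 3 2
print(values)  # 2 3 5 3
```

Converting num with as.numeric(as.character(num)) before the replacement step would avoid this particular mismatch.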

You can use extract from tidyr to split observation into var and value columns, then use spread to reshape the table. Note that <NA> is now its own column because of the NA values, for example at ID == 2. A select gets rid of that column:

library(dplyr)
library(tidyr)

df %>%
  extract(observation, c("var", "value"), regex = "([a-z])?(\\d)?") %>%
  spread(var, value) %>%
  select(-`<NA>`)

Result: 结果:

  ID    a    b    c
1  1    2 <NA> <NA>
2  2 <NA> <NA> <NA>
3  3 <NA>    3 <NA>
4  4 <NA> <NA>    5
5  5 <NA> <NA> <NA>
6  6    3 <NA> <NA>
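
The columns above are character, with NA wherever a letter is absent. If numeric columns with 0 are preferred, which the question allows, one option is a base R post-processing step. A minimal sketch, where res is a hypothetical hand-typed copy of the result above rather than the output of the pipeline, and rows with a missing observation also become 0:

```r
# Hypothetical copy of the result shown above (character columns).
res <- data.frame(
  ID = 1:6,
  a = c("2", NA, NA, NA, NA, "3"),
  b = c(NA, NA, "3", NA, NA, NA),
  c = c(NA, NA, NA, "5", NA, NA),
  stringsAsFactors = FALSE
)

# Every column except ID holds values, whatever letters occur.
value_cols <- setdiff(names(res), "ID")

res[value_cols] <- lapply(res[value_cols], function(x) {
  x <- as.numeric(x)  # character digits become numbers, NA stays NA
  x[is.na(x)] <- 0    # assumption: an absent letter counts as 0
  x
})

print(res)
```

Because value_cols is derived from the column names, the same step works for any set of letters.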

Since you mentioned that non-digit values can be 0 or NA, here is another option:

library(tidyverse)
df %>%
  nest(-ID) %>%
  mutate(data = map(data, ~data.frame(key = gsub("\\d", "", unlist(.x)), val = gsub("\\D", "", unlist(.x))))) %>%
  unnest() %>%
  spread(key, val, fill = 0) %>%
  select(-ncol(.)) %>%
  replace(.=="", 0)

  # ID    a     b     c    
  # <fct> <chr> <chr> <chr>
# 1 1     2     0     0    
# 2 2     0     0     0    
# 3 3     0     3     0    
# 4 4     0     0     5    
# 5 5     0     0     0    
# 6 6     3     0     0    
# There were 14 warnings (use warnings() to see them)    
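
The same zero-filled table can also be built without nesting and without dropping a column by position. A minimal base R sketch, assuming every observation is a run of lowercase letters optionally followed by digits, and that a missing observation or a missing number should become 0; the names letter, number, letters_seen, and wide are illustrative:

```r
ID <- c(1, 2, 3, 4, 5, 6, 7)
observation <- c("a2", NA, "b3", "c5", NA, "b", "a3")

# Letter part and number part of each observation (NA stays NA).
letter <- sub("[0-9]+$", "", observation)
number <- as.numeric(sub("^[a-z]+", "", observation))
number[is.na(number)] <- 0  # assumption: a missing number counts as 0

# One zero-filled column per letter that actually occurs.
letters_seen <- sort(unique(letter[!is.na(letter)]))
wide <- matrix(0, nrow = length(ID), ncol = length(letters_seen),
               dimnames = list(NULL, letters_seen))

# Write each number into the cell at (row, column of its letter).
rows <- which(!is.na(letter))
wide[cbind(rows, match(letter[rows], letters_seen))] <- number[rows]

result <- data.frame(ID, wide)
print(result)
```

The columns are derived from the letters present in the data, so the sketch adapts to files with different letters, and the result is numeric rather than character.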
