
Using R to split a string into multiple columns instead of a vector in one column

I want to split one column of my dataset into multiple columns on a given delimiter: "|".

My dataset looks like this:

vname<-c("x1", "x2", "x3","x4")
label<-c("1,Eng |2,Man", "1,yes|2,no|3,dont know", "1,never|2,sometimes|3,usually|4,always", "1,yes|2,No|3,dont know")
df<-data.frame(vname, label)

So I want to split the column `label` into multiple columns on the symbol "|". I used stringr::str_split to do this, and my code is as follows:

cd2<-df %>%
  select(vname, everything())%>%
  mutate(label=str_split(label, " \\| "))

However, the result is a vector inside the label column. It looks like this:

vname  label
x1     c("1,Eng","2,Man")
x2     c("1,yes","2,no", "3,dont know")
....

My question is how to get the expected result, like this:

vname  label1   label2       label3        label4
x1     1,Eng    2,Man
x2     1,yes    2,no         3,dont know
x3     1,never  2,sometimes  3,usually     4,always
...

Thanks very much for any help.


dput(head(cd2, 10)) 
structure(list(variable = c("x2", "x8", "x9", "x10", "x13", "x14", 
"x15", "x20", "x22", NA), vname = c("consenting_language", "county", 
"respondent", "residence", "language", "int_q1", "int_q2", "int_q4", 
"int_q5", "int_q6"), label = c("Consenting Language", "County", 
"Respondent Type", "Residence", "Interview language ", "1. What was your sex at birth?", 
"2. How would you describe your current sexual orientation?", 
"4. What is the highest level of education you completed?", "5. What is your current marital status?", 
"<div class=\"rich-text-field-label\"><p>6. Is <span style=\"color: #3598db;\">regular </span>your partner currently living with you now, or does s/he stay elsewhere?</p></div>"
), value = c("1, English | 2, Kiswahili", "1, County011 | 2, County014  | 3, County002| 4, County006  | 5, County010 | 6, County008  | 7, County005  | 8, County003 | 9, County012| 10, County004 | 11, County009  | 12, County001 | 13, County015 | 14, County007 | 15, County012", 
"1, FSW | 2, MSM | 3, AGYW", "1, Urban | 2, Peri urban | 3, Rural", 
"1, English | 2, Kiswahili", "1, Male | 2, Female", "1, Homosexual/Gay | 2, Bisexual | 3, Heterosexual/Straight | 4, Transgender Male | 5, Transgender Female | 96, Other | 98, Don't Know | 99, Decline to state", 
"1,None  | 2,Nursery/kindergarten | 3,Primary | 4,Secondary | 5,Tertiary/Vocational | 6,College/University | 7,Adult education | 96,Other", 
"1, Single/Not married | 2, Married | 3, Cohabiting | 4, Divorced | 5, Separated | 6, Widowed | 7, In a relationship", 
"1, Living with You | 2, Staying Elsewhere")), row.names = c(NA, 
10L), class = "data.frame")

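With the code shown, `str_split` returns a list column (the pattern should probably allow zero or more spaces around the `|`, because the example data has no spaces there). We can then use `unnest_wider` to spread the list into new columns: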

library(dplyr)
library(stringr)
library(tidyr)
df %>%
  select(vname, everything())%>%
  mutate(label=str_split(label, "\\s*\\|\\s*")) %>% 
  unnest_wider(where(is.list), names_sep = "")

-output

# A tibble: 4 × 5
  vname label1  label2      label3      label4  
  <chr> <chr>   <chr>       <chr>       <chr>   
1 x1    1,Eng   2,Man       <NA>        <NA>    
2 x2    1,yes   2,no        3,dont know <NA>    
3 x3    1,never 2,sometimes 3,usually   4,always
4 x4    1,yes   2,No        3,dont know <NA>    
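
To make the split-then-widen step concrete, here is a minimal base-R sketch of the same idea, not the tidyr implementation: split each string into a list element, pad the shorter elements with `NA`, and bind them into columns. The names `pieces`, `padded`, `wide` and `res` are illustrative only.

```r
# Sample data from the question
vname <- c("x1", "x2", "x3", "x4")
label <- c("1,Eng |2,Man", "1,yes|2,no|3,dont know",
           "1,never|2,sometimes|3,usually|4,always", "1,yes|2,No|3,dont know")
df <- data.frame(vname, label, stringsAsFactors = FALSE)

# Split on "|" with optional surrounding whitespace; the result is a list
pieces <- strsplit(df$label, "\\s*\\|\\s*")

# Pad every element with NA up to the length of the longest one
n_max <- max(lengths(pieces))
padded <- lapply(pieces, function(p) c(p, rep(NA_character_, n_max - length(p))))

# Bind the padded vectors row-wise and name the columns label1, label2, ...
wide <- do.call(rbind, padded)
colnames(wide) <- paste0("label", seq_len(n_max))

res <- cbind(df["vname"], as.data.frame(wide, stringsAsFactors = FALSE))
res
```

The padding step is what `unnest_wider` handles automatically when rows have different numbers of pieces.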

This can also be done with `separate`:

library(tidyr)
 df %>%
   separate(label, into = str_c('label', 
   seq_len(max(str_count(.$label, fixed("|"))) + 1)), 
      sep = "\\|", fill = "right")

-output

 vname  label1      label2      label3   label4
1    x1  1,Eng        2,Man        <NA>     <NA>
2    x2   1,yes        2,no 3,dont know     <NA>
3    x3 1,never 2,sometimes   3,usually 4,always
4    x4   1,yes        2,No 3,dont know     <NA>
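
The `into` vector in the `separate` call is sized as the maximum delimiter count plus one, because a string containing n delimiters splits into n + 1 pieces. A small base-R check of that arithmetic, using two strings from the question (the variable names are illustrative):

```r
x <- c("1,Eng |2,Man", "1,never|2,sometimes|3,usually|4,always")

# Count the literal "|" characters in each string
n_delims <- lengths(regmatches(x, gregexpr("|", x, fixed = TRUE)))

# Count the pieces produced by splitting on the same delimiter
n_pieces <- lengths(strsplit(x, "|", fixed = TRUE))

# Column names sized for the row with the most pieces
into <- paste0("label", seq_len(max(n_delims) + 1))
```

Sizing `into` from the longest row, together with `fill = "right"`, lets shorter rows be padded with `NA` instead of triggering a warning.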

Or, using the OP's data 'cd2', which is the case where there are spaces before and after the `|`:

cd2new <- cd2 %>% 
   separate(value, into = str_c('value', 
   seq_len(max(str_count(.$value, fixed("|"))) + 1)), 
      sep = "\\s*\\|\\s*", fill = "right")

-output

> head(cd2new, 2)
  variable               vname               label       value1       value2       value3       value4       value5
1       x2 consenting_language Consenting Language   1, English 2, Kiswahili         <NA>         <NA>         <NA>
2       x8              county              County 1, County011 2, County014 3, County002 4, County006 5, County010
        value6       value7       value8       value9       value10       value11       value12       value13
1         <NA>         <NA>         <NA>         <NA>          <NA>          <NA>          <NA>          <NA>
2 6, County008 7, County005 8, County003 9, County012 10, County004 11, County009 12, County001 13, County015
        value14       value15
1          <NA>          <NA>
2 14, County007 15, County012

You can do this simply by using separate() from {tidyr}:

library(tidyverse)

# cd2 is the OP's data frame; size `into` for the row with the most pieces
cd2 %>% as_tibble() %>% 
  separate(value, sep = "\\s*\\|\\s*", fill = "right",
           into = paste0("value", seq_len(max(str_count(.$value, "\\|")) + 1)))
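
In tidyr 1.3.0 and later, `separate()` is superseded by `separate_wider_delim()`, which can number the new columns automatically. The sketch below assumes that version of tidyr is installed; `dat` is a hypothetical two-row data frame in the shape of the OP's `value` column, not the full `cd2`.

```r
library(tidyr)
library(stringr)

# Hypothetical sample in the same shape as the OP's `value` column
dat <- data.frame(
  vname = c("respondent", "int_q1"),
  value = c("1, FSW | 2, MSM | 3, AGYW", "1, Male | 2, Female"),
  stringsAsFactors = FALSE
)

wide <- separate_wider_delim(
  dat, value,
  delim = regex("\\s*\\|\\s*"),  # delim is a fixed string unless wrapped in regex()
  names_sep = "",                # auto-name the pieces value1, value2, ...
  too_few = "align_start"        # pad shorter rows with NA on the right
)
wide
```

With `names_sep` supplied and no `names`, there is no need to count delimiters up front to build the column names.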

