Convert a single column into multiple columns based on delimiter in R
I have the following dataframe:
ID Parts
-- -----
1 A:B::
2 X2:::
3 ::J4:
4 A:C:D:G4:X6
I would like to convert the Parts column into multiple columns by splitting on the : delimiter, so it should look like:
ID A B X2 J4 C D G4 X6 ........
-- - - -- -- - - -- --
1 A B na na na na na na
2 na na X2 na na na na na
3 na na na J4 na na na na
4 A na na na C D G4 X6
where I would not know the number of potential columns in advance.
I have met my match on this one: I can split by the delimiter with strsplit(), but only with a fixed number of entities in the Parts column.
You can use a combination of tidyr::separate, tidyr::pivot_longer, and tidyr::pivot_wider. First, count the delimiters with stringr::str_count to determine how many columns to split Parts into. This is the number of pieces per row, not the number of unique values (see How it works):
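library(dplyr)
library(tidyr)
library(stringr)
# Maximum number of pieces per row = most delimiters in any row, plus one
n_col <- max(stringr::str_count(df$Parts, ":")) + 1
df %>%
  # fill = "right" pads shorter rows with NA instead of raising a warning
  tidyr::separate(Parts, into = paste0("col", 1:n_col), sep = ":", fill = "right") %>%
  # Turn empty pieces into NA; the integer ID column is left alone
  dplyr::mutate(across(-ID, ~dplyr::na_if(., ""))) %>%
  tidyr::pivot_longer(-ID) %>%
  dplyr::select(-name) %>%
  tidyr::drop_na() %>%
  tidyr::pivot_wider(id_cols = ID,
                     names_from = value)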
ID A B X2 J4 C D G4 X6
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 A B NA NA NA NA NA NA
2 2 NA NA X2 NA NA NA NA NA
3 3 NA NA NA J4 NA NA NA NA
4 4 A NA NA NA C D G4 X6
How it works
You do not need to know the number of unique values with this code; the pivots take care of that. What you do need to know is how many new columns Parts will be split into by separate. That is easy to find: count the delimiters in each row with str_count, take the maximum, and add one. This gives separate enough columns to hold every piece of Parts.
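As a small illustration of that counting step, here is a sketch that rebuilds the Parts values from the question as a plain character vector (the names parts and n_delim are only for this example):

```r
library(stringr)

# The Parts values from the question, typed in by hand for this sketch
parts <- c("A:B::", "X2:::", "::J4:", "A:C:D:G4:X6")

# Number of ":" delimiters in each string
n_delim <- str_count(parts, ":")
n_delim
#> [1] 3 3 3 4

# Pieces per row is delimiters + 1; the widest row decides the column count
n_col <- max(n_delim) + 1
n_col
#> [1] 5
```

The last row has four delimiters, so five columns are enough for every row.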
This works because pivot_longer, once the name column is dropped, leaves a two-column dataframe with repeated ID values and one column holding the delimited values of Parts, that is, one row per ID, Parts pairing. When you then use pivot_wider, a column is automatically created for each unique value of Parts, and the value is retained within that column. pivot_wider automatically fills with NA wherever an ID and Parts combination is not found.
If need be, try running this one pipe step at a time to better understand what each stage does.
Data
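For example, stopping just before pivot_wider shows the long ID/value table the explanation refers to. This is only a sketch: the data frame is rebuilt inline, the column count is hard-coded to 5 rather than computed, and filter() stands in for the na_if/drop_na steps.

```r
library(dplyr)
library(tidyr)

# Same data as in the question, rebuilt inline for this sketch
df <- data.frame(ID = 1:4,
                 Parts = c("A:B::", "X2:::", "::J4:", "A:C:D:G4:X6"))

long <- df %>%
  # 5 = max number of ":" in any row, plus one (hard-coded here)
  separate(Parts, into = paste0("col", 1:5), sep = ":", fill = "right") %>%
  pivot_longer(-ID) %>%
  # Drop padding (NA) and empty pieces ("") in one step
  filter(!is.na(value), value != "") %>%
  select(-name)

long
# Nine rows, one per ID/value pair:
# (1, A), (1, B), (2, X2), (3, J4), (4, A), (4, C), (4, D), (4, G4), (4, X6)
```

Each distinct entry in value is what pivot_wider later turns into its own column.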
lines <- "
ID Parts
1 A:B::
2 X2:::
3 ::J4:
4 A:C:D:G4:X6
"
df <- read.table(text = lines, header = TRUE)
Could the separate function from tidyr be what you are looking for?
https://tidyr.tidyverse.org/reference/separate.html
It might require some fancy regex implementation, but could potentially work.
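For what it is worth, the fixed-number-of-pieces problem can also be sidestepped without tidyr. The following is a rough base R sketch, not something taken from the answers above: it splits with strsplit, collects the unique values, and builds one column per value. The names pieces, vals, wide, and result are made up for this example.

```r
df <- data.frame(ID = 1:4,
                 Parts = c("A:B::", "X2:::", "::J4:", "A:C:D:G4:X6"),
                 stringsAsFactors = FALSE)

# Split each row on ":" and drop the empty pieces left by repeated delimiters
pieces <- lapply(strsplit(df$Parts, ":", fixed = TRUE),
                 function(p) p[nzchar(p)])

# Unique values, in order of first appearance, become the column names
vals <- unique(unlist(pieces))

# For each row, keep a value where it occurs and use NA elsewhere
wide <- t(sapply(pieces, function(p) ifelse(vals %in% p, vals, NA_character_)))
colnames(wide) <- vals

result <- data.frame(ID = df$ID, wide, stringsAsFactors = FALSE)
names(result)
#> [1] "ID" "A"  "B"  "X2" "J4" "C"  "D"  "G4" "X6"
```

This assumes the values are valid column names; anything else would be altered by data.frame unless check.names = FALSE is set.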