Separate rows in multiple columns with differing number of commas
I use R and I have a data frame with 3 columns that contain values separated by ",".
Here's what it looks like:
col_A | col_B | col_C |
---|---|---|
first_name,last_name,age | John,Appleseed,23 | Steve,Jobs, 33 |
I want each comma-separated value to go into its own row. So it should look like this:
col_A | col_B | col_C |
---|---|---|
first_name | John | Steve |
last_name | Appleseed | Jobs |
age | 23 | 33 |
I managed to do it like this:
col_A <- strsplit(df$col_A, split = ",")
col_B <- strsplit(df$col_B, split = ",")
col_C <- strsplit(df$col_C, split = ",")
df2 <- data.frame(col_A = unlist(col_A),
                  col_B = unlist(col_B),
                  col_C = unlist(col_C))
The problem is that the table is messy: sometimes the cells have different numbers of commas, so when I use strsplit() the resulting lists do not have the same number of elements, and data.frame() fails when its arguments differ in length. To illustrate, sometimes I will have 3 elements separated by commas in col_A, while there are 4 commas in col_B and col_C, and vice versa. Here's an example:
col_A | col_B | col_C |
---|---|---|
first_name,last_name,age | John,Appleseed,23, | Steve,Jobs, 33, |
How can I get rid of this formatting problem? Adding commas before using str_split doesn't seem like a good solution to me.
You can use str_remove() across all columns to get rid of the trailing commas. Then you can use separate_rows() to get what you want. This will not affect the output for rows without trailing commas.
library(tidyverse)
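df1 <- tibble::tribble(
  ~col_A, ~col_B, ~col_C,
  "first_name,last_name,age", "John,Appleseed,23", "Steve,Jobs, 33"
)
df2 <- tibble::tribble(
  ~col_A, ~col_B, ~col_C,
  "first_name,last_name,age", "John,Appleseed,23,", "Steve,Jobs, 33,"
)
# Strip one trailing comma from every column, then split each cell into rows.
# everything() is spelled out because recent dplyr versions deprecate
# calling across() without a column selection.
df1 %>%
  mutate(across(everything(), ~ str_remove(.x, ",$"))) %>%
  separate_rows(everything(), sep = ",")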
#> # A tibble: 3 x 3
#> col_A col_B col_C
#> <chr> <chr> <chr>
#> 1 first_name John "Steve"
#> 2 last_name Appleseed "Jobs"
#> 3 age 23 " 33"
# Same pipeline on the data with trailing commas; the result is identical.
df2 %>%
  mutate(across(everything(), ~ str_remove(.x, ",$"))) %>%
  separate_rows(everything(), sep = ",")
#> # A tibble: 3 x 3
#> col_A col_B col_C
#> <chr> <chr> <chr>
#> 1 first_name John "Steve"
#> 2 last_name Appleseed "Jobs"
#> 3 age 23 " 33"
Created on 2021-03-02 by the reprex package (v0.3.0)
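Note that the pattern ",$" removes only a single trailing comma, and the split leaves the leading space in " 33". If a cell can end in several commas, a variant along the following lines might help. This is only a sketch: `df3` is a hypothetical example, and the ",+$" pattern and the final `str_trim()` step are assumptions about what the messy data looks like.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical data: several trailing commas and a space after a comma
df3 <- tibble::tribble(
  ~col_A, ~col_B, ~col_C,
  "first_name,last_name,age", "John,Appleseed,23,,,", "Steve,Jobs, 33,"
)

res <- df3 %>%
  # ",+$" removes one or more commas at the end of the string
  mutate(across(everything(), ~ str_remove(.x, ",+$"))) %>%
  separate_rows(everything(), sep = ",") %>%
  # trim whitespace left around the split values
  mutate(across(everything(), str_trim))

res
```

This still assumes that, once the trailing commas are gone, every column holds the same number of values per row.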
Maybe you can use regmatches like below:
list2DF(lapply(df, function(x) unlist(regmatches(x, gregexpr("\\w+", x)))))
which gives
col_A col_B col_C
1 first_name John Steve
2 last_name Appleseed Jobs
3 age 23 33
Data
> dput(df)
structure(list(col_A = "first_name,last_name,age", col_B = "John,Appleseed,23,,,,",
col_C = "Steve,Jobs, 33"), row.names = c(NA, -1L), class = "data.frame")
> df
col_A col_B col_C
1 first_name,last_name,age John,Appleseed,23,,,, Steve,Jobs, 33
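One caveat with the "\\w+" pattern is that it treats every non-word character as a separator, so a value such as "Jean-Luc" or "3.5" would be broken into two pieces. A base R sketch that splits on commas only, drops empty pieces, and pads shorter columns with NA could look like this. The helper name `split_column` is made up for illustration, and dropping all empty pieces (including ones in the middle of a string) is an assumption that may not suit every data set.

```r
df <- data.frame(
  col_A = "first_name,last_name,age",
  col_B = "John,Appleseed,23,,,,",
  col_C = "Steve,Jobs, 33"
)

# Hypothetical helper: split on literal commas, trim whitespace,
# and drop the empty strings produced by repeated or trailing commas.
split_column <- function(x) {
  parts <- trimws(unlist(strsplit(x, ",", fixed = TRUE)))
  parts[nzchar(parts)]
}

pieces <- lapply(df, split_column)

# Pad every column to the longest length; `length<-` fills with NA
n <- max(lengths(pieces))
padded <- lapply(pieces, function(p) {
  length(p) <- n
  p
})

result <- as.data.frame(padded)
result
```

Padding with NA means data.frame construction no longer fails when one column really does hold fewer values, at the cost of leaving it to you to decide how those NA cells should be treated.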