
Separating a column with multiple different entries with tidyr

I am trying to split one column of a data frame, which lists the active period(s) of several artists/bands, into two columns (start_of_career, end_of_career). The column is of class character. I tried tidyr's separate function for this; when I run it, the split result is printed in the console, but the data frame itself is unchanged, so I assume it is not working properly.

Here is a made-up example of the data I want to split:

Column A    Column B
Artist A    1995-present
Artist B    1995-1997, 2008, 2010-present

As you can see, some rows consist of only a start and an end date, while others have several dates. All I actually need is the first value and the last one, e.g. for Artist B I need only start_of_career 1995 and end_of_career "present". But I have not been able to solve this.

The code I used was:

library(tidyr)
df %>% separate(col = period_active, into = c('start_of_career', 'end_of_career'), sep = '-')

I also tried other separators (",", " "), but that did not work either.

I also tried:

df$start_of_career = strsplit(df$period_active, split = '-')

But this did not work either.
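
One likely reason the data frame looked unchanged is that separate() returns a new data frame rather than modifying df in place, so the result has to be assigned. The sketch below assumes a column named period_active as in the question (the artist column name is made up for illustration). It also shows that splitting on "-" alone leaves the middle dates in the second column.

```r
library(tidyr)

# Hypothetical data; `period_active` is the column name used in the question,
# `artist` is an assumed name for the other column.
df <- data.frame(
  artist = c("Artist A", "Artist B"),
  period_active = c("1995-present", "1995-1997, 2008, 2010-present")
)

# separate() returns a new data frame; `df` itself is not modified.
out <- separate(df, col = period_active,
                into = c("start_of_career", "end_of_career"),
                sep = "-", extra = "drop")

names(df)          # still "artist" "period_active"
out$end_of_career  # "present" "1997, 2008, 2010"
```

So assigning the result fixes the "nothing changed" symptom, but the separator still needs to skip the intermediate dates, which is what the answers address.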

Using df as defined reproducibly in the Note at the end, remove everything except the first and last parts of Column B, then separate what is left.

library(dplyr)
library(tidyr)

df %>%
  mutate(`Column B` = sub("-.*-", "-", `Column B`)) %>%
  separate(`Column B`, c("start", "end"))
##   Column A start     end
## 1 Artist A  1995 present
## 2 Artist B  1995 present
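
To see why the sub() call above works, it may help to run it on the raw strings alone. This is a minimal base R sketch: the pattern "-.*-" is greedy, so it spans from the first dash to the last dash, and a string with only one dash is left untouched.

```r
x <- c("1995-present", "1995-1997, 2008, 2010-present")

# Greedy match from the first "-" to the last "-", replaced by a single "-".
# The first string has only one dash, so the pattern does not match it.
cleaned <- sub("-.*-", "-", x)

cleaned  # "1995-present" "1995-present"
```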

Note

df <- 
structure(list(`Column A` = c("Artist A", "Artist B"), `Column B` = c("1995-present", 
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA, 
-2L))

Using base R

df <- cbind(df[1], read.table(text = sub("-[0-9, ]+", "", df$`Column B`),
    header = FALSE, col.names = c("start", "end"), sep = "-"))

Output:

> df
  Column A start     end
1 Artist A  1995 present
2 Artist B  1995 present

We could do this with separate as well:

library(tidyr)
separate(df, `Column B`, into = c("start", "end"), sep = "-[^A-Za-z]*")
  Column A start     end
1 Artist A  1995 present
2 Artist B  1995 present
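
The separator regex "-[^A-Za-z]*" can be checked in isolation with strsplit(). The sketch below is only an illustration of the pattern, not part of the answer's code. It also shows a caveat: if the last entry is a year rather than a word such as "present", the separator swallows it, so this approach appears to rely on the end value starting with a letter.

```r
x <- c("1995-present", "1995-1997, 2008, 2010-present")

# The separator is a dash plus every following non-letter character,
# so it consumes "-1997, 2008, 2010-" in one go.
parts <- strsplit(x, "-[^A-Za-z]*")
parts[[2]]  # "1995" "present"

# Caveat: a purely numeric end value is consumed by the separator.
numeric_end <- strsplit("1995-1997", "-[^A-Za-z]*")[[1]]
numeric_end  # "1995"
```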

Data

df <- structure(list(`Column A` = c("Artist A", "Artist B"), 
`Column B` = c("1995-present", 
"1995-1997, 2008, 2010-present")), class = "data.frame",
 row.names = c(NA, 
-2L))

We could use separate_rows and then filter for the first and last row of each group:

library(tidyr)
library(dplyr)

df %>% 
  separate_rows(Column.B) %>% 
  group_by(Column.A) %>% 
  filter(row_number()==1 | row_number()==n()) %>% 
  mutate(Column.C = c("start", "end"))
  Column.A Column.B Column.C
  <chr>    <chr>    <chr>  
1 Artist A 1995     start  
2 Artist A present  end    
3 Artist B 1995     start  
4 Artist B present  end   

Data:

df <- structure(list(Column.A = c("Artist A", "Artist B"), Column.B = c("1995-present", 
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA, 
-2L))
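
If one row per artist is wanted rather than the long format, a possible variation is to summarise each group with first() and last() after separate_rows(). This is a sketch under the same data assumptions as above, not the answerer's code.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Column.A = c("Artist A", "Artist B"),
  Column.B = c("1995-present", "1995-1997, 2008, 2010-present")
)

# One row per piece, then keep the first and last piece of each artist.
wide <- df %>%
  separate_rows(Column.B) %>%
  group_by(Column.A) %>%
  summarise(start = first(Column.B), end = last(Column.B))

wide$start  # "1995" "1995"
wide$end    # "present" "present"
```

Unlike filtering on row numbers and labelling with c("start", "end"), this also copes with an artist that has a single entry, in which case start and end are the same value.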

Use strsplit and then pick the first and the last entry.

library(dplyr)

df %>% 
  rowwise() %>% 
  mutate(splitrow = strsplit(`Column B`, "-"), 
    start_of_career = splitrow[1], 
    end_of_career = splitrow[length(splitrow)], 
    splitrow = NULL) %>% 
  ungroup()
# A tibble: 2 × 4
  `Column A` `Column B`                    start_of_career end_of_career
  <chr>      <chr>                         <chr>           <chr>
1 Artist A   1995-present                  1995            present
2 Artist B   1995-1997, 2008, 2010-present 1995            present

Data

df <- structure(list(`Column A` = c("Artist A", "Artist B"), `Column B` = c("1995-present",
"1995-1997, 2008, 2010-present")), class = "data.frame", row.names = c(NA,
-2L))

Another option: use strsplit and return a list of the start and end values.

library(dplyr)
library(tidyr)

# Split on dash, comma or space; keep the first and last pieces
f <- \(v) {
  v <- strsplit(v, "-|,| ")[[1]]
  list(start = v[1], end = v[length(v)])
}

df %>% 
  mutate(`Column B` = lapply(`Column B`, f)) %>%
  unnest_wider(`Column B`)

Output:

# A tibble: 2 × 3
  `Column A` start end    
  <chr>      <chr> <chr>  
1 Artist A   1995  present
2 Artist B   1995  present

The code below extracts the first word before the first dash and the last word after the last dash.

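# Loop over rows; length(df) would count columns, not rows
for (i in seq_len(nrow(df))) {
  df$start[i] <- sub("-.*", "", df$`Column B`[i])  # drop everything from the first dash on
  df$end[i] <- sub("^.+-", "", df$`Column B`[i])   # drop everything up to the last dash
}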
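
Since sub() is vectorized, the same two substitutions can also be written without a loop. A minimal base R sketch using the example data:

```r
df <- data.frame(
  `Column A` = c("Artist A", "Artist B"),
  `Column B` = c("1995-present", "1995-1997, 2008, 2010-present"),
  check.names = FALSE  # keep the spaces in the column names
)

# Everything from the first dash onward is removed to get the start value.
df$start <- sub("-.*", "", df$`Column B`)
# Everything up to and including the last dash is removed to get the end value.
df$end <- sub("^.+-", "", df$`Column B`)

df$start  # "1995" "1995"
df$end    # "present" "present"
```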


 