
How to Create Repeating Values for All Unique Group Values in a Column in R

I'm trying to make a column sub_species based on a condition from the Species column.

There are three unique values for Species. If Species starts with setosa, then I'd like to repeat setosa1 and setosa2 25 times each inside the new column sub_species. The same logic goes for the other two.

Note that each Species value has exactly 50 rows, so the lengths match when each label is repeated 25 times.

library(dplyr)

iris %>% 
  mutate(
    sub_species = case_when(
      startsWith(as.character(Species), "setosa") ~ rep(c("setosa1", "setosa2"), length(1:25)),
      startsWith(as.character(Species), "versicolor") ~ rep(c("versicolor1", "versicolor2"), length(1:25)),
      startsWith(as.character(Species), "virginica") ~ rep(c("virginica1", "virginica2"), length(1:25))
    )
  )

Error: must be length 150 or one, not 50.

I tried with just setosa separately and it worked. However, it doesn't work when I want to do it as a whole.

You were close: instead of length(1:25) (which may not work as intended, since it evaluates to 25 and rep() then returns 50 values rather than the 150 that mutate() expects), use length.out. It has the added safeguard (in general, not with iris) of producing exactly the right number of sub-species values even when you have an odd number of rows.

iris %>% 
  mutate(
    sub_species = case_when(
      startsWith(as.character(Species), "setosa") ~ rep(c("setosa1", "setosa2"), length.out = n()),
      startsWith(as.character(Species), "versicolor") ~ rep(c("versicolor1", "versicolor2"), length.out = n()),
      startsWith(as.character(Species), "virginica") ~ rep(c("virginica1", "virginica2"), length.out = n())
    )
  ) %>%
  head()
#    Sepal.Length Sepal.Width Petal.Length Petal.Width Species sub_species
# 1          5.1         3.5          1.4         0.2  setosa     setosa1
# 2          4.9         3.0          1.4         0.2  setosa     setosa2
# 3          4.7         3.2          1.3         0.2  setosa     setosa1
# 4          4.6         3.1          1.5         0.2  setosa     setosa2
# 5          5.0         3.6          1.4         0.2  setosa     setosa1
# 6          5.4         3.9          1.7         0.4  setosa     setosa2
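
The length mismatch behind the original error can be checked directly. The sketch below also shows one possible grouped variant, which is not part of the answer above: it assumes you are happy to build the label with paste0() and to group by Species, so that n() is the size of each group and the 1/2 alternation restarts within every species.

```r
library(dplyr)

# A positional second argument to rep() is `times`, so this yields 2 * 25 = 50 values,
# which is neither 150 (the number of rows in iris) nor 1.
length(rep(c("setosa1", "setosa2"), length(1:25)))

# Grouped variant (an assumption, not the answer's method):
# inside group_by(), n() is the number of rows in the current group.
out <- iris %>%
  group_by(Species) %>%
  mutate(sub_species = paste0(Species, rep(1:2, length.out = n()))) %>%
  ungroup()

# Count how many rows received each label.
table(out$sub_species)
```

Because the alternation is computed per group, this form does not depend on each species starting on an odd row number.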

base R

paste0 by itself might be okay, but as in the dplyr example above, this code is a little safer when there is not an even number of rows.

iris$sub_species <- paste0(
  iris$Species,
  ave(seq_len(nrow(iris)), iris$Species,
      FUN = function(z) rep(1:2, length.out = length(z)))
)
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species sub_species
# 1          5.1         3.5          1.4         0.2  setosa     setosa1
# 2          4.9         3.0          1.4         0.2  setosa     setosa2
# 3          4.7         3.2          1.3         0.2  setosa     setosa1
# 4          4.6         3.1          1.5         0.2  setosa     setosa2
# 5          5.0         3.6          1.4         0.2  setosa     setosa1
# 6          5.4         3.9          1.7         0.4  setosa     setosa2

The seq_len(nrow(iris)) is there because ave requires the return value to be the same class as its first argument; since we want numbers, I gave it numbers. We don't care what they are, but there must be one per row. (I could have used one of the numeric columns, but I wanted my intentions here to be clear.)
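
To make the ave step easier to follow, here is a minimal sketch on a toy vector with odd group sizes. The names g and idx and the data are made up for illustration and do not come from the question.

```r
# Hypothetical toy data: group "a" has 3 rows, group "b" has 2.
g <- c("a", "a", "a", "b", "b")

# ave() splits the first argument by g, applies FUN to each piece,
# and puts the results back in the original row positions.
idx <- ave(seq_along(g), g,
           FUN = function(z) rep(1:2, length.out = length(z)))

idx            # 1 2 1 1 2
paste0(g, idx) # "a1" "a2" "a1" "b1" "b2"
```

The alternation restarts at 1 for each group, and length.out trims the last repetition when a group has an odd number of rows.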


 