
How can I convert a non-numeric variable column into two numeric variable columns?

Using R, I need help converting a non-numeric column into two numeric ones. I want to split the non-numeric data in column x, with the value before the dash going into one column (Start) and the value after the dash going into another column (End). Then, I want to create a new numeric column containing the difference between the Start and End columns with 1 added to the difference. (The Diff column contains a year count, so from 2011 to 2018 will be eight years.)

I encountered unexpected problems when I tried to do it. First, the x variable displayed as a factor. Second, the data in the Start and End columns were not numeric and when I tried to make them numeric so the Diff calculation could occur, I got a coercion error. Third, I could not get strsplit to work.

I checked Stack Overflow answers to comparable problems, but could not find one that worked for me.

The input data is just a very small sample of what is in the actual file.

I would prefer a solution that uses dplyr, but am open to other ones.

Input

dput(df)
structure(list(x = c(NA, "1950-1960", "1975-1986", "2011-2018"
)), class = "data.frame", row.names = c(NA, -4L))

Output

x          Start  End   Diff
1950-1960  1950   1960  11
1975-1986  1975   1986  12
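2011-2018  2011   2018   8

# Split each range on the dash once; an NA input yields NA pieces
parts <- strsplit(df$x, "-")
# Take the first and second piece of every split and convert to numeric
df$Start <- as.numeric(unlist(lapply(parts, `[`, 1)))
df$End   <- as.numeric(unlist(lapply(parts, `[`, 2)))
# Inclusive year count, so 2011 to 2018 gives 8
df$Diff  <- df$End - df$Start + 1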
df
          x Start  End Diff
1      <NA>    NA   NA   NA
2 1950-1960  1950 1960   11
3 1975-1986  1975 1986   12
4 2011-2018  2011 2018    8
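
The question mentions that x displayed as a factor and that strsplit would not work. These two symptoms are probably related: strsplit() accepts only character vectors, and data frames created with stringsAsFactors = TRUE (the default before R 4.0.0) store strings as factors. A minimal sketch, assuming that is what happened with the real file (df_fac is a hypothetical name, not from the question):

```r
# Hypothetical reproduction of the factor problem described in the question
df_fac <- data.frame(x = c(NA, "1950-1960", "1975-1986", "2011-2018"),
                     stringsAsFactors = TRUE)

# strsplit(df_fac$x, "-") would stop with "non-character argument",
# so convert the factor to character first
parts <- strsplit(as.character(df_fac$x), "-")

# First and second piece of each split; an NA input stays NA
df_fac$Start <- as.numeric(sapply(parts, `[`, 1))
df_fac$End   <- as.numeric(sapply(parts, `[`, 2))
df_fac$Diff  <- df_fac$End - df_fac$Start + 1
```

Calling as.numeric() directly on a factor returns its internal level codes rather than the years, which is another reason to go through as.character() first.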

G5W's answer is great for base R; here's a "tidyverse" version:

library(dplyr)
library(tidyr) # separate
df %>%
  filter(!is.na(x)) %>%
  tidyr::separate(x, into = c("Start", "End"), sep = "-", remove = FALSE, convert = TRUE) %>%
  mutate(Diff = End - Start + 1L)
#           x Start  End Diff
# 1 1950-1960  1950 1960   11
# 2 1975-1986  1975 1986   12
# 3 2011-2018  2011 2018    8

A quick but inflexible solution is to grab the years by position using substr():

df$Start <- as.numeric(substr(df$x, 1, 4))
df$End <- as.numeric(substr(df$x, 6, 9))
df$Diff <- df$End - df$Start + 1

df[!is.na(df$Diff), ]
          x Start  End Diff
2 1950-1960  1950 1960   11
3 1975-1986  1975 1986   12
4 2011-2018  2011 2018    8
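
If the years are not always four digits wide, the fixed positions above stop working. One possible variation, sketched here under the assumption that every non-missing value contains exactly one dash, is to cut the string at the dash with sub() instead (df_re is a hypothetical name):

```r
# Same sample data as in the question, under a hypothetical name
df_re <- data.frame(x = c(NA, "1950-1960", "1975-1986", "2011-2018"))

# Delete the dash and everything after it to keep the start year,
# and everything up to and including the dash to keep the end year
df_re$Start <- as.numeric(sub("-.*$", "", df_re$x))
df_re$End   <- as.numeric(sub("^.*-", "", df_re$x))

# Inclusive year count
df_re$Diff  <- df_re$End - df_re$Start + 1
```

sub() coerces its input to character and passes NA through unchanged, so the missing row simply ends up with NA in all three new columns.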

Yet another base R solution:

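# Split on the dash and bind the pieces row-wise into a two-column character matrix
df1[, c("Start", "End")] <- do.call(rbind, strsplit(df1$x, "-"))
# Convert the new columns to numbers (as.is = TRUE keeps x as character), then add Diff
df1 <- transform(type.convert(df1, as.is = TRUE), Diff = End - Start + 1)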

Result

df1
#          x Start  End Diff
#1      <NA>    NA   NA   NA
#2 1950-1960  1950 1960   11
#3 1975-1986  1975 1986   12
#4 2011-2018  2011 2018    8

data

df1 <- structure(list(x = c(NA, "1950-1960", "1975-1986", "2011-2018"
)), class = "data.frame", row.names = c(NA, -4L))

Base R, easy to read:

#your data
x <- c("1950-1960", "1975-1986", "2011-2018")
df <- as.data.frame(x)

#code
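# Split every row on the dash and flatten the pieces into one character vector
df_list <- unlist(apply(df, MARGIN = 1, strsplit, "-"))
# Refill row by row so each original row becomes one start/end pair
new_data <- matrix(df_list, ncol = 2, byrow = TRUE)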

#output
output <- cbind(df,new_data)

Output:

          x    1    2
1 1950-1960 1950 1960
2 1975-1986 1975 1986
3 2011-2018 2011 2018
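
The output above still has character columns named 1 and 2 and no Diff column. A minimal sketch of how this approach could be finished to match the requested output, reusing the answer's names (df, new_data, output) and assuming, as the answer does, that x has no missing values:

```r
# Sample data as in the answer above (no NA row)
x <- c("1950-1960", "1975-1986", "2011-2018")
df <- as.data.frame(x)

# Split on the dash and refill row by row into a two-column matrix
new_data <- matrix(unlist(strsplit(as.character(df$x), "-")),
                   ncol = 2, byrow = TRUE)

# Name the columns and switch the matrix from character to numeric
colnames(new_data) <- c("Start", "End")
storage.mode(new_data) <- "double"

# Attach the numeric columns and add the inclusive year count
output <- cbind(df, new_data)
output$Diff <- output$End - output$Start + 1
```

An NA in x contributes only one element after unlist(), which would throw off the two-column refill, so rows with a missing x need to be dropped or handled separately before using this variant.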


 