
Creating a new variable from the ranges of another column in which the ranges change - R

I am a beginner in R, so sorry if this is a very simple question. I looked but could not find the same problem.

I want to create a new variable from the ranges of another column in R, but the ranges are not the same for each row.

To be more specific, my data covers the years 1960-2000, and I have value ranges for employment. For 1960 to 1980, a teacher is 1 and a lawyer is 2, etc. For 1980-1990, a teacher is in the value range 1-29 and a lawyer is 50-89, etc. Finally, for 1990-2000, the value range for a teacher is 40-65 and for a lawyer it is 1-39.

I don't even know how to begin (teacher and lawyer are not the only occupations; there are 10 different occupations with overlapping value ranges for different years, which makes it very confusing for me).

I would appreciate your help. Thank you very much.

Here are a couple of approaches to get you started.

First, say you have a data frame with year and occupation_code:

df1 <- data.frame(
  year = c(1965, 1985, 1995),
  occupation_code = c(1, 2, 3)
)

  year occupation_code
1 1965               1
2 1985               2
3 1995               3

Then, create a second data frame that maps each combination of year range and occupation code range to an occupation. You can include all of your occupations here.

df2 <- data.frame(
  year_start = c(1960, 1960, 1980, 1980, 1990, 1990),
  year_end = c(1980, 1980, 1990, 1990, 2000, 2000),
  occupation_code_start = c(1, 2, 1, 50, 40, 1),
  occupation_code_end = c(1, 2, 29, 89, 65, 39),
  occupation = c("teacher", "lawyer", "teacher", "lawyer", "teacher", "lawyer")
)

  year_start year_end occupation_code_start occupation_code_end occupation
1       1960     1980                     1                   1    teacher
2       1960     1980                     2                   2     lawyer
3       1980     1990                     1                  29    teacher
4       1980     1990                    50                  89     lawyer
5       1990     2000                    40                  65    teacher
6       1990     2000                     1                  39     lawyer
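
One thing worth checking in a lookup table like this: the year ranges in the question share endpoints (1980 ends one period and starts the next), so with inclusive comparisons a boundary year can match two rows. The sketch below uses a hypothetical probe record (year 1980, code 2) to show the ambiguity, and then one possible fix that treats each period as half-open. Which period a boundary year really belongs to is an assumption here; only the data's codebook can settle it.

```r
# Lookup table from the answer (rebuilt here so the snippet is self-contained).
df2 <- data.frame(
  year_start = c(1960, 1960, 1980, 1980, 1990, 1990),
  year_end = c(1980, 1980, 1990, 1990, 2000, 2000),
  occupation_code_start = c(1, 2, 1, 50, 40, 1),
  occupation_code_end = c(1, 2, 29, 89, 65, 39),
  occupation = c("teacher", "lawyer", "teacher", "lawyer", "teacher", "lawyer")
)

# Hypothetical probe record sitting exactly on a period boundary.
probe_year <- 1980
probe_code <- 2

code_ok <- df2$occupation_code_start <= probe_code &
  df2$occupation_code_end >= probe_code

# Inclusive on both ends: the 1960-1980 and 1980-1990 rows both match.
hits <- df2[df2$year_start <= probe_year & df2$year_end >= probe_year & code_ok, ]

# Assumed fix: treat periods as [year_start, year_end), so 1980 belongs
# only to the 1980-1990 period.
hits_half_open <- df2[df2$year_start <= probe_year & df2$year_end > probe_year & code_ok, ]

print(hits$occupation)
print(hits_half_open$occupation)
```

Note that with half-open periods the final year (2000) would match nothing, so the last year_end would need to be set one year later, or the ranges written without shared endpoints in the first place.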

Then, you can merge the two together.

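One approach is with the data.table package, using a non-equi join.

library(data.table)

# Convert both data frames to data.tables by reference
setDT(df1)
setDT(df2)

# For each row of df1, find the rows of df2 whose year range and
# occupation code range both contain that row's values
df2[df1,
    on = .(year_start <= year, 
           year_end >= year, 
           occupation_code_start <= occupation_code, 
           occupation_code_end >= occupation_code),
    .(year, occupation = occupation)]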

This will give you:

   year occupation
1: 1965    teacher
2: 1985    teacher
3: 1995     lawyer

Another approach is with fuzzyjoin and tidyverse:

library(tidyverse)
library(fuzzyjoin)

fuzzy_left_join(df1, df2,
                by = c("year" = "year_start",
                       "year" = "year_end",
                       "occupation_code" = "occupation_code_start",
                       "occupation_code" = "occupation_code_end"),
                match_fun = list(`>=`, `<=`, `>=`, `<=`)) %>%
  select(year, occupation)
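
If installing extra packages is not an option, the same matching can be sketched in base R: build every combination of data row and lookup row with merge(..., by = NULL), then keep only the combinations where both ranges contain the value. This is a minimal sketch rather than part of the approaches above, and the names combined and matched are just illustrative. The cross join creates nrow(df1) * nrow(df2) rows, so it is only reasonable for a small lookup table.

```r
# Example data and lookup table, as defined earlier in the answer.
df1 <- data.frame(
  year = c(1965, 1985, 1995),
  occupation_code = c(1, 2, 3)
)
df2 <- data.frame(
  year_start = c(1960, 1960, 1980, 1980, 1990, 1990),
  year_end = c(1980, 1980, 1990, 1990, 2000, 2000),
  occupation_code_start = c(1, 2, 1, 50, 40, 1),
  occupation_code_end = c(1, 2, 29, 89, 65, 39),
  occupation = c("teacher", "lawyer", "teacher", "lawyer", "teacher", "lawyer")
)

# by = NULL makes merge() return the Cartesian product of the two tables.
combined <- merge(df1, df2, by = NULL)

# Keep the pairs where the year and the code both fall inside the ranges.
matched <- subset(
  combined,
  year >= year_start & year <= year_end &
    occupation_code >= occupation_code_start &
    occupation_code <= occupation_code_end,
  select = c(year, occupation_code, occupation)
)

# Restore a predictable row order.
matched <- matched[order(matched$year), ]
print(matched)
```

The result keeps the original year and occupation_code columns and adds occupation as the new variable. Rows of df1 that match no range are dropped here, unlike with a left join.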

