Easier way to use grepl and ifelse across multiple columns
I have a dataset called "jobdata":
names <- c("person1", "person2", "person3")
job1_1_sector <- c("Private", "Public", "Private")
job2_1_sector <- c(NA, "Public", "Private")
job2_2_sector <- c("Private", "Public", "Other")
job3_1_sector <- c("Private", "Private", "Private")
job3_2_sector <- c("Other", "Public", "Other")
job3_3_sector <- c("Private", NA, "Private")
jobs <- cbind(job1_1_sector, job2_1_sector, job2_2_sector, job3_1_sector,
job3_2_sector, job3_3_sector )
jobdata <- data.frame(names, jobs)
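I want to create a new binary variable, private, that equals 1 if the word Private appears in any of the relevant variables (i.e. job[123]_[123]_sector), then another one for Public and another for Other. I have figured out how to do this with ifelse and grepl, but my lines of code get really long. Is there an easier way to do this?
The code below gives me the result I want: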
jobdata$private <- ifelse(grepl("Private", jobdata$job1_1_sector) | grepl("Private", jobdata$job2_1_sector) | grepl("Private", jobdata$job2_2_sector) | grepl("Private", jobdata$job3_1_sector) | grepl("Private", jobdata$job3_2_sector) | grepl("Private", jobdata$job3_3_sector), 1, 0)
jobdata$public <- ifelse(grepl("Public", jobdata$job1_1_sector) | grepl("Public", jobdata$job2_1_sector) | grepl("Public", jobdata$job2_2_sector) | grepl("Public", jobdata$job3_1_sector) | grepl("Public", jobdata$job3_2_sector) | grepl("Public", jobdata$job3_3_sector), 1, 0)
jobdata$other <- ifelse(grepl("Other", jobdata$job1_1_sector) | grepl("Other", jobdata$job2_1_sector) | grepl("Other", jobdata$job2_2_sector) | grepl("Other", jobdata$job3_1_sector) | grepl("Other", jobdata$job3_2_sector) | grepl("Other", jobdata$job3_3_sector), 1, 0)
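Thanks!
For complex operations, it is often useful to first turn the operation into a function and then apply that function to each case. For example,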
get_sector <- function(x, sector) {
  # For each row, return 1 if any cell matches `sector`, otherwise 0
  apply(x, 1, function(y) {
    as.numeric(any(grepl(sector, y), na.rm = TRUE))
  })
}
jobdata$private <- get_sector(jobdata, "Private")
jobdata$public <- get_sector(jobdata, "Public")
jobdata$other <- get_sector(jobdata, "Other")
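Note that the calls above search every column of jobdata, including names and any flag columns already added. A minimal sketch of a variant that restricts the search to the job*_sector columns is shown below; the helper name has_sector and the sector_cols vector are hypothetical names introduced only for illustration, and the data is rebuilt from the question as character columns.

```r
# Example data rebuilt from the question (character columns).
jobdata <- data.frame(
  names         = c("person1", "person2", "person3"),
  job1_1_sector = c("Private", "Public", "Private"),
  job2_1_sector = c(NA, "Public", "Private"),
  job2_2_sector = c("Private", "Public", "Other"),
  job3_1_sector = c("Private", "Private", "Private"),
  job3_2_sector = c("Other", "Public", "Other"),
  job3_3_sector = c("Private", NA, "Private"),
  stringsAsFactors = FALSE
)

# Select only the job*_sector columns by name (computed once, before any
# flag columns are added).
sector_cols <- grep("^job[0-9]+_[0-9]+_sector$", names(jobdata), value = TRUE)

# Hypothetical helper: 1 if `sector` occurs in any of `cols` for a row, else 0.
# grepl() returns FALSE for NA, so missing values never count as a match.
has_sector <- function(df, cols, sector) {
  hits <- lapply(df[cols], function(col) grepl(sector, col, fixed = TRUE))
  as.numeric(Reduce(`|`, hits))
}

# One flag column per sector, named in lower case as in the question.
for (s in c("Private", "Public", "Other")) {
  jobdata[[tolower(s)]] <- has_sector(jobdata, sector_cols, s)
}
```

Combining the per-column results with Reduce keeps the work vectorised over rows, which may matter on larger data than a row-wise apply.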
A tidyverse/dplyr solution is to first collapse the many job columns into a single pair of label and value columns:
library(tidyverse)
jobdata.long <- jobdata %>%
  gather(job.number, sector, -names)
    names    job.number  sector
1 person1 job1_1_sector Private
2 person2 job1_1_sector  Public
3 person3 job1_1_sector Private
4 person1 job2_1_sector    <NA>
5 person2 job2_1_sector  Public
6 person3 job2_1_sector Private
7 person1 job2_2_sector Private
8 person2 job2_2_sector  Public
9 person3 job2_2_sector   Other
...
Then apply your regular expressions to the newly created sector column, possibly with summarize, to get one TRUE/FALSE flag per person and per category:
job.types <- jobdata.long %>%
  group_by(names) %>%
  summarize(
    private = any(grepl('Private', sector)),
    public = any(grepl('Public', sector)),
    other = any(grepl('Other', sector))
  )
names private public other
<fctr> <lgl> <lgl> <lgl>
1 person1 TRUE FALSE TRUE
2 person2 TRUE TRUE FALSE
3 person3 TRUE FALSE TRUE
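The question asked for 1/0 rather than TRUE/FALSE. As a small sketch, the logical flags can be converted with as.integer; here job.types is rebuilt by hand from the output shown above so the snippet runs on its own, and flag_cols is a name introduced only for illustration.

```r
# Stand-in for the summarised result shown above, rebuilt by hand.
job.types <- data.frame(
  names   = c("person1", "person2", "person3"),
  private = c(TRUE, TRUE, TRUE),
  public  = c(FALSE, TRUE, FALSE),
  other   = c(TRUE, FALSE, TRUE)
)

# Convert each logical flag column to integer 1/0.
flag_cols <- c("private", "public", "other")
job.types[flag_cols] <- lapply(job.types[flag_cols], as.integer)
```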
You can use the very powerful (s)apply family like this:
# define the types
type <- c("Private", "Public", "Other")
# columns in question
mask <- grepl("^job\\d+_\\d+_sector", colnames(jobdata))
# apply(..., 1, ...) means row-wise
jobdata[type] <- t(apply(jobdata[mask], 1, function(x) {
  # for each type, 1 if it occurs in this row, 0 otherwise
  sapply(type, function(y) {
    as.numeric(y %in% x)
  })
}))
This yields
    names job1_1_sector job2_1_sector job2_2_sector job3_1_sector job3_2_sector job3_3_sector Private Public Other
1 person1       Private          <NA>       Private       Private         Other       Private       1      0     1
2 person2        Public        Public        Public       Private        Public          <NA>       1      1     0
3 person3       Private       Private         Other       Private         Other       Private       1      0     1
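One difference from the grepl-based versions is worth keeping in mind: %in% tests for an exact match of the whole value, while grepl matches substrings. The two agree on the example data, but they could differ if a cell held a longer string. A minimal sketch, using a made-up value ("Private sector") that does not occur in the original data:

```r
# Hypothetical row where the sector label is part of a longer string.
x <- c("Private sector", NA, "Other")

# Exact match: "Private" is not one of the values in x.
exact_hit <- "Private" %in% x

# Substring match: "Private" occurs inside "Private sector".
partial_hit <- any(grepl("Private", x, fixed = TRUE))
```

Here exact_hit is FALSE and partial_hit is TRUE, so the choice depends on whether partial matches should count.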