Cut function alternative in R
I have some data in the form:
Person.ID Household.ID Composition
1 4593 1A_0C
2 4992 2A_1C
3 9843 1A_1C
4 8385 2A_2C
5 9823 8A_1C
6 3458 1C_9C
7 7485 2C_0C
: : :
We can think of the Composition variable as a count of adults and children, i.e. 2A_1C equates to two adults and one child.
What I want to do is reduce the number of possible levels of Composition. For person 5 we have a composition of 8A_1C, and I am looking for a way to reduce this to 4+A_1C. So, for example, we would have 4+ for any composition value greater than 4A.
Person.ID Household.ID Composition
5 9823 4+A_1C
6 3458 1A_4+C
: : :
I am unsure of how to do this in R. I am thinking of using filter() or select() from dplyr. Otherwise I would need to use some sort of regular expression.
Any help would be appreciated. Thanks
We can use gsub:
df$Composition <- gsub("(?<!\\d)([5-9]|\\d{2,})(?=[AC])", "4+", df$Composition, perl = TRUE)
This assumes that 2 or more consecutive digits represent a number that's always greater than 4 (i.e. no 01, 02, or 001).
Output:
Person.ID Household.ID Composition
1 1 4593 1A_0C
2 2 4992 2A_1C
3 3 9843 1A_1C
4 4 8385 2A_2C
5 5 9823 4+A_1C
6 6 3458 1C_4+C
7 7 7485 2C_0C
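To see what the assumption about consecutive digits means in practice, here is a small sketch that runs the same pattern over a few strings. The test values are hypothetical (only the first two shapes appear in the question); the last one has a leading zero and shows the case the pattern does not handle.

```r
# Hypothetical inputs: two from the question's shape, plus edge cases
x <- c("8A_1C", "1A_9C", "12A_3C", "4A_0C", "03A_1C")

# Same pattern as above: a lone digit 5-9, or any run of 2+ digits,
# not preceded by a digit and followed by A or C
collapsed <- gsub("(?<!\\d)([5-9]|\\d{2,})(?=[AC])", "4+", x, perl = TRUE)

print(collapsed)
# "03A_1C" becomes "4+A_1C" even though 03 is not greater than 4
```

If leading zeros can occur in the real data, parsing the counts as numbers before comparing them is likely the safer option.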
Data:
library(dplyr)  # provides tibble(), mutate(), select() and %>%
Person.ID <- c(1,2,3,4,5,6,7)
Household.ID <- c(4593,4992,9843,8385,9823,3458,7485)
Composition <- c("1A_0C","2A_1C","1A_1C","2A_2C","8A_1C","1A_9C","2A_0C")
dat <- tibble(Person.ID, Household.ID, Composition)
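Function:
above4 <- function(f){
  # keep only the digits and convert to a number so the comparison is numeric
  ff <- as.numeric(gsub("[^0-9]", "", f))
  # counts above 4 collapse to "4+"; everything else is returned as text
  if (ff > 4) "4+" else as.character(ff)
}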
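One point worth checking with a helper like this: gsub() returns a character string, and when R compares a string with a number it converts the number to text and compares the two as strings. That gives the expected answer for single digits but can misclassify counts of 10 or more, which is why a numeric conversion matters here. A minimal sketch with an illustrative value ("10A" is not in the sample data):

```r
# gsub() returns text, so the first comparison is string-vs-number
digits <- gsub("[^0-9]", "", "10A")    # "10"

text_cmp <- digits > 4                 # 4 is coerced to "4"; "10" sorts before "4"
num_cmp  <- as.numeric(digits) > 4     # compared as numbers

print(c(text = text_cmp, numeric = num_cmp))
# text is FALSE, numeric is TRUE
```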
Apply function (done on separated data, but can recombine after):
dat_ <- dat %>% tidyr::separate(., col=Composition,
into=c("Adults", "Children"),
sep="_") %>%
dplyr::mutate(Adults_ = unlist(lapply(Adults,above4)),
Children_ = unlist(lapply(Children,above4)))
You might then use mutate() and select() to recombine the columns and get your required dataset.
dat_ %>% dplyr::mutate(Composition_ = paste0(Adults_, "A_", Children_, "C")) %>%
dplyr::select(Person.ID, Household.ID, Composition=Composition_)
# A tibble: 7 x 3
Person.ID Household.ID Composition
<dbl> <dbl> <chr>
1 1. 4593. 1A_0C
2 2. 4992. 2A_1C
3 3. 9843. 1A_1C
4 4. 8385. 2A_2C
5 5. 9823. 4+A_1C
6 6. 3458. 1A_4+C
7 7. 7485. 2A_0C
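For comparison, the same result can probably be reached in base R without separating the column. This is only a sketch under the assumption that every value has the form <digits>A_<digits>C; the helper names cap_count and collapse_composition are made up for illustration and are not part of either answer.

```r
# Hypothetical helper: label counts above `cap` as "<cap>+"
cap_count <- function(n, cap = 4) {
  ifelse(n > cap, paste0(cap, "+"), as.character(n))
}

# Hypothetical helper: assumes strings look like "<digits>A_<digits>C"
collapse_composition <- function(x, cap = 4) {
  pattern  <- "^([0-9]+)A_([0-9]+)C$"
  adults   <- as.numeric(sub(pattern, "\\1", x))  # first captured count
  children <- as.numeric(sub(pattern, "\\2", x))  # second captured count
  paste0(cap_count(adults, cap), "A_", cap_count(children, cap), "C")
}

res <- collapse_composition(c("8A_1C", "1A_9C", "03A_1C", "12A_0C"))
print(res)
```

Because the counts are parsed as numbers, a value such as 03 is treated as 3 rather than as a two-digit number, at the cost of dropping the leading zero in the output.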