Maximum occurrence of string from a character variable in R
I have a data frame with two columns (hospital name, type). Both variables are character variables. The data looks like this:
hospital_name type
ABC rural
ABC rural
ABC urban
XYZ urban
XYZ urban
EFG rural
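I am writing code that groups by hospital name and counts each type within the group. Next, I want to create a new column named type2 that holds the value occurring most often in the type column. The desired output is: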
hospital_name type type2
ABC rural rural
XYZ urban urban
EFG rural rural
I tried to solve this with dplyr but got an error. Here is my solution:
library("dplyr")
df<-df%>%group_by(hospital_name)%>%mutate(type2=names(which.max(table(type))))
The error is:
Error: incompatible types, expecting a character vector
Given that your code above ran without error but did not produce the desired output, I only tweaked it slightly to get the desired result:
dat <- dplyr::data_frame(hospital_name = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "EFG"),
type = c("rural", "rural", "urban", "urban", "urban", "rural"))
dat <- dat %>% group_by(hospital_name) %>%
mutate(type2 = names(which.max(table(type)))) %>%
filter(type == type2) %>%
distinct()
dat
# Source: local data frame [3 x 3]
# Groups: hospital_name [3]
#
# hospital_name type type2
# (chr) (chr) (chr)
# 1 ABC rural rural
# 2 XYZ urban urban
# 3 EFG rural rural
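To make the expression inside mutate easier to follow, here is a minimal base R sketch of what names(which.max(table(type))) does for a single group. The variable names counts and winner are only illustrative and do not appear in the original code.

```r
# One group's worth of values (the ABC rows from the example data)
type <- c("rural", "rural", "urban")

counts <- table(type)          # frequency of each distinct value
winner <- which.max(counts)    # position of the first maximum, keeping its name
names(winner)                  # the most frequent value as a character string
```

Note that which.max returns the first maximum, so if two values are tied the one that comes first in the table wins.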
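The comments above indicate that the data has NA values in the type column, which seems to be what raises the error. However, that does not seem to be a problem on my machine.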
dat <- data.frame(hospital_name = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "EFG"),
type = c("rural", "rural", "urban", "urban", NA, "rural"))
dat
# hospital_name type
# 1 ABC rural
# 2 ABC rural
# 3 ABC urban
# 4 XYZ urban
# 5 XYZ <NA>
# 6 EFG rural
sapply(dat, class)
# hospital_name type
# "factor" "factor"
dat %>%
group_by(hospital_name) %>%
mutate(type2 = names(which.max(table(type))))
# Source: local data frame [6 x 3]
# Groups: hospital_name [3]
# hospital_name type type2
# (fctr) (fctr) (chr)
# 1 ABC rural rural
# 2 ABC rural rural
# 3 ABC urban rural
# 4 XYZ urban urban
# 5 XYZ NA urban
# 6 EFG rural rural
So, I was finally able to reproduce your error.
dat <- structure(list(NET_PARENT = c("COMMUNITY HEALTH SYSTEMS (CHS)",
"JEFFERSON HEALTH", "JEFFERSON HEALTH", "MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL)",
"TENET HEALTHCARE", "TENET HEALTHCARE", "TENET HEALTHCARE", "TENET HEALTHCARE",
"LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS)", "INDIAN HEALTH SERVICES"
), OWNERSHIP = c("for_profit", "non-profit", "non-profit", "non-profit",
"for_profit", NA, NA, NA, "for_profit", NA)), .Names = c("NET_PARENT",
"OWNERSHIP"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 10L,
13L), class = "data.frame")
dat
# NET_PARENT OWNERSHIP
# 1 COMMUNITY HEALTH SYSTEMS (CHS) for_profit
# 2 JEFFERSON HEALTH non-profit
# 3 JEFFERSON HEALTH non-profit
# 4 MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL) non-profit
# 5 TENET HEALTHCARE for_profit
# 6 TENET HEALTHCARE <NA>
# 7 TENET HEALTHCARE <NA>
# 8 TENET HEALTHCARE <NA>
# 10 LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS) for_profit
# 13 INDIAN HEALTH SERVICES <NA>
dat %>% group_by(NET_PARENT) %>% mutate(type2 = names(which.max(table(OWNERSHIP))))
# Error: incompatible types, expecting a character vector
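This happens because, for dat$NET_PARENT == "INDIAN HEALTH SERVICES" and dat$NET_PARENT == "TENET HEALTHCARE", the most common option is NA. This raises an error in mutate because it expects a character value but gets a NULL value. We can fix this with the following change.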
dat %>%
group_by(NET_PARENT) %>%
mutate(type2 = ifelse(length(which.max(table(OWNERSHIP))) == 0,
"NA",
names(which.max(table(OWNERSHIP)))))
# Source: local data frame [10 x 3]
# Groups: NET_PARENT [6]
# NET_PARENT OWNERSHIP type2
# (chr) (chr) (chr)
# 1 COMMUNITY HEALTH SYSTEMS (CHS) for_profit for_profit
# 2 JEFFERSON HEALTH non-profit non-profit
# 3 JEFFERSON HEALTH non-profit non-profit
# 4 MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL) non-profit non-profit
# 5 TENET HEALTHCARE for_profit for_profit
# 6 TENET HEALTHCARE NA for_profit
# 7 TENET HEALTHCARE NA for_profit
# 8 TENET HEALTHCARE NA for_profit
# 9 LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS) for_profit for_profit
# 10 INDIAN HEALTH SERVICES NA NA
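Note that type2 is "for_profit" for "TENET HEALTHCARE" even though the most frequent value is NA. This is because table does not count NA and therefore omits it from the values. As a result, the only remaining value is recorded as the maximum. For "INDIAN HEALTH SERVICES", however, it is listed as "NA".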
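The NA behaviour described above can be checked in isolation with base R. The sketch below also shows one possible way to package the fix as a helper; most_common is a hypothetical name that is not part of the original answer, and it returns a real missing value (NA_character_) rather than the string "NA", which is an assumption about what is wanted.

```r
# Hypothetical helper (not from the original answer): most frequent value of x,
# or NA_character_ when x has no non-missing values.
most_common <- function(x) {
  counts <- table(x)               # table() drops NA by default
  if (length(counts) == 0) {
    return(NA_character_)          # all-NA group: nothing to count
  }
  names(counts)[which.max(counts)]
}

tenet  <- c("for_profit", NA, NA, NA)   # OWNERSHIP values for TENET HEALTHCARE
indian <- NA_character_                 # OWNERSHIP value for INDIAN HEALTH SERVICES

table(tenet)                     # only for_profit is counted
table(tenet, useNA = "ifany")    # NA appears as its own category
names(which.max(table(indian)))  # empty table, so no name comes back
most_common(tenet)
most_common(indian)
```

Inside the dplyr pipeline this would presumably be used as mutate(type2 = most_common(OWNERSHIP)). A plain if is used instead of ifelse because the condition is a single value per group.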