
Maximum occurrence of string from a character variable in R

I have a data frame with two columns (hospital_name, type). Both are character variables. The data looks like this:

hospital_name  type
ABC            rural
ABC            rural
ABC            urban
XYZ            urban
XYZ            urban
EFG            rural

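I am writing code that groups by hospital name and counts each type within the group. Next, I want to create a new column called type2 that holds the value occurring most often in the type column. The desired output is: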

hospital_name  type  type2
ABC            rural rural
XYZ            urban urban
EFG            rural rural        

I tried to solve this with dplyr, but I get an error. Here is my attempt:

library("dplyr")
df <- df %>% group_by(hospital_name) %>% mutate(type2 = names(which.max(table(type))))

The error is:

Error: incompatible types, expecting a character vector

Your code above runs without error for me but does not produce the desired output, so I made only a small tweak to get the required result:

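library(dplyr)

# tibble() replaces the deprecated data_frame()
dat <- tibble(hospital_name = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "EFG"), 
              type = c("rural", "rural", "urban", "urban", "urban", "rural"))

# Add the most frequent type per hospital, keep only the matching rows,
# then drop duplicates. The result is assigned so that `dat` below shows it.
dat <- dat %>% group_by(hospital_name) %>% 
  mutate(type2 = names(which.max(table(type)))) %>% 
  filter(type == type2) %>% 
  distinct()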

dat
# Source: local data frame [3 x 3]
# Groups: hospital_name [3]
#
#   hospital_name  type type2
#           (chr) (chr) (chr)
# 1           ABC rural rural
# 2           XYZ urban urban
# 3           EFG rural rural
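
As an alternative, the "count each type per hospital, then keep the most frequent one" step can also be sketched with `count()` and `slice_max()`. This assumes dplyr 1.0.0 or later; the names `type_mode` and `type2` are only illustrative, and ties are broken arbitrarily by row order.

```r
library(dplyr)

# Example data from the question
dat <- tibble(
  hospital_name = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "EFG"),
  type = c("rural", "rural", "urban", "urban", "urban", "rural")
)

# Count each (hospital_name, type) pair, then keep the most frequent
# type per hospital. with_ties = FALSE keeps a single row when counts tie.
type_mode <- dat %>%
  count(hospital_name, type) %>%
  group_by(hospital_name) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(hospital_name, type2 = type)

type_mode
```

This yields one row per hospital with only `hospital_name` and `type2`. It can be joined back onto the original data with `left_join()` if the `type` column is still needed.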

Update

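A comment above indicates that the data has NA values in the type column, which seems to trigger the error. However, that does not appear to be a problem on my machine.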

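# stringsAsFactors = TRUE was the default before R 4.0.0; it is set
# explicitly here so the factor columns shown below are reproduced.
dat <- data.frame(hospital_name = c("ABC", "ABC", "ABC", "XYZ", "XYZ", "EFG"), 
                  type = c("rural", "rural", "urban", "urban", NA, "rural"),
                  stringsAsFactors = TRUE)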
dat
#   hospital_name  type
# 1           ABC rural
# 2           ABC rural
# 3           ABC urban
# 4           XYZ urban
# 5           XYZ  <NA>
# 6           EFG rural

sapply(dat, class)
# hospital_name          type 
#      "factor"      "factor" 

dat %>% 
  group_by(hospital_name) %>% 
  mutate(type2 = names(which.max(table(type))))

# Source: local data frame [6 x 3]
# Groups: hospital_name [3]

#   hospital_name   type type2
#          (fctr) (fctr) (chr)
# 1           ABC  rural rural
# 2           ABC  rural rural
# 3           ABC  urban rural
# 4           XYZ  urban urban
# 5           XYZ     NA urban
# 6           EFG  rural rural
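
One likely reason the NA is harmless here is that `table()` drops NA by default, so a group that still contains at least one non-NA value always has something for `which.max()` to pick. A minimal base R sketch, using simplified character vectors rather than the factor columns above:

```r
x <- c("urban", NA)                          # values of the XYZ group

counts_default <- table(x)                   # NA is dropped by default
counts_with_na <- table(x, useNA = "ifany")  # NA gets its own cell

# The only counted value wins
xyz_mode <- names(which.max(counts_default))

# A vector that is entirely NA leaves an empty table,
# so which.max() has nothing to return
all_na <- c(NA_character_, NA_character_)
empty_len <- length(which.max(table(all_na)))

xyz_mode
empty_len
```

The second half shows the case that the data above does not contain: a group in which every value is NA.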

Update 2

So, I was finally able to reproduce your error.

dat <- structure(list(NET_PARENT = c("COMMUNITY HEALTH SYSTEMS (CHS)", 
"JEFFERSON HEALTH", "JEFFERSON HEALTH", "MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL)", 
"TENET HEALTHCARE", "TENET HEALTHCARE", "TENET HEALTHCARE", "TENET HEALTHCARE", 
"LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS)", "INDIAN HEALTH SERVICES"
), OWNERSHIP = c("for_profit", "non-profit", "non-profit", "non-profit", 
"for_profit", NA, NA, NA, "for_profit", NA)), .Names = c("NET_PARENT", 
"OWNERSHIP"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 10L, 
13L), class = "data.frame")

dat

#                                     NET_PARENT  OWNERSHIP
# 1               COMMUNITY HEALTH SYSTEMS (CHS) for_profit
# 2                             JEFFERSON HEALTH non-profit
# 3                             JEFFERSON HEALTH non-profit
# 4      MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL) non-profit
# 5                             TENET HEALTHCARE for_profit
# 6                             TENET HEALTHCARE       <NA>
# 7                             TENET HEALTHCARE       <NA>
# 8                             TENET HEALTHCARE       <NA>
# 10 LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS) for_profit
# 13                      INDIAN HEALTH SERVICES       <NA>

dat %>% group_by(NET_PARENT) %>% mutate(type2 = names(which.max(table(OWNERSHIP))))
# Error: incompatible types, expecting a character vector

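This happens because for dat$NET_PARENT == "INDIAN HEALTH SERVICES" and dat$NET_PARENT == "TENET HEALTHCARE" the most frequent value is NA. When a group contains nothing but NA, table() has nothing to count, so mutate throws an error: it expects a character value and gets an empty result instead. We can fix this with the following change.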

dat %>%
  group_by(NET_PARENT) %>%
  mutate(type2 = ifelse(length(which.max(table(OWNERSHIP))) == 0,
                        "NA",
                        names(which.max(table(OWNERSHIP)))))

# Source: local data frame [10 x 3]
# Groups: NET_PARENT [6]

#                                     NET_PARENT  OWNERSHIP      type2
#                                          (chr)      (chr)      (chr)
# 1               COMMUNITY HEALTH SYSTEMS (CHS) for_profit for_profit
# 2                             JEFFERSON HEALTH non-profit non-profit
# 3                             JEFFERSON HEALTH non-profit non-profit
# 4      MEMORIAL HEALTH SYSTEM (SPRINGFIELD IL) non-profit non-profit
# 5                             TENET HEALTHCARE for_profit for_profit
# 6                             TENET HEALTHCARE         NA for_profit
# 7                             TENET HEALTHCARE         NA for_profit
# 8                             TENET HEALTHCARE         NA for_profit
# 9  LIFEPOINT HEALTH (FKA: LIFEPOINT HOSPITALS) for_profit for_profit
# 10                      INDIAN HEALTH SERVICES         NA         NA

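Note that type2 for "TENET HEALTHCARE" is "for_profit" even though the most frequent value is NA. This is because table() does not count NA by default, so NA is left out of the tally. As a result, the only remaining value is recorded as the maximum. For "INDIAN HEALTH SERVICES", however, type2 is listed as "NA".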
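
The same guard can be wrapped in a small function so that an all-NA group yields a real missing value rather than the string "NA". This is only a sketch: `most_common` is a hypothetical helper name, not a dplyr or base R function, and it assumes the input is a character vector.

```r
# Hypothetical helper: most frequent non-NA value of x,
# or NA_character_ when there is nothing left to count.
most_common <- function(x) {
  counts <- table(x)                  # NA is excluded by default
  if (length(counts) == 0) {
    return(NA_character_)
  }
  names(counts)[which.max(counts)]    # first maximum when counts tie
}

tenet  <- most_common(c("for_profit", NA, NA, NA))
indian <- most_common(c(NA_character_))

tenet
indian
```

In the pipeline above, it would be used as `mutate(type2 = most_common(OWNERSHIP))` in place of the `ifelse()` call.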

