
How to count letters in a string and return the highest occurring letter for rows in a data frame in R

I have a column in a data frame that consists of letters describing wind directions. I need to find the most common direction for each row, which would involve counting the number of occurrences of each letter, and then selecting the letter that was most common. This is an example of the data frame:

structure(list(Day = c("15", "16", "17", "18", "19", "20"), Month = structure(c(4L, 
4L, 4L, 4L, 4L, 4L), .Label = c("Dec", "Nov", "Oct", "Sep"), class = "factor"), 
    Year = structure(c(2L, 2L, 2L, 2L, 2L, 2L), .Label = c("2012", 
    "2013", "2014", "2015", "2018", "2019", "2020"), class = "factor"), 
    Time = structure(c(10L, 10L, 10L, 10L, 10L, 10L), .Label = c("1-2pm", 
    "10-11am", "11-12am", "12-1pm", "2-3pm", "3-4pm", "4-5pm", 
    "5-6pm", "7-8am", "8-9am", "9-10am"), class = "factor"), 
    Direction_Abrev = c("S-SE", "S-SE", "SW-S", "W-SE", "W-SW", 
    "SW-S")), row.names = c(NA, 6L), class = "data.frame")

I would like the resulting data frame to be like the following:

  Day Month Year  Time Direction_Abrev
1  15   Sep 2013 8-9am              S
2  16   Sep 2013 8-9am              S
3  17   Sep 2013 8-9am              S
4  18   Sep 2013 8-9am           W-SE
5  19   Sep 2013 8-9am              W
6  20   Sep 2013 8-9am              S

That is, each row should contain its most common letter. There is one complication (see row 4): sometimes all letters are equally common. In those cases I would like to keep the original value, if that is possible. Thanks in advance.

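sapply(dat$Direction_Abrev, function(s) {
  # Split into single characters and drop the hyphen separator.
  # (setdiff() would also drop duplicate letters, so filter by comparison.)
  chars <- strsplit(s, "")[[1]]
  counts <- sort(table(chars[chars != "-"]), decreasing = TRUE)
  # On a tie for first place (or a single distinct letter), keep the original.
  if (length(counts) < 2 || counts[1] == counts[2]) s else names(counts)[1]
})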
#   S-SE   S-SE   SW-S   W-SE   W-SW   SW-S 
#    "S"    "S"    "S" "W-SE"    "W"    "S" 
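
To write the result back into the data frame, the same counting logic can be wrapped in a small helper. This is only a minimal sketch, not the answerer's code: the name most_common_letter and its sep argument are hypothetical, and dat here is a cut-down copy of the example data with two columns.

```r
# Hypothetical helper: return the single most frequent letter of `s`,
# or `s` itself when several letters tie for first place.
most_common_letter <- function(s, sep = "-") {
  chars <- strsplit(s, "")[[1]]
  chars <- chars[chars != sep]          # ignore the separator
  if (length(chars) == 0) return(s)     # nothing to count
  counts <- table(chars)
  top <- names(counts)[counts == max(counts)]
  if (length(top) == 1) top else s
}

# Cut-down version of the example data (assumed name: dat)
dat <- data.frame(
  Day = c("15", "16", "17", "18", "19", "20"),
  Direction_Abrev = c("S-SE", "S-SE", "SW-S", "W-SE", "W-SW", "SW-S"),
  stringsAsFactors = FALSE
)

# vapply() guarantees one character value per row; USE.NAMES = FALSE
# drops the names that sapply() would attach.
dat$Direction_Abrev <- vapply(
  dat$Direction_Abrev, most_common_letter, character(1), USE.NAMES = FALSE
)
```

Unlike the sapply call above, this version overwrites the column in place, so make a copy first if the original strings are still needed.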

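Here is a base R option using strsplit + intersect, with the data frame stored as df. It keeps the letter shared by the two hyphen-separated parts and falls back to the original value when they share none: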

transform(
  df,
  Direction_Abrev = unlist(
    ifelse(
      lengths(
        v <- sapply(
          strsplit(Direction_Abrev, "-"),
          function(x) do.call(intersect, strsplit(x, ""))
        )
      ),
      v,
      Direction_Abrev
    )
  )
)

which gives

  Day Month Year  Time Direction_Abrev
1  15   Sep 2013 8-9am               S
2  16   Sep 2013 8-9am               S
3  17   Sep 2013 8-9am               S
4  18   Sep 2013 8-9am            W-SE
5  19   Sep 2013 8-9am               W
6  20   Sep 2013 8-9am               S
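
The intersect idea can also be written as a standalone function. do.call(intersect, ...) expects exactly two parts, so the sketch below swaps in Reduce(intersect, ...), which accepts any number of parts. This is an assumption-laden variant rather than the answer's own code: the name common_letter and its sep argument are hypothetical, and treating "more than one shared letter" as a tie is a choice made here.

```r
# Hypothetical helper: return the one letter shared by every
# hyphen-separated part of `s`, or `s` itself when the parts share
# no letter or more than one.
common_letter <- function(s, sep = "-") {
  parts <- strsplit(s, sep, fixed = TRUE)[[1]]
  # Reduce() folds intersect() over the per-part letter vectors
  shared <- Reduce(intersect, strsplit(parts, ""))
  if (length(shared) == 1) shared else s
}

directions <- c("S-SE", "S-SE", "SW-S", "W-SE", "W-SW", "SW-S")
result <- vapply(directions, common_letter, character(1), USE.NAMES = FALSE)
```

Note that this checks which letter is shared by all parts, not which letter occurs most often. The two rules agree on the example data but can differ on other inputs.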
