
How to gather columns based on the first part of the column name in R?

If I have a data set similar to the following:

# State Ben.Carson.Number.of.Votes Ben.Carson.Party Ben.Carson.Percent Bernie.Sanders.Votes Bernie.Sanders.Percent Bernie.Sanders.Party
#  OH   305                        Republican       8.3                500                  12.30                  Democrat
#  FL   20                         Republican       3.0                700                  11.00                  Democrat
#  TX   400                        Republican       5.0                50                   1.00                   Democrat

 

How do I create four unified columns (Candidate Name, Votes, Percent, and Party) from the separate columns currently in the data set? That is, how do I gather the three types of columns together based on the candidate name embedded in each column name?

I tried the following but to no avail:

tidyElectionData %>%
  gather(key, value, -c(County, Location.State, State)) %>%
  separate(key, into = c("Candidate", "Party"), sep = "(^[^.]+[.][^.]+)(.+$)") %>%
  spread(Party, value)
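One likely reason the attempt fails: separate() treats sep as the delimiter to split on, and the pattern above matches the whole key, so nothing is left on either side of the split. A minimal sketch of one way around this, assuming the aim is to capture the candidate name and the rest of the column name, is tidyr::extract(), which fills new columns from regex capture groups. The keys data frame and the Field column name are made up for illustration.

```r
library(tidyr)

# Hypothetical stand-in for the `key` column that gather() produces
keys <- data.frame(
  key = c("Ben.Carson.Number.of.Votes", "Ben.Carson.Party", "Bernie.Sanders.Percent"),
  stringsAsFactors = FALSE
)

# extract() creates one new column per capture group:
#   group 1 = the first two dot-separated parts (the candidate)
#   group 2 = everything after the following dot (the field)
parts <- tidyr::extract(
  keys, key,
  into  = c("Candidate", "Field"),
  regex = "^([^.]+[.][^.]+)[.](.+)$"
)

print(parts)
```

Note that Field still holds Number.of.Votes for one candidate and Votes for the other, so it would need normalizing (for example with sub()) before spreading.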

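A tidyverse-based solution could look as follows. Every column is first converted to character because pivot_longer() cannot stack numeric and character columns into a single value column.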

library(dplyr)
library(tidyr)
library(stringr)

df %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(-State) %>%
  mutate(names = str_extract(name, 'Votes|Party|Percent'),
         name = str_extract(name, 'Ben.Carson|Bernie.Sanders')) %>%
  pivot_wider(names_from = names, values_from = value)

#   State name           Votes Party      Percent
#   <chr> <chr>          <chr> <chr>      <chr>  
# 1 OH    Ben.Carson     305   Republican 8.3    
# 2 OH    Bernie.Sanders 500   Democrat   12.3   
# 3 FL    Ben.Carson     20    Republican 3      
# 4 FL    Bernie.Sanders 700   Democrat   11     
# 5 TX    Ben.Carson     400   Republican 5      
# 6 TX    Bernie.Sanders 50    Democrat   1 
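
A possible variation, sketched here under the assumption that every column name ends in the field name (Votes, Party, or Percent): pivot_longer() can split the names itself when names_to contains the special ".value" entry. Each field then gets its own column and keeps its original type, so no conversion to character is needed. The rename step and the object name long are illustrative choices, not part of the answer above.

```r
library(dplyr)
library(tidyr)

# Same values as the example data, rebuilt so the block runs on its own
df <- data.frame(
  State = c("OH", "FL", "TX"),
  Ben.Carson.Number.of.Votes = c(305, 20, 400),
  Ben.Carson.Party = rep("Republican", 3),
  Ben.Carson.Percent = c(8.3, 3, 5),
  Bernie.Sanders.Votes = c(500, 700, 50),
  Bernie.Sanders.Percent = c(12.3, 11, 1),
  Bernie.Sanders.Party = rep("Democrat", 3),
  stringsAsFactors = FALSE
)

long <- df %>%
  # bring the vote columns into the same <candidate>.<field> shape
  rename_with(~ sub("Number\\.of\\.", "", .x)) %>%
  pivot_longer(
    -State,
    # first capture group -> Candidate column; second -> name of the value column
    names_to = c("Candidate", ".value"),
    names_pattern = "^(.+)[.]([^.]+)$"
  )

print(long)
```

Votes and Percent stay numeric in this version, which may be more convenient for later arithmetic.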

Data

df <- structure(list(State = c("OH", "FL", "TX"), Ben.Carson.Number.of.Votes = c(305, 
20, 400), Ben.Carson.Party = c("Republican", "Republican", "Republican"
), Ben.Carson.Percent = c(8.3, 3, 5), Bernie.Sanders.Votes = c(500, 
700, 50), Bernie.Sanders.Percent = c(12.3, 11, 1), Bernie.Sanders.Party = c("Democrat", 
"Democrat", "Democrat")), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"))

In base R you could do:

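# candidate names: the first two dot-separated parts of each column name
candidates <- unique(sub("(\\w+[.]\\w+).*", "\\1", names(df)[-1]))

# group the column names by their last part (Party, Percent, Votes)
columns <- split(names(df)[-1], sub(".*[.]", "", names(df)[-1]))

# reshape() expects a plain data frame, so convert in case df is a tibble
df1 <- reshape(as.data.frame(df), columns, direction = "long",
               times = candidates, idvar = "State")

names(df1)[-1] <- c("candidate", names(columns))
rownames(df1) <- NULL
df1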
  State      candidate      Party Percent Votes
1    OH     Ben.Carson Republican     8.3   305
2    FL     Ben.Carson Republican     3.0    20
3    TX     Ben.Carson Republican     5.0   400
4    OH Bernie.Sanders   Democrat    12.3   500
5    FL Bernie.Sanders   Democrat    11.0   700
6    TX Bernie.Sanders   Democrat     1.0    50
