
Create new dataframe with multiple columns based on single character column in R

I have a list of plant species and the counties in which they occur. I would like to create a new data frame with the plant species and a column for each county, containing 1 if the plant occurs in that county and 0 if it does not.

Here are some sample data:

Accepted.Symbol County
ABRON   TX(Andrews, Armstrong, Bailey, Brewster)
ABAM2   TX(Brooks, Hidalgo, Jim Hogg, Kenedy, Kleberg, Live Oak, Starr)
ABAN    TX(Brewster, Culberson, El Paso, Ellis, Hudspeth, Presidio, Reeves)
ABCA    TX(Culberson)
ABFR2   TX(Andrews, Armstrong, Bailey, Briscoe)
ABMA5   TX(Freestone, Leon, Robertson)
ABUTI   TX(Andrews, Aransas, Atascosa, Bastrop)

Example county list data:

 Anderson
 Andrews
 Angelina
 Aransas
 Archer

Here is what I want the output to look like (note that the name of the plant column doesn't matter, but the names of the county columns do):

Plant  Anderson  Andrews
ABRON  0         1
ABAM2  0         0

I have written a function to attempt this re-organization, because I will have to update it periodically. In the function below, "data" is the list of plants with counties and "list" is a separate list of all the counties.

county.list <- function(data, list) {
  output <- data.frame(data$Accepted.Symbol)  # creates output dataset
  for (i in 1:length(list)) {
    county <- list[i]
    test <- grepl(as.character(county), data$County)  # outputs T/F for county name
    test.1 <- test * 1                                # converts T/F to 1/0
    output <- cbind(output, test.1)                   # adds column to output dataset
    names(output)[names(output) == "test.1"] <- as.character(county)  # renames column
  }
  return(output)
}

t1<-county.list(plants,counties)

When I run this function, I get a dataframe with 2 columns. The first has all the plant codes. The second column is all 0 with a column name of "c(1,2,3,...,267)". When I test the steps outside the "for" loop (for a single county), every step works, so I suspect that the problem lies in the loop.

I have searched for other similar questions, but none quite capture what I'm trying to do. I'm open to using methods other than a loop if that will work better.

Thanks in advance.

We can strip the parentheses and the prefix before ( from the 'County' column of the first dataset ('df1') with gsub, then use cSplit from splitstackshape to split 'County' on commas and reshape the data to long format. Next, convert 'Accepted.Symbol' to 'factor' class, set 'County' as the key column (setkey), and join with 'df2'. Finally, dcast from the devel version of data.table reshapes the result from 'long' format to 'wide', using length as the aggregation function and drop=FALSE so that factor levels with no matching rows are kept.

Instructions to install the devel version of data.table are here

library(data.table)#v1.9.5+
library(splitstackshape)
df1$County <- gsub('.*\\(|\\)', '', df1$County)
dcast(
   setkey(
     cSplit(df1, 'County', ',', 'long')[,
         Accepted.Symbol:= factor(Accepted.Symbol)],
          County)[df2],
    Accepted.Symbol~County, value.var='County', length, drop=FALSE)
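For comparison, the same strip, split, and match steps can be sketched in base R without extra packages. This is only an illustrative alternative, not the method used above; the small 'df1' and 'df2' below are trimmed-down stand-ins for the sample data, and the names 'stripped', 'county_sets', 'presence', and 'result' are hypothetical.

```r
# Trimmed-down stand-ins for the sample data (assumption: same column names)
df1 <- data.frame(
  Accepted.Symbol = c("ABRON", "ABAM2", "ABUTI"),
  County = c("TX(Andrews, Armstrong, Bailey, Brewster)",
             "TX(Brooks, Hidalgo, Jim Hogg)",
             "TX(Andrews, Aransas, Atascosa, Bastrop)"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  County = c("Anderson", "Andrews", "Angelina", "Aransas", "Archer"),
  stringsAsFactors = FALSE
)

# Strip the state prefix and the parentheses, as in the gsub call above
stripped <- gsub(".*\\(|\\)", "", df1$County)

# Split each string on commas and trim the surrounding whitespace
county_sets <- lapply(strsplit(stripped, ","), trimws)

# For each plant, flag which counties of the full list it occurs in;
# vapply returns a counties-by-plants matrix, so transpose it
presence <- t(vapply(county_sets,
                     function(x) as.integer(df2$County %in% x),
                     integer(nrow(df2))))
colnames(presence) <- df2$County

# check.names = FALSE keeps county names with spaces unchanged
result <- data.frame(Plant = df1$Accepted.Symbol, presence,
                     check.names = FALSE)
result
```

Because %in% compares whole names, a county such as Jack is not confused with Jackson, which can happen with substring matching.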

data

df1 <- structure(list(Accepted.Symbol = c("ABRON", "ABAM2", "ABAN", 
"ABCA", "ABFR2", "ABMA5", "ABUTI"), County = c(
"TX(Andrews, Armstrong, Bailey, Brewster)", 
"TX(Brooks, Hidalgo, Jim Hogg, Kenedy, Kleberg, Live Oak, Starr)", 
"TX(Brewster, Culberson, El Paso, Ellis, Hudspeth, Presidio, Reeves)", 
"TX(Culberson)", "TX(Andrews, Armstrong, Bailey, Briscoe)", 
"TX(Freestone, Leon, Robertson)", 
"TX(Andrews, Aransas, Atascosa, Bastrop)")), 
 .Names = c("Accepted.Symbol", 
 "County"), class = "data.frame", row.names = c(NA, -7L))

 df2 <- structure(list(County = c("Anderson", "Andrews", "Angelina", 
 "Aransas", "Archer")), .Names = "County", class = "data.frame",
 row.names = c(NA, -5L))
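As for the loop in the question, the symptoms described (a single extra column with a name like "c(1,2,3,...)") are consistent with the county list being passed as a one-column data frame rather than a vector, although the question does not confirm this. The sketch below assumes that is the cause. It passes the column itself and replaces the loop with vapply; the argument name 'counties' and the trimmed-down data are hypothetical.

```r
# Trimmed-down stand-ins for the sample data (assumption: same column names)
df1 <- data.frame(
  Accepted.Symbol = c("ABRON", "ABAM2", "ABUTI"),
  County = c("TX(Andrews, Armstrong, Bailey, Brewster)",
             "TX(Brooks, Hidalgo, Jim Hogg)",
             "TX(Andrews, Aransas, Atascosa, Bastrop)"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  County = c("Anderson", "Andrews", "Angelina", "Aransas", "Archer"),
  stringsAsFactors = FALSE
)

# length() of a data frame is its number of columns, so a loop over
# 1:length(df2) runs only once
length(df2)          # 1
length(df2$County)   # 5

county.list <- function(data, counties) {
  counties <- as.character(counties)  # also handles a factor column
  # One integer column per county; fixed = TRUE treats each county
  # name as a literal string rather than a regular expression
  flags <- vapply(counties,
                  function(cty) as.integer(grepl(cty, data$County, fixed = TRUE)),
                  integer(nrow(data)))
  # vapply names the matrix columns after the county names
  data.frame(Plant = data$Accepted.Symbol, flags, check.names = FALSE)
}

# Pass the column (a vector), not the whole data frame
t1 <- county.list(df1, df2$County)
t1
```

Note that grepl still does substring matching here, so this version stays closest to the original function rather than being the most robust option.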


 