I have a list of plant species and the counties in which they occur. I would like to create a new data frame with the plant species and a column for each county, with 1 if the plant occurs in that county and 0 if it does not.
Here are some sample data:
Accepted.Symbol County
ABRON TX(Andrews, Armstrong, Bailey, Brewster)
ABAM2 TX(Brooks, Hidalgo, Jim Hogg, Kenedy, Kleberg, Live Oak, Starr)
ABAN TX(Brewster, Culberson, El Paso, Ellis, Hudspeth, Presidio, Reeves)
ABCA TX(Culberson)
ABFR2 TX(Andrews, Armstrong, Bailey, Briscoe)
ABMA5 TX(Freestone, Leon, Robertson)
ABUTI TX(Andrews, Aransas, Atascosa, Bastrop)
Example county list data:
Anderson
Andrews
Angelina
Aransas
Archer
Here is what I want the output to look like (note that the name of the plant column doesn't matter, but the names of the county columns do):
Plant Anderson Andrews
ABRON 0 1
ABAM2 0 0
I have written a function to attempt this re-organization, because I will have to update it periodically. In the function below, "data" is the list of plants with counties and "list" is a separate list of all the counties.
county.list <- function(data, list) {
  output <- data.frame(data$Accepted.Symbol) # creates output dataset
  for (i in 1:length(list)) {
    county <- list[i]
    test <- grepl(as.character(county), data$County) # outputs T/F for county name
    test.1 <- test * 1 # converts T/F to 1/0
    output <- cbind(output, test.1) # adds column to output dataset
    names(output)[names(output) == "test.1"] <- as.character(county) # renames column
  }
  return(output)
}
t1 <- county.list(plants, counties)
When I run this function, I get a dataframe with 2 columns. The first has all the plant codes. The second column is all 0 with a column name of "c(1,2,3,...,267)". When I test the steps outside the "for" loop (for a single county), every step works, so I suspect that the problem lies in the loop.
I have searched for other similar questions, but none quite capture what I'm trying to do. I'm open to using methods other than a loop if that will work better.
Thanks in advance.
We can remove the parentheses () and the prefix before ( in the 'County' column of the first dataset ('df1'), use cSplit from splitstackshape to split 'County' on the comma delimiter (,) and reshape the dataset to long format, change 'Accepted.Symbol' to 'factor' class, set the key column to 'County' (setkey), join with 'df2', and then use dcast from the devel version of data.table to reshape from 'long' format to 'wide'.
Instructions to install the devel version of data.table are here
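library(data.table) # v1.9.5+
library(splitstackshape)
# Strip the "TX(" prefix and the closing parenthesis
df1$County <- gsub('.*\\(|\\)', '', df1$County)
dcast(
  setkey(
    # Split to one row per plant-county pair; the factor keeps all plant levels
    cSplit(df1, 'County', ',', 'long')[,
      Accepted.Symbol := factor(Accepted.Symbol)],
    County)[df2], # join on 'County' with the county list
  Accepted.Symbol ~ County, value.var = 'County', length, drop = FALSE)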
df1 <- structure(list(Accepted.Symbol = c("ABRON", "ABAM2", "ABAN",
  "ABCA", "ABFR2", "ABMA5", "ABUTI"), County = c(
  "TX(Andrews, Armstrong, Bailey, Brewster)",
  "TX(Brooks, Hidalgo, Jim Hogg, Kenedy, Kleberg, Live Oak, Starr)",
  "TX(Brewster, Culberson, El Paso, Ellis, Hudspeth, Presidio, Reeves)",
  "TX(Culberson)", "TX(Andrews, Armstrong, Bailey, Briscoe)",
  "TX(Freestone, Leon, Robertson)",
  "TX(Andrews, Aransas, Atascosa, Bastrop)")),
  .Names = c("Accepted.Symbol", "County"),
  class = "data.frame", row.names = c(NA, -7L))
df2 <- structure(list(County = c("Anderson", "Andrews", "Angelina",
  "Aransas", "Archer")), .Names = "County", class = "data.frame",
  row.names = c(NA, -5L))
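As for why the original loop returns only one extra column: most likely 'counties' is a data frame, so length(list) is its number of columns (1) and list[i] is a one-column data frame rather than a single county name, which would explain the "c(1,2,3,...)" column name. A base R alternative that avoids the extra packages can be sketched as follows. This is only a sketch under assumptions: the function name county_matrix and the output column name Plant are made up for illustration, and the county names are assumed to be passed as a vector (for example counties$County). It matches whole county names after splitting instead of using grepl, since substring matching can flag a county whose name happens to be contained in a longer one.

```r
# Small sample in the same shape as the question's data
plants <- data.frame(
  Accepted.Symbol = c("ABRON", "ABAM2", "ABUTI"),
  County = c("TX(Andrews, Armstrong, Bailey, Brewster)",
             "TX(Brooks, Hidalgo, Jim Hogg)",
             "TX(Andrews, Aransas, Atascosa, Bastrop)"),
  stringsAsFactors = FALSE
)
counties <- data.frame(
  County = c("Anderson", "Andrews", "Angelina", "Aransas", "Archer"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one 0/1 column per county name
county_matrix <- function(data, county_names) {
  county_names <- as.character(county_names)
  # Drop everything up to "(" and the closing ")"
  cleaned <- gsub(".*\\(|\\)", "", as.character(data$County))
  # One character vector of county names per plant
  per_plant <- strsplit(cleaned, ",\\s*")
  # For each county, flag the plants whose list contains that exact name
  flags <- lapply(county_names, function(cn) {
    as.integer(vapply(per_plant, function(x) cn %in% x, logical(1)))
  })
  names(flags) <- county_names
  # check.names = FALSE keeps names such as "Jim Hogg" unchanged
  data.frame(Plant = as.character(data$Accepted.Symbol), flags,
             check.names = FALSE, stringsAsFactors = FALSE)
}

# Pass the column of names, not the whole data frame
t1 <- county_matrix(plants, counties$County)
print(t1)
```

Because the function takes the county names as an argument, rerunning it after the plant or county list is updated only requires calling it again with the new inputs.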