
How to split a string into multiple columns?

I have a string that looks like this:

# character string
string <- "lambs:    cows: 281        chickens: 20   goats: 3     trees: 13"

I want to create a dataframe that looks like this:

# structure
lambs <- NA
cows <- 281
chickens <- 20
goats <- 3
trees <- 13

# dataframe
library(magrittr)  # provides the %>% pipe
df <- 
  cbind(lambs, cows, chickens, goats, trees)  %>% 
  as.data.frame()

This is what I have tried so far:

# split string
test <- strsplit(string, " ")
test

The data is quite unclean, so the spacing isn't always consistent, and a label such as lambs sometimes has a count and sometimes doesn't (as in "lamb: 5 cow: 50" and "lamb: cow: 40"). What is the easiest way to do this using tidyverse?

You can use str_match_all and pass the pattern to extract.

tmp <- stringr::str_match_all(string, '\\s*(.*?):\\s*(\\d+)?')[[1]][, -1]
data <- type.convert(data.frame(tmp), as.is = TRUE)

#        X1  X2
#1    lambs  NA
#2     cows 281
#3 chickens  20
#4    goats   3
#5    trees  13

This splits the data into two columns: the first holds everything before the colon (:), excluding leading whitespace, and the second holds the number that follows it. The number part is optional, to accommodate cases like 'lambs' that have no number.
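Since the question asks for one column per label, the two-column match can be reshaped into a single wide row. The following is only a sketch: it assumes labels consist of word characters (so it uses `\\w+` instead of `.*?`), and the name `wide` is chosen here for illustration.

```r
library(stringr)

string <- "lambs:    cows: 281        chickens: 20   goats: 3     trees: 13"

# Columns of m: full match, label, optional count
# (the count is NA when no digits follow the colon)
m <- str_match_all(string, "(\\w+):\\s*(\\d+)?")[[1]]

# One-row data frame: the labels become column names, the counts become values
wide <- as.data.frame(as.list(setNames(as.integer(m[, 3]), m[, 2])))
wide
```

Converting with `as.integer` keeps the missing lambs count as `NA` rather than an empty string.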

Try this:

gre <- gregexpr("\\b([A-Za-z]+:\\s*[0-9]*)\\b", string)
regmatches(string, gre)
# [[1]]
# [1] "lambs:    "   "cows: 281"    "chickens: 20" "goats: 3"     "trees: 13"   
lapply(regmatches(string, gre), strcapture, pattern = "(.*):(.*)", proto = list(anim = character(0), n = character(0)))
# [[1]]
#       anim    n
# 1    lambs     
# 2     cows  281
# 3 chickens   20
# 4    goats    3
# 5    trees   13
frames <- lapply(regmatches(string, gre), strcapture,
                 pattern = "(.*):(.*)", proto = list(anim = character(0), n = character(0)))

If you have multiple strings (and not just one), this ensures that each string is processed separately and that all the results are then combined.

alldat <- do.call(rbind, frames)
alldat$n <- as.integer(alldat$n)
alldat
#       anim   n
# 1    lambs  NA
# 2     cows 281
# 3 chickens  20
# 4    goats   3
# 5    trees  13
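To illustrate the multi-string case, here is a sketch that runs the same base R approach on the two example strings from the question. The `id` column, which records the source string of each row, is an addition made here and is not part of the answer above.

```r
strings <- c("lamb: 5 cow: 50", "lamb: cow: 40")

# One character vector of "label: count" pieces per input string
pieces <- regmatches(strings, gregexpr("[A-Za-z]+:\\s*[0-9]*", strings))

frames <- lapply(seq_along(pieces), function(i) {
  out <- strcapture("([A-Za-z]+):\\s*([0-9]*)", pieces[[i]],
                    proto = data.frame(anim = character(), n = character()))
  # id marks which input string the rows came from
  cbind(id = i, out)
})

alldat <- do.call(rbind, frames)
alldat$n <- as.integer(alldat$n)  # empty strings become NA
alldat
```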

If you instead really need the data in a "wide" format, then

do.call(rbind, lapply(frames, function(z) do.call(data.frame, setNames(as.list(as.integer(z$n)), z$anim))))
#   lambs cows chickens goats trees
# 1    NA  281       20     3    13

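You can try read.table. The "no lambs" issue can be worked around by inserting a zero with gsub and turning the zeros back into NA afterwards. Note that this also turns a genuine count of 0 into NA.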

r <- na.omit(unlist(read.table(text=gsub(": ", " 0", string), sep=" ")))
r <- replace(r, r == 0, NA)

## long format
type.convert(as.data.frame(matrix(r, ncol=2, byrow=TRUE)), as.is=TRUE)
#         V1  V2
# 1    lambs  NA
# 2     cows 281
# 3 chickens  20
# 4    goats   3
# 5    trees  13

## wide format
setNames(type.convert(r[seq(r) %% 2 == 0], as.is=TRUE), r[seq(r) %% 2 == 1])
# lambs     cows chickens    goats    trees 
#    NA      281       20        3       13 
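A zero placeholder cannot tell a missing count apart from a real count of 0. One possible variation, sketched here with a made-up input that contains a real zero, inserts a literal NA instead. The regular expression and the use of scan are choices made for this sketch, and it assumes no label is itself spelled "NA".

```r
string <- "lambs:    cows: 0        chickens: 20"

# Put a literal NA after any colon that is not followed by a digit
filled <- gsub(":\\s*(?![0-9\\s])", ": NA ", string, perl = TRUE)

# Drop the colons and split on whitespace: labels and values alternate
tokens <- scan(text = gsub(":", " ", filled), what = "", quiet = TRUE)

counts <- setNames(type.convert(tokens[c(FALSE, TRUE)], as.is = TRUE),
                   tokens[c(TRUE, FALSE)])
counts
```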
