This is an easy one I think but I cannot see what I'm missing. I want to split the string at the first digit. Works great until there is a non-alphanumeric symbol in the string. Help!
Works:
library(tidyr)
pet <- c("Dog 100", "Cat? 340")
df <- as.data.frame(pet)
df_split <- separate(df, pet, into = c("Animal", "Total"), sep = "(?<=[a-zA-Z])\\s*(?=[0-9])")
The first line works great but the second line does not split. Where am I going wrong?
We can use read.table from base R after stripping the literal ? with sub:
read.table(text = sub("?", "", df$pet, fixed = TRUE), header = FALSE,
col.names = c("Animal", "Total"))
# Animal Total
#1 Dog 100
#2 Cat 340
Note that for the current scenario, it is enough to split on one or more whitespace characters that are followed by one or more digits running to the end of the string:
> separate(df, pet, into = c("Animal", "Total"), sep = "\\s+(?=[0-9]+$)")
## => Animal Total
## => 1 Dog 100
## => 2 Cat? 340
See the regex demo.
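As for why the pattern from the question fails on the second row: the lookbehind (?<=[a-zA-Z]) needs a letter immediately before the whitespace, and in "Cat? 340" that character is ?. Here is a minimal sketch to check this in base R. It assumes PCRE (perl = TRUE) treats these lookarounds the same way as the engine behind separate, and the variable names are only illustrative:

```r
pet <- c("Dog 100", "Cat? 340")

# Pattern from the question: a letter must directly precede the whitespace.
orig_pattern <- "(?<=[a-zA-Z])\\s*(?=[0-9])"
# Pattern from this answer: whitespace followed by digits up to the end.
new_pattern <- "\\s+(?=[0-9]+$)"

# Does each pattern find a split point in each string?
orig_hits <- grepl(orig_pattern, pet, perl = TRUE)
new_hits  <- grepl(new_pattern, pet, perl = TRUE)

# Base R split on the new pattern (the matched whitespace is dropped).
parts <- strsplit(pet, new_pattern, perl = TRUE)
```

Here orig_hits is FALSE for "Cat? 340", which is why that row is left unsplit, while new_pattern finds a split point in both strings.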
However, in the general case, it is much easier to use tidyr::extract here, since the pattern you need will be much simpler:
^(\D*?)\s*(\d.*)
Note that if your strings can contain newlines, you will need to prepend the pattern with (?s), the so-called DOTALL modifier, which allows . to match line break chars in an ICU pattern.
See the regex demo.
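To see what (?s) changes, here is a small base R sketch. It uses PCRE via perl = TRUE rather than ICU, on the assumption that both engines treat . and (?s) alike for this pattern. The input string and the helper name group2 are made up for illustration, and the perl argument of regexec needs R 3.3.0 or later:

```r
# Hypothetical input with a line break after the number.
x <- "Dog 100\nper year"

# Illustrative helper: return capture group 2 of the first match.
group2 <- function(pattern, text) {
  regmatches(text, regexec(pattern, text, perl = TRUE))[[1]][3]
}

without_s <- group2("^(\\D*?)\\s*(\\d.*)", x)      # . stops at the line break
with_s    <- group2("(?s)^(\\D*?)\\s*(\\d.*)", x)  # . also matches "\n"
```

Without the modifier the second group ends at the line break; with it, the group runs to the end of the string.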
Regex details
^ - start of string
(\D*?) - Group 1 (here, the Animal column): any 0+ non-digit symbols, as few as possible
\s* - 0 or more whitespaces
(\d.*) - Group 2 (here, the Total column): a digit followed by any 0+ chars (other than line break chars if (?s) is not used), as many as possible (* is a greedy quantifier).
R code snippet:
library(tidyr)
df_split <- extract(df, pet, into = c("Animal", "Total"), regex = "^(\\D*?)\\s*(\\d.*)")
df_split
# => Animal Total
# => 1 Dog 100
# => 2 Cat? 340
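If tidyr is not available, roughly the same extraction can be sketched in base R with regexec and regmatches. This is an illustrative equivalent, not a description of what extract does internally. It assumes R 3.3.0 or later (for the perl argument of regexec) and that every string matches the pattern; a non-matching string would need separate handling:

```r
pet <- c("Dog 100", "Cat? 340")
pattern <- "^(\\D*?)\\s*(\\d.*)"

# One character vector per string: full match, group 1, group 2.
m <- regmatches(pet, regexec(pattern, pet, perl = TRUE))
parts <- do.call(rbind, m)  # 2 x 3 character matrix

df_split <- data.frame(Animal = parts[, 2], Total = parts[, 3],
                       stringsAsFactors = FALSE)

# For comparison: a greedy group 1 with no \s* keeps the trailing space.
greedy <- regmatches(pet, regexec("(\\D*)(\\d.*)", pet, perl = TRUE))
greedy_animal <- vapply(greedy, function(g) g[2], character(1))
```

The comparison at the end shows why the lazy \D*? followed by \s* matters: the greedy variant leaves the separating space attached to the Animal value.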