Split words in R Dataframe column

Question

I have a data frame with a column of words separated by single spaces. I want to split each entry into three pieces, as shown below. The data frame looks like this:

Text
one of the
i want to

I want to split it as follows:

Text         split1     split2    split3
one of the    one       one of     of the

I am able to get the first one, but cannot figure out the other two.

My code to get split1:

new_data$split1<-sub(" .*","",new_data$Text)

I figured out split2:

df$split2 <- gsub(" [^ ]*$", "", df$Text)

Answer 1

There might be more elegant solutions. Here are two options:

Using ngrams (from the NLP package, which library(tm) attaches):

library(dplyr); library(tm)
df %>% mutate(splits = strsplit(Text, "\\s+")) %>% 
       mutate(split1 = lapply(splits, `[`, 1)) %>% 
       mutate(split2 = lapply(splits, function(words) ngrams(words, 2)[[1]]), 
              split3 = lapply(splits, function(words) ngrams(words, 2)[[2]])) %>% 
       select(-splits)

        Text split1  split2   split3
1 one of the    one one, of  of, the
2  i want to      i i, want want, to

Extract the two grams manually:

df %>% mutate(splits = strsplit(Text, "\\s+")) %>% 
       mutate(split1 = lapply(splits, `[`, 1)) %>% 
       mutate(split2 = lapply(splits, `[`, 1:2), 
              split3 = lapply(splits, `[`, 2:3)) %>% 
       select(-splits)

        Text split1  split2   split3
1 one of the    one one, of  of, the
2  i want to      i i, want want, to
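
Both variants above produce list columns whose elements are character vectors (printed as one, of), whereas the question asks for single strings such as one of. A minimal base R sketch of one way to get plain strings, assuming Text is a character column and every row has at least three words (the helper name splits is just illustrative):

```r
# Sample data from the question; stringsAsFactors = FALSE keeps Text as character
df <- data.frame(Text = c("one of the", "i want to"), stringsAsFactors = FALSE)

# Split each row into its words
splits <- strsplit(df$Text, "\\s+")

# First word, then words 1-2 and words 2-3 pasted back into single strings
df$split1 <- vapply(splits, function(w) w[1], character(1))
df$split2 <- vapply(splits, function(w) paste(w[1:2], collapse = " "), character(1))
df$split3 <- vapply(splits, function(w) paste(w[2:3], collapse = " "), character(1))

df
```

Rows with fewer than three words would end up with the literal text NA pasted into the result, so such rows would need to be handled separately.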

Update:

With regular expressions, we can use backreferences in gsub.

Split2:

gsub("((.*)\\s+(.*))\\s+(.*)", "\\1", df$Text)
[1] "one of" "i want"

Split3:

gsub("(.*)\\s+((.*)\\s+(.*))", "\\2", df$Text)
[1] "of the"  "want to"
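
As a possible alternative in the same spirit as the asker's own sub call, the two-word pieces can also be obtained by deleting one word from either end instead of capturing groups. This is only a sketch and assumes each entry has exactly three words; with more words it would return everything except the last (or first) word.

```r
# Sample data from the question
df <- data.frame(Text = c("one of the", "i want to"), stringsAsFactors = FALSE)

# Drop the last word (and the whitespace before it) to get the first two words
split2 <- sub("\\s+\\S+$", "", df$Text)

# Drop the first word (and the whitespace after it) to get the last two words
split3 <- sub("^\\S+\\s+", "", df$Text)

split2
split3
```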

Answer 2

We can try with gsub. Capture each run of one or more non-whitespace characters (\\S+) as a group (in this case there are 3 words), then in the replacement rearrange the backreferences and insert a delimiter (,), which read.table then uses to split the result into separate columns.

df1[paste0("split", 1:3)] <- read.table(text=gsub("(\\S+)\\s+(\\S+)\\s+(\\S+)", 
                  "\\1,\\1 \\2,\\2 \\3", df1$Text), sep=",")
df1
#        Text split1 split2  split3
#1 one of the    one one of  of the
#2  i want to      i i want want to

data

df1 <- structure(list(Text = c("one of the", "i want to")), 
.Names = "Text", class = "data.frame", row.names = c(NA, -2L))
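
To make the two stages of this answer easier to follow, here is a sketch that stores the intermediate comma-delimited strings before parsing them. The names csv_lines and parsed are illustrative only, and stringsAsFactors = FALSE is added so the new columns are character regardless of R version.

```r
df1 <- data.frame(Text = c("one of the", "i want to"), stringsAsFactors = FALSE)

# Stage 1: rewrite each row as "w1,w1 w2,w2 w3"
csv_lines <- gsub("(\\S+)\\s+(\\S+)\\s+(\\S+)", "\\1,\\1 \\2,\\2 \\3", df1$Text)
csv_lines

# Stage 2: parse the strings as comma-separated fields, one row per element
parsed <- read.table(text = csv_lines, sep = ",", stringsAsFactors = FALSE)

# Attach the three parsed fields as new columns
df1[paste0("split", 1:3)] <- parsed
df1
```

Because read.table does the splitting, text that itself contains commas, quote characters, or # would probably need a different delimiter or extra arguments such as quote = "" and comment.char = "".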

Answer 3

This is a somewhat hackish solution.

Assumption: you are not concerned about the number of spaces between words.

> library(stringr)
> x<-c('one of the','i want to')
> strsplit(gsub('(\\S+)\\s+(\\S+)\\s+(.*)', '\\1  \\1 \\2   \\2 \\3', x), '\\s\\s+')
#[[1]]
#[1] "one"    "one of" "of the"

#[[2]]
#[1] "i"       "i want"  "want to"
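
The result above is a list with one character vector per row. If data frame columns are wanted, as in the question, one possible way is to bind the list into a matrix. This sketch assumes every element has exactly three pieces; the names parts, m, and result are illustrative.

```r
x <- c("one of the", "i want to")

# Same idea as above: separate the three pieces by runs of 2+ spaces, then split on them
parts <- strsplit(gsub("(\\S+)\\s+(\\S+)\\s+(.*)", "\\1  \\1 \\2   \\2 \\3", x), "\\s\\s+")

# Stack the per-row vectors into a character matrix with one column per piece
m <- do.call(rbind, parts)
colnames(m) <- paste0("split", 1:3)

# Combine with the original text
result <- data.frame(Text = x, m, stringsAsFactors = FALSE)
result
```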


solution1
0 2016-06-03 13:46:02

solution2
0 ACCEPTED 2016-06-03 15:21:42

solution3
0 2016-06-03 15:23:01
