Efficiently adding column to dataframe based on values from regex capture groups from other columns

Question

I want to add a column to an existing data frame where the value of newColumn is taken from a capture group of a regex applied to another value in the same row. The only thing I have come up with that works so far is the standard (probably not very R-like) approach of looping, but it is awfully slow for a data frame of around 1.5 million rows.

Dataframe with Columns:

ID    Text    NewColumn

At the moment I work with this:

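library(stringr)

df$newColumn <- rep("", nrow(df))
for (row in seq_len(nrow(df))) {
    # str_match() returns a matrix: column 1 is the full match, column 2 the first capture group
    df$newColumn[row] <- str_match(df$Text[row], regex)[1, 2]
}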

I tried using apply/lapply after reading several posts, but none of my attempts produced the expected result. Is this even possible with a function of the apply family, and if so, how?

Example:

For

regex <- "^[0-9]*([a-zA-Z]*)$";

and a table like the following:

ID   Text         
------------------
1    231Ben
2    112Claudine
3    538Julia

I would expect:

ID   Text          NewColumn
----------------------------
1    231Ben          Ben
2    112Claudine     Claudine
3    538Julia        Julia

Answer 1

str_match, gsub/sub and similar functions are vectorized, so there is no need to loop through the rows when the pattern is the same for every row:

df1$NewColumn <- gsub("\\d+", "", df1$Text)

Or with stringr functions:

library(stringr)
# column 1 of the result is the full match, column 2 the first capture group
df1$NewColumn <- str_match(df1$Text, "([A-Za-z]+)")[, 2]

str_extract(df1$Text, "[A-Za-z]+")
#[1] "Ben"      "Claudine" "Julia"
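
The regex from the question can also be applied in one vectorized pass without stringr. The following is a minimal sketch in base R, assuming the sample data from the question; the data frame df is rebuilt here only for illustration, and the handling of non-matching rows (NA) is my own choice rather than something the question specifies.

```r
# Sample data from the question, rebuilt for illustration
df <- data.frame(ID = 1:3,
                 Text = c("231Ben", "112Claudine", "538Julia"),
                 stringsAsFactors = FALSE)
regex <- "^[0-9]*([a-zA-Z]*)$"

# regexec() is vectorized over the whole column; regmatches() returns a list
# with one character vector per row: full match first, then the capture groups
m <- regmatches(df$Text, regexec(regex, df$Text))

# Pick the first capture group (element 2); rows without a match give
# character(0), which is mapped to NA here (an assumption about desired behavior)
df$NewColumn <- vapply(
  m,
  function(x) if (length(x) >= 2) x[2] else NA_character_,
  character(1)
)

print(df$NewColumn)
```

A shorter alternative is sub(regex, "\\1", df$Text), but note that sub() returns the input unchanged for rows where the pattern does not match, so it cannot distinguish a failed match from a successful one.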
