
Efficiently adding column to dataframe based on values from regex capture groups from other columns

I want to add a column to an existing dataframe, where the value of newColumn is taken from a capture group of a regex applied to another value in the same row. The only thing I have come up with that works so far is this standard (and probably not very R-esque) looping approach, but it is awfully slow for a DF of around 1.5 million rows.

Dataframe with columns:

ID    Text    NewColumn

At the moment I work with this:

library(stringr)

df$newColumn <- rep("", nrow(df))
for (row in seq_len(nrow(df))) {
    # first row, second column of the match matrix = first capture group
    df$newColumn[row] <- str_match(df$Text[row], regex)[1, 2]
}

I tried using apply/lapply after reading several posts, but none of my attempts produced the expected result. Is this even possible with a function of the apply family, and if so, how?

Example:

For

regex <- "^[0-9]*([a-zA-Z]*)$";

and a table like the following:

ID   Text         
------------------
1    231Ben
2    112Claudine
3    538Julia

I would expect:

ID   Text          NewColumn
----------------------------
1    231Ben          Ben
2    112Claudine     Claudine
3    538Julia        Julia

str_match, gsub/sub, etc. are vectorized, so we don't have to loop through the rows if the pattern is the same for every row.

df1$NewColumn <- gsub("\\d+", "", df1$Text)

Or with stringr functions:

library(stringr)
# str_match() returns a matrix: column 1 is the full match, column 2 the first capture group
df1$NewColumn <- str_match(df1$Text, "([A-Za-z]+)")[, 2]

str_extract(df1$Text, "[A-Za-z]+")
#[1] "Ben"      "Claudine" "Julia"  
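
To tie this back to the exact pattern in the question, here is a minimal base R sketch that pulls out the first capture group for the whole column in one call. The data frame is rebuilt from the example table, and `via_vapply` is a name introduced only for this illustration. It assumes every value matches the pattern; `sub()` returns non-matching strings unchanged, whereas `str_match(df1$Text, regex)[, 2]` would give `NA` for them.

```r
# Example data from the question (df1 and the column names follow the answer above).
df1 <- data.frame(
  ID = 1:3,
  Text = c("231Ben", "112Claudine", "538Julia"),
  stringsAsFactors = FALSE
)

# The pattern from the question: leading digits, then letters captured as group 1.
regex <- "^[0-9]*([a-zA-Z]*)$"

# Vectorized: replace each whole match with its first capture group.
# Strings that do not match the pattern are returned unchanged by sub().
df1$NewColumn <- sub(regex, "\\1", df1$Text)

# apply-family equivalent, shown only for comparison: it calls sub() once per
# element, so it is typically slower than the single vectorized call above.
via_vapply <- vapply(df1$Text, function(x) sub(regex, "\\1", x),
                     character(1), USE.NAMES = FALSE)
```

So an apply-family function can do the job, but since the regex functions already accept a whole vector, the single vectorized call is usually the simpler and faster choice for a column of this size.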
