
Create new field in R data.table using RegEx on another field

Given this data.table:

library(data.table)

dt <- data.table(f1 =  c(
  "stuffstuff-0000097125",
  "stuffstuff.abc.0006496679",
  "stuffstuff0007517235",
  "stuffstuff_xyz.0007280719",
  "stuffstuff0005995303",
  "stuffstuff_a1b_0000143856",
  "stuffstuff0009362407",
  "stuffstuff.c44_0009735298"
))

I would like to get these results:

                          f1 parsed_val
1:     stuffstuff-0000097125        
2: stuffstuff.abc.0006496679        abc
3:      stuffstuff0007517235        
4: stuffstuff_xyz.0007280719        xyz
5:      stuffstuff0005995303        
6: stuffstuff_a1b_0000143856        a1b
7:      stuffstuff0009362407        
8: stuffstuff.c44_0009735298        c44

This is what I tried:

rex_pattern <- "(?<=(\\.|\\_|\\-))[A-Za-z0-9]{3}(?=(\\.|\\_|\\-)[0-9]{3,})"

dt[, `:=`(parsed_val = regmatches(f1, regexpr(pattern = rex_pattern, f1, perl = TRUE)))]  

However, because of recycling, these are the results I get:

                          f1 parsed_val
1:     stuffstuff-0000097125        abc
2: stuffstuff.abc.0006496679        xyz
3:      stuffstuff0007517235        a1b
4: stuffstuff_xyz.0007280719        c44
5:      stuffstuff0005995303        abc
6: stuffstuff_a1b_0000143856        xyz
7:      stuffstuff0009362407        a1b
8: stuffstuff.c44_0009735298        c44
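The length mismatch behind these results can be reproduced without data.table. The sketch below uses base R only and assumes the same eight strings and the same pattern as above. It shows that regexpr() reports -1 for the strings with no match and that regmatches() then returns only the matched substrings, so the extracted vector is shorter than the column it is assigned to.

```r
# Same strings and pattern as in the question
f1 <- c(
  "stuffstuff-0000097125",
  "stuffstuff.abc.0006496679",
  "stuffstuff0007517235",
  "stuffstuff_xyz.0007280719",
  "stuffstuff0005995303",
  "stuffstuff_a1b_0000143856",
  "stuffstuff0009362407",
  "stuffstuff.c44_0009735298"
)
rex_pattern <- "(?<=(\\.|\\_|\\-))[A-Za-z0-9]{3}(?=(\\.|\\_|\\-)[0-9]{3,})"

m <- regexpr(rex_pattern, f1, perl = TRUE)
matched <- m > -1L          # FALSE where the pattern did not match
hits <- regmatches(f1, m)   # strings without a match contribute nothing

length(f1)    # number of input strings
length(hits)  # number of extracted values
```

Because hits carries no information about which rows it came from, assigning it straight into the column cannot line the values up with the right rows.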

I tried to use ifelse in a function to return an empty string:

getMmFromFilename <- function(my_file_name){
rex_pattern <- "(?<=(\\.|\\_|\\-))[A-Za-z0-9]{3}(?=(\\.|\\_|\\-)[0-9]{3,})"
nothing_found <- character(length = 0)

mm <- regmatches(my_file_name, regexpr(pattern = rex_pattern, my_file_name, perl = TRUE))
ifelse(identical(mm, nothing_found), "missing_Mm", mm)
}

dt[, .(parsed_val = getMmFromFilename(f1))]

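But this returns only a single value, abc. The regmatches documentation says: "For vector match data (as obtained from regexpr), empty matches are dropped; for list match data, empty matches give empty components (zero-length character vectors)." I suspect the solution lies there, but I haven't figured it out yet...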

As for the solution, my workflow requires me to use data.table, and a brief explanation of the solution would be very helpful...

Thanks in advance.
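One way around the behaviour quoted from the regmatches documentation is to pre-fill the result and write the matches back only at the positions where regexpr() found something. The following is a minimal sketch in base R, not necessarily the best approach; extract_or_empty is a hypothetical helper name and its default argument is an assumption added for illustration. It also avoids the ifelse problem in the function above: identical() returns a single TRUE or FALSE, and ifelse() returns a result the same length as its test, so that function can only ever return one value.

```r
# Hypothetical helper: one result per input string, `default` where nothing matched
extract_or_empty <- function(x, pattern, default = "") {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(default, length(x))   # start with the placeholder everywhere
  hit <- !is.na(m) & m > -1L       # positions that actually matched
  out[hit] <- regmatches(x, m)     # regmatches() returns exactly these matches
  out
}

rex_pattern <- "(?<=(\\.|\\_|\\-))[A-Za-z0-9]{3}(?=(\\.|\\_|\\-)[0-9]{3,})"
x <- c("stuffstuff-0000097125", "stuffstuff.abc.0006496679")
res <- extract_or_empty(x, rex_pattern)
res
```

Since the helper returns a vector as long as its input, it should be usable inside data.table in the usual way, for example dt[, parsed_val := extract_or_empty(f1, rex_pattern)].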

dt[, parser_val := sub(".*?[._](.*)[._].*|.*", "\\1", f1)]
dt
                          f1 parser_val
1:     stuffstuff-0000097125           
2: stuffstuff.abc.0006496679        abc
3:      stuffstuff0007517235           
4: stuffstuff_xyz.0007280719        xyz
5:      stuffstuff0005995303           
6: stuffstuff_a1b_0000143856        a1b
7:      stuffstuff0009362407           
8: stuffstuff.c44_0009735298        c44
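As a brief explanation of why this works, the sketch below applies the same pattern to two of the strings using base R only. sub() always returns one string per input, so nothing is dropped and no recycling can occur. The variable names are illustrative.

```r
pattern <- ".*?[._](.*)[._].*|.*"
# First alternative:
#   .*?[._]   shortest possible prefix, up to a "." or "_"
#   (.*)      the text to keep (capture group 1)
#   [._].*    a closing "." or "_" followed by the rest of the string
# Second alternative:
#   .*        matches any string at all, leaving group 1 empty

x <- c("stuffstuff.abc.0006496679", "stuffstuff0007517235")
res <- sub(pattern, "\\1", x)   # replace the whole match with group 1
res
```

Note that this pattern is looser than the one in the question: "-" is not in the character class, the token is not limited to three characters, and a string with more than two delimiters may yield more than one token in the captured group.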

If you want to use regmatches, you can use pattern = "(?<=[._]).*(?=[._])|$" with perl = TRUE:

dt[, parser_val := regmatches(f1, regexpr("(?<=[._]).*(?=[._])|$", f1, perl = TRUE))]
dt
                          f1 parser_val
1:     stuffstuff-0000097125           
2: stuffstuff.abc.0006496679        abc
3:      stuffstuff0007517235           
4: stuffstuff_xyz.0007280719        xyz
5:      stuffstuff0005995303           
6: stuffstuff_a1b_0000143856        a1b
7:      stuffstuff0009362407           
8: stuffstuff.c44_0009735298        c44
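This variant keeps all eight rows because of the trailing |$ alternative: every string has an end, so regexpr() always finds at least a zero-length match and regmatches() has no non-matches to drop. A small base R check, assuming two of the strings from above:

```r
x <- c("stuffstuff-0000097125", "stuffstuff.abc.0006496679")

m <- regexpr("(?<=[._]).*(?=[._])|$", x, perl = TRUE)
len <- attr(m, "match.length")  # 0 means only the "$" alternative matched
vals <- regmatches(x, m)        # one element per input string

len
vals
```

A zero-length match is extracted as an empty string, which is what fills the rows that have no delimited token.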
