Extract specific words using regex or substring from a column in R
I have the following data:
Opex_Spend_Month Opex_Spend_YTD Major_Category NBS_Region Sub_Category
92179.84 113542.84 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER
297.82 82392.82 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER
13974.8 34917.8 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER
138.6 63125.6 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER
NA 73097 Contingent Labour EUROPE TEMP:MSP NON IT
NA 96035 Contingent Labour EUROPE TEMP:MSP NON IT
1388.65 68934.65 Contingent Labour EUROPE TEMP:MSP NON IT
5393.76 18748.76 Contingent Labour EUROPE TEMP:MSP IT
528.38 82195.38 Contingent Labour EUROPE TEMP:MSP IT
22369 95468 Contingent Labour EUROPE TEMP:MSP IT
I would like to extract the last part of the Sub_Category column, i.e. Cont Worker, Non IT and IT, and I am not sure which regex or substring function to use.
Desired output:
Opex_Spend_Month Opex_Spend_YTD Major_Category NBS_Region Sub_Category Category
92179.84 113542.84 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER Cont Worker
297.82 82392.82 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER Cont Worker
13974.8 34917.8 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER Cont Worker
138.6 63125.6 Contingent Labour EUROPE TEMP:OTH.CONT.WORKER Cont Worker
NA 73097 Contingent Labour EUROPE TEMP:MSP NON IT Non IT
NA 96035 Contingent Labour EUROPE TEMP:MSP NON IT Non IT
1388.65 68934.65 Contingent Labour EUROPE TEMP:MSP NON IT Non IT
5393.76 18748.76 Contingent Labour EUROPE TEMP:MSP IT IT
528.38 82195.38 Contingent Labour EUROPE TEMP:MSP IT IT
22369 95468 Contingent Labour EUROPE TEMP:MSP IT IT
Can anyone help me with this?
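We can use str_extract from the stringr package, which returns the matched suffix exactly as it appears in the data (for example CONT.WORKER):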
library(stringr)
str_extract(df1$Sub_Category, "(CONT\\.WORKER|NON IT|IT)$")
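The desired output in the question uses labels such as "Cont Worker" and "Non IT", while the pattern above returns the raw text. A minimal base R sketch of one way to bridge that gap is shown here; the vector `sub_category` and the lookup table `labels` are hypothetical names introduced for illustration, not part of the original answer.

```r
# Sample values taken from the Sub_Category column in the question
sub_category <- c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT")

# Same alternation as in the answer; regexpr() locates the first match per element
pattern <- "(CONT\\.WORKER|NON IT|IT)$"
m <- regexpr(pattern, sub_category)

# Keep NA for elements without a match, fill in the matched text elsewhere
suffix <- rep(NA_character_, length(sub_category))
suffix[m > 0] <- regmatches(sub_category, m)

# Hypothetical lookup table: raw suffix -> label used in the desired output
labels <- c("CONT.WORKER" = "Cont Worker", "NON IT" = "Non IT", "IT" = "IT")
category <- unname(labels[suffix])
print(category)
```

A lookup table is used because the desired casing is irregular (IT stays upper case), so a generic title-case conversion would not reproduce it.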
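You can also do this with gsub. Note that the replacement "\\2 " leaves a trailing space on each value, which trimws() removes if needed: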
gsub(".*?(\\.|\\s)(\\w+)","\\2 ",dat$Sub_Category)
Here is an example. Selecting just the last two columns (5:6) shows what happens:
transform(dat,category=gsub(".*?(\\.|\\s)(\\w+)","\\2 ",Sub_Category))[5:6]
Sub_Category category
1 TEMP:OTH.CONT.WORKER CONT WORKER
2 TEMP:OTH.CONT.WORKER CONT WORKER
3 TEMP:OTH.CONT.WORKER CONT WORKER
4 TEMP:OTH.CONT.WORKER CONT WORKER
5 TEMP:MSP NON IT NON IT
6 TEMP:MSP NON IT NON IT
7 TEMP:MSP NON IT NON IT
8 TEMP:MSP IT IT
9 TEMP:MSP IT IT
10 TEMP:MSP IT IT
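The printed data frame hides the fact that each value produced by this replacement ends in a space. A short sketch, assuming a small hand-built vector `sub_category` in place of `dat$Sub_Category`, makes that visible and trims it:

```r
# Sample values taken from the Sub_Category column in the question
sub_category <- c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT")

# Same call as in the answer: each match keeps only the word after a dot or space
raw <- gsub(".*?(\\.|\\s)(\\w+)", "\\2 ", sub_category)

# Every result ends with the space from the replacement string
print(endsWith(raw, " "))

# trimws() strips the leading/trailing whitespace
clean <- trimws(raw)
print(clean)
```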
In base R:
df$Category = trimws(gsub('([A-Z]+:[A-Z]+|\\.)', ' ', df$Sub_Category))
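To try the base R line end to end, the sketch below rebuilds three rows of the question's data by hand (the data frame `df` here is an illustrative subset, not the full data set) and adds the new column:

```r
# A few rows from the question, rebuilt by hand for illustration
df <- data.frame(
  Opex_Spend_YTD = c(113542.84, 73097, 18748.76),
  Sub_Category = c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT"),
  stringsAsFactors = FALSE
)

# Replace the "TEMP:XXX" prefix and every dot with a space,
# then trim the leading spaces that the replacement leaves behind
df$Category <- trimws(gsub("([A-Z]+:[A-Z]+|\\.)", " ", df$Sub_Category))
print(df[, c("Sub_Category", "Category")])
```

The result keeps the upper-case spelling of the source column (CONT WORKER, NON IT, IT), so any recasing to match the desired output would be a separate step.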