Extract specific words using regex or substring from a column in R

I have the following data:

Opex_Spend_Month  Opex_Spend_YTD  Major_Category     NBS_Region  Sub_Category
92179.84          113542.84       Contingent Labour  EUROPE      TEMP:OTH.CONT.WORKER
297.82            82392.82        Contingent Labour  EUROPE      TEMP:OTH.CONT.WORKER
13974.8           34917.8         Contingent Labour  EUROPE      TEMP:OTH.CONT.WORKER
138.6             63125.6         Contingent Labour  EUROPE      TEMP:OTH.CONT.WORKER
NA                73097           Contingent Labour  EUROPE      TEMP:MSP NON IT
NA                96035           Contingent Labour  EUROPE      TEMP:MSP NON IT
1388.65           68934.65        Contingent Labour  EUROPE      TEMP:MSP NON IT
5393.76           18748.76        Contingent Labour  EUROPE      TEMP:MSP IT
528.38            82195.38        Contingent Labour  EUROPE      TEMP:MSP IT
22369             95468           Contingent Labour  EUROPE      TEMP:MSP IT

I want to extract the last part of the Sub_Category column (Cont Worker, Non IT, or IT), and I am not sure which regex or substring function to use.

Expected output:

Opex_Spend_Month  Opex_Spend_YTD  Major_Category     NBS_Region  Sub_Category          Category
92179.84          113542.84       Contingent Labour  EUROPE      TEMP:OTH.CONT.WORKER  Cont Worker
297.82            82392.82        Contingent Labour  EUROPE      TEMP:OTH.CONT.WORKER  Cont Worker
13974.8           34917.8         Contingent Labour  EUROPE      TEMP:OTH.CONT.WORKER  Cont Worker
138.6             63125.6         Contingent Labour  EUROPE      TEMP:OTH.CONT.WORKER  Cont Worker
NA                73097           Contingent Labour  EUROPE      TEMP:MSP NON IT       Non IT
NA                96035           Contingent Labour  EUROPE      TEMP:MSP NON IT       Non IT
1388.65           68934.65        Contingent Labour  EUROPE      TEMP:MSP NON IT       Non IT
5393.76           18748.76        Contingent Labour  EUROPE      TEMP:MSP IT           IT
528.38            82195.38        Contingent Labour  EUROPE      TEMP:MSP IT           IT
22369             95468           Contingent Labour  EUROPE      TEMP:MSP IT           IT

Can someone help me?

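We can use str_extract from the stringr package:

library(stringr)
# Extract the trailing label anchored at the end of the string.
# "NON IT" is listed before "IT" in the alternation.
# The result keeps the source spelling, e.g. "CONT.WORKER", "NON IT", "IT".
str_extract(df1$Sub_Category, "(CONT\\.WORKER|NON IT|IT)$")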
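
For readers without stringr installed, the same extraction can be sketched in base R with regexpr() and regmatches(). The vector name sub_category and its three sample values are assumptions made for illustration (the values are copied from the question's Sub_Category column); the final gsub() step, which swaps the dot for a space, is an addition and not part of the answer above.

```r
# Sample values copied from the Sub_Category column in the question
# (the vector name is hypothetical)
sub_category <- c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT")

# Same pattern as the str_extract() call above
pattern <- "(CONT\\.WORKER|NON IT|IT)$"

# regexpr() locates the first match in each string; regmatches() pulls it out.
# Unlike str_extract(), elements with no match are dropped instead of
# being returned as NA.
label <- regmatches(sub_category, regexpr(pattern, sub_category))

# Replace the literal dot with a space (fixed = TRUE disables regex parsing)
label <- gsub(".", " ", label, fixed = TRUE)
print(label)
```

Because regmatches() silently drops non-matching elements, this form is only a safe column assignment when every row is known to match the pattern.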

You can do:

 gsub(".*?(\\.|\\s)(\\w+)","\\2 ",dat$Sub_Category)

Here is an example. Just select the last two columns (5:6) and you will see what happens:

transform(dat,category=gsub(".*?(\\.|\\s)(\\w+)","\\2 ",Sub_Category))[5:6]
           Sub_Category     category
1  TEMP:OTH.CONT.WORKER CONT WORKER 
2  TEMP:OTH.CONT.WORKER CONT WORKER 
3  TEMP:OTH.CONT.WORKER CONT WORKER 
4  TEMP:OTH.CONT.WORKER CONT WORKER 
5       TEMP:MSP NON IT      NON IT 
6       TEMP:MSP NON IT      NON IT 
7       TEMP:MSP NON IT      NON IT 
8           TEMP:MSP IT          IT 
9           TEMP:MSP IT          IT 
10          TEMP:MSP IT          IT 
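
Note that every value in the category column above ends with a trailing space, because the replacement string is "\\2 ". A minimal sketch of one way to drop it is to wrap the call in trimws(). The vector name sub_category is hypothetical, and perl = TRUE is added here as an assumption, only to make the lazy quantifier's semantics explicit; the answer's own call does not use it.

```r
# Sample values copied from the Sub_Category column in the question
# (the vector name is hypothetical)
sub_category <- c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT")

# Each match lazily consumes text up to a "." or whitespace character,
# then captures the following word; the match is replaced by that word
# plus one space.
raw <- gsub(".*?(\\.|\\s)(\\w+)", "\\2 ", sub_category, perl = TRUE)

# trimws() removes the trailing space left by the replacement string
category <- trimws(raw)
print(category)
```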

In base R:

df$Category = trimws(gsub('([A-Z]+:[A-Z]+|\\.)', ' ', df$Sub_Category))
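
The answers so far return upper-case labels ("CONT WORKER"), while the expected output in the question uses "Cont Worker" and "Non IT". One possible way to get that exact casing is a small lookup vector applied after the extraction. This is only a sketch: the three-row data frame and the casing_map vector are hypothetical, and any label missing from the map would come back as NA.

```r
# Hypothetical three-row data frame with values from the question
df <- data.frame(
  Sub_Category = c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT"),
  stringsAsFactors = FALSE
)

# Extraction step from the base R answer: blank out the "TEMP:XXX" prefix
# and the dots, then trim the surrounding whitespace
upper <- trimws(gsub("([A-Z]+:[A-Z]+|\\.)", " ", df$Sub_Category))

# Hypothetical lookup from the extracted label to the casing shown in the
# question's expected output
casing_map <- c("CONT WORKER" = "Cont Worker", "NON IT" = "Non IT", "IT" = "IT")

# Index the named vector by label; unname() drops the carried-over names
df$Category <- unname(casing_map[upper])
print(df)
```

A lookup is used here instead of a generic title-case function because "IT" must stay fully upper-case.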
