
R regex extract substring followed by end of line or specific character (lazy match)

I have a vector mystr, the elements of which contain the unit of measure for a given parameter; this is indicated by the letters, symbols etc. following UOM=. The unit may be placed at the end of a string or delimited by a semicolon ;

c("\\\\Server-01?6cf038ea-d583-4860-9488-67ee59c767c2\\expnum.2PDT35103?6438;TimeMethod=AtOrBefore;UOM=inHg;pointtype=Float32;displaydigits=1", 
"\\\\Server02-01?6cf038ea-d583-4860-9488-67ee59c767c2\\testnum.2BTAVGBARPR.OUT?6449;TimeMethod=AtOrBefore;UOM=inHg", 
"\\\\Server02-01?6cf038ea-d583-4860-9488-67ee59c767c2\\testnum3.2PT39248S.XQ01?6453;TimeMethod=AtOrBefore;UOM=psia;pointtype=Float32;displaydigits=1")

In the above example, I'd like to extract inHg, inHg and psia respectively. So far I've tried using regmatches and regexec but haven't found anything that works for all three examples here:

regex_func <- function(string, ptrn){
  return(regmatches(x = string, m = regexec(pattern = ptrn, text = string))[[1]][2])
}

> sapply(mystr, function(z){ regex_func(string = z, ptrn = '.*UOM=(.*)[$;]?')}, USE.NAMES = F)
[1] "inHg;pointtype=Float32;displaydigits=1" "inHg"                                  
[3] "psia;pointtype=Float32;displaydigits=1"

> sapply(mystr, function(z){ regex_func(string = z, ptrn = '.*UOM=(.*)[$;]+?')}, USE.NAMES = F)
[1] "inHg" NA     "psia"

> sapply(mystr, function(z){ regex_func(string = z, ptrn = '.*UOM=(.*)[$;]')}, USE.NAMES = F)
[1] "inHg;pointtype=Float32" NA                       "psia;pointtype=Float32"

> sapply(mystr, function(z){ regex_func(string = z, ptrn = '.*UOM=(.*)[$;]{0,1}')}, USE.NAMES = F)
[1] "inHg;pointtype=Float32;displaydigits=1" "inHg"                                  
[3] "psia;pointtype=Float32;displaydigits=1"

I'm not tied to regmatches and am open to using other functions/packages as well, e.g. stringr, stringi.

EDIT

Added a sample data frame with all info. Not all ConfigString elements have UOM.

structure(list(Name = c("Ambient Pressure", "Ambient RH", "Ambient Temperature", 
"Average Exhaust Gas Temp", "Bellmouth Temperature", "Compressor Discharge Pressure", 
"Compressor Discharge Temperature", "Current Power Output", "Degradation in Heat Rate (Comp Effic)", 
"Degradation in Power Output (Comp Effic)", "DirtPenetratingToEngineSinceLastWash", 
"Fuel Gas Temperature", "Fuel Heating Value (by volume)", "Fuel Volumetric Flow", 
"GT Fired Hours", "HRSG HP Steam Outlet Mass Flow", "HRSG HP Steam Outlet Pressure", 
"HRSG HP Steam Outlet Temperature", "HRSG IP Steam Outlet Mass Flow", 
"HRSG IP Steam Outlet Pressure", "HRSG IP Steam Outlet Temperature", 
"HRSG LP Steam Outlet Mass Flow", "HRSG LP Steam Outlet Pressure", 
"HRSG LP Steam Outlet Temperature", "Inlet Guide Vane Position", 
"Inlet system pressure drop", "Steam Injection Flow", "Steam Injection Pressure", 
"Steam Injection Temp"), DefaultUnitsName = c("kilopascal", "percent", 
"degree Celsius", "degree Celsius", "degree Celsius", "kilopascal", 
"degree Celsius", "megawatt", "kilojoule per kilowatt-hour", 
"megawatt", "gram", "degree Celsius", "BTU per standard cubic foot", 
"standard cubic foot per second", "hour", "kilogram per second", 
"kilopascal", "degree Celsius", "kilogram per second", "kilopascal", 
"degree Celsius", "kilogram per second", "kilopascal", "degree Celsius", 
"degree", "kilopascal", "kilogram per second", "pound-force per square inch", 
"degree Celsius"), DefaultUnitsNameAbbreviation = c("kPa", "%", 
"°C", "°C", "°C", "kPa", "°C", "MW", "kJ/kWh", "MW", "g", "°C", 
"BTU/scf", "scfs", "h", "kg/s", "kPa", "°C", "kg/s", "kPa", "°C", 
"kg/s", "kPa", "°C", "°", "kPa", "kg/s", "psi", "°C"), ConfigString = c("\\\\#\\asset1.2BTAVGBARPR.OUT?6449;TimeMethod=AtOrBefore;UOM=inHg", 
"\\\\#\\asset1.2BTAVGHUM.OUT?6423;TimeMethod=AtOrBefore", "\\\\#\\asset1.2BTAVGAMBTEMP.OUT?6446;TimeMethod=AtOrBefore;UOM=°F", 
"\\\\#\\asset1.2TEAVTX.ZQ01?6456;TimeMethod=AtOrBefore;UOM=°F", 
"\\\\#\\asset1.BT0110.CTGgtAIte01a?6802;TimeMethod=AtOrBefore;UOM=°F;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2PT39248S.XQ01?6453;TimeMethod=AtOrBefore;UOM=psia;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2TE35401S.XQ02?6457;TimeMethod=AtOrBefore;UOM=°F;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2JT38601S.XQ01?6450;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1", 
"\\\\#\\Degradation in Heat Rate (Comp Effic)?6275;TimeMethod=AtOrBefore", 
"\\\\#\\Degradation in Power Output (Comp Effic)?6274;TimeMethod=AtOrBefore", 
"\\\\#\\Dust?6273;TimeMethod=AtOrBefore", 
"\\\\#\\asset1.2TE36112.XQ01?6454;TimeMethod=AtOrBefore;UOM=°F;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2FC54SUM.XQ01?6448;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.BT0110.CTGgtFGvl01a?6801;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2CTGFiredHours?6800;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2FT5050S.XQ01?6455;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2PT5000S.XQ01?6799;TimeMethod=AtOrBefore;UOM=psi;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2TE5020S.XQ01?6798;TimeMethod=AtOrBefore;UOM=°F;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2FT5150S.XQ01?6447;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2PT5100S.XQ01?6797;UOM=psi;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2TE5120S.XQ01?6796;TimeMethod=AtOrBefore;UOM=°F;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2FT5250S.XQ01?6443;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2PT5200S.XQ01?6795;TimeMethod=AtOrBefore;UOM=psi;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2TE5220S.XQ01?6794;TimeMethod=AtOrBefore;UOM=°F;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2ZT35203.XQ01?6432;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2PDT35103?6438;TimeMethod=AtOrBefore;UOM=inHg;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2FT36602X.ZQ01?6792;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2PT245.XQ01?6793;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1", 
"\\\\#\\asset1.2TE240.XQ01?6791;TimeMethod=AtOrBefore;UOM=°F;pointtype=Float32;displaydigits=1"
)), row.names = c(NA, -29L), class = c("data.table", "data.frame"
))

# 15 items returned 
> regmatches(x = dat$ConfigString, regexpr(pattern = '[?;]UOM=\\K[^;]+', text = dat$ConfigString, perl = T))
 [1] "inHg" "°F"   "°F"   "°F"   "psia" "°F"   "°F"   "psi"  "°F"   "psi"  "°F"   "psi"  "°F"   "inHg" "°F"

Using the chosen solution with this data:

# operation on a vector
> regmatches(dat$ConfigString, regexpr(pattern = '[?;]UOM=\\K[^;]+', dat$ConfigString, perl = T))

# using := operator in data.table
> dat[, uom := regmatches(ConfigString, regexpr(pattern = '[?;]UOM=\\K[^;]+', ConfigString, perl = T))]
Error in `[.data.table`(dat, , `:=`(uom, regmatches(ConfigString, regexpr(pattern = "[?;]UOM=\\K[^;]+",  : 
  Supplied 15 items to be assigned to 29 items of column 'uom'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.

Using stringr:

> stringr::str_replace(dat$ConfigString, ".*[?;]UOM=([^;]+).*", "\\1")
 [1] "inHg"                                                                                          
 [2] "\\\\#\\asset1.2BTAVGHUM.OUT?6423;TimeMethod=AtOrBefore"                                        
 [3] "°F"                                                                                            
 [4] "°F"                                                                                            
 [5] "°F"                                                                                            
 [6] "psia"                                                                                          
 [7] "°F"                                                                                            
 [8] "\\\\#\\asset1.2JT38601S.XQ01?6450;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1"     
 [9] "\\\\#\\Degradation in Heat Rate (Comp Effic)?6275;TimeMethod=AtOrBefore"                       
[10] "\\\\#\\Degradation in Power Output (Comp Effic)?6274;TimeMethod=AtOrBefore"                    
[11] "\\\\#\\Dust?6273;TimeMethod=AtOrBefore"                                                        
[12] "°F"                                                                                            
[13] "\\\\#\\asset1.2FC54SUM.XQ01?6448;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1"      
[14] "\\\\#\\asset1.BT0110.CTGgtFGvl01a?6801;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1"
[15] "\\\\#\\asset1.2CTGFiredHours?6800;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1"     
[16] "\\\\#\\asset1.2FT5050S.XQ01?6455;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1"      
[17] "psi"                                                                                           
[18] "°F"                                                                                            
[19] "\\\\#\\asset1.2FT5150S.XQ01?6447;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1"      
[20] "psi"                                                                                           
[21] "°F"                                                                                            
[22] "\\\\#\\asset1.2FT5250S.XQ01?6443;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1"      
[23] "psi"                                                                                           
[24] "°F"                                                                                            
[25] "\\\\#\\asset1.2ZT35203.XQ01?6432;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1"      
[26] "inHg"                                                                                          
[27] "\\\\#\\asset1.2FT36602X.ZQ01?6792;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1"     
[28] "\\\\#\\asset1.2PT245.XQ01?6793;TimeMethod=AtOrBefore;pointtype=Float32;displaydigits=1"        
[29] "°F"  

You can use the following base R PCRE regex solution:

[?;]UOM=\K[^;]+

Alternatively, a stringr solution like

library(stringr)
str_match(x, "[?;]UOM=([^;]+)")[,2]

See the regex demo. Details:

  • [?;] - a ? or ; char
  • UOM= - a UOM= substring
  • \K - match reset operator: it discards the text matched so far, so only what follows is returned (requires perl=TRUE in base R)
  • [^;]+ - one or more chars other than a ; char.
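For comparison, here is a minimal sketch of the lazy-match route the question was aiming for. The patterns in the question fail for two reasons: inside a bracket expression, $ is a literal dollar sign rather than an end-of-string anchor, and the greedy (.*) runs to the end of the string before the optional class gets a chance to match. The shortened strings below are made up for illustration and are not the original data.

```r
# Illustrative, shortened inputs (not the original data)
x <- c("p?1;TimeMethod=AtOrBefore;UOM=inHg;pointtype=Float32;displaydigits=1",
       "p?2;TimeMethod=AtOrBefore;UOM=inHg",
       "p?3;TimeMethod=AtOrBefore")

# Inside [...] the $ is an ordinary character, not an anchor
dollar_is_literal <- grepl("[$;]", "a$b", perl = TRUE)

# Lazy group followed by an alternation: stop at the first ; or at end of string
m <- regexec("[?;]UOM=(.*?)(?:;|$)", x, perl = TRUE)

# Element 2 of each result is capture group 1; strings without a match give NA
lazy <- vapply(regmatches(x, m), function(g) g[2], character(1))
lazy
# [1] "inHg" "inHg" NA
```

The negated class [^;]+ used in the answer gives the same result without backtracking, which is likely why it reads as the simpler choice here.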

See the R demo:

x <- c("\\\\Server-01?6cf038ea-d583-4860-9488-67ee59c767c2\\expnum.2PDT35103?6438;TimeMethod=AtOrBefore;UOM=inHg;pointtype=Float32;displaydigits=1", 
"\\\\Server02-01?6cf038ea-d583-4860-9488-67ee59c767c2\\testnum.2BTAVGBARPR.OUT?6449;TimeMethod=AtOrBefore;UOM=inHg", 
"\\\\Server02-01?6cf038ea-d583-4860-9488-67ee59c767c2\\testnum3.2PT39248S.XQ01?6453;TimeMethod=AtOrBefore;UOM=psia;pointtype=Float32;displaydigits=1",
"\\\\Server02-01?6cf038ea-d583-4860-9488-67ee59c767c2\\testnum.2TE36112.XQ01?6454;TimeMethod=AtOrBefore;UOM=°F;pointtype=Float32;displaydigits=1")
regmatches(x, regexpr("[?;]UOM=\\K[^;]+", x, perl=TRUE))
## => [1] "inHg" "inHg" "psia" "°F"  
library(stringr)
str_match(x, "[?;]UOM=([^;]+)")[,2]
## => [1] "inHg" "inHg" "psia" "°F"  
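When some strings have no UOM= at all, as in the sample data frame in the question, regmatches with regexpr returns only the matched elements, so the result is shorter than the input (hence the "Supplied 15 items to be assigned to 29 items" error). Below is a sketch of one way to keep the output aligned with the input in base R. The helper name extract_uom and the shortened strings are assumptions made for illustration.

```r
# Hypothetical helper: returns one value per input string, NA where UOM= is absent
extract_uom <- function(s) {
  m <- regexpr("[?;]UOM=\\K[^;]+", s, perl = TRUE)
  out <- rep(NA_character_, length(s))
  hit <- as.integer(m) != -1L      # regexpr reports -1 for "no match"
  out[hit] <- regmatches(s, m)     # regmatches returns the matched elements only
  out
}

# Illustrative, shortened inputs (not the original data)
cfg <- c("a?6449;TimeMethod=AtOrBefore;UOM=inHg",
         "a?6423;TimeMethod=AtOrBefore",
         "a?6797;UOM=psi;pointtype=Float32;displaydigits=1")
uom <- extract_uom(cfg)
uom
# [1] "inHg" NA     "psi"
```

With the data.table from the question, this could presumably be used as dat[, uom := extract_uom(ConfigString)]. The str_match call above already behaves this way, returning NA for rows without a match.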
