R: Extract distinct pattern from KeyValue list
I have a dataset which looks similar to:
quest<-data.frame(city=c("Atlanta","New York","Atlanta","Tampa"), key_value=c("rev=63;code=ATL;qty=1;zip=45987","rev=10.60|34;qty=1|2;zip=12686|12694;code=NY","code=ATL;rev=12;qty=1;zip=74268","rev=3|24|8;qty=1|6|3;code=TPA;zip=33684|36842|30254"))
which corresponds to:
city key_value
1 Atlanta rev=63;code=ATL;qty=1;zip=45987
2 New York rev=10.60|34;qty=1|2;zip=12686|12694;code=NY
3 Atlanta code=ATL;rev=12;qty=1;zip=74268
4 Tampa rev=3|24|8;qty=1|6|3;code=TPA;zip=33684|36842|30254
I am trying to extract only one of the key-value pairs ("code") from the data, so that the result looks like this:
city code
1 Atlanta ATL
2 New York NY
3 Atlanta ATL
4 Tampa TPA
We can do this with a regex using a positive lookbehind:
quest$code <- gsub(".*(?<=code=)(\\w+)(;|$).*", "\\1", quest$key_value, perl = TRUE)
.* - match everything up to our lookbehind
(?<=code=) - match the position in the string where the preceding characters are "code="
(\\w+) - match the code and capture it in group one
(;|$) - match a semicolon or the end of the string (in the case of NY there is no semicolon afterwards)
.* - match the remainder of the string
city key_value code
1 Atlanta rev=63;code=ATL;qty=1;zip=45987 ATL
2 New York rev=10.60|34;qty=1|2;zip=12686|12694;code=NY NY
3 Atlanta code=ATL;rev=12;qty=1;zip=74268 ATL
4 Tampa rev=3|24|8;qty=1|6|3;code=TPA;zip=33684|36842|30254 TPA
Live example: https://regex101.com/r/UM7Cim/4
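One thing to keep in mind with the gsub approach: when a row has no "code=" key at all, the pattern does not match and gsub returns the whole input string unchanged. A minimal sketch of a variant that returns NA for such rows is below. The helper name extract_key and the sample vector are hypothetical, made up for illustration, and the sketch assumes no other key name ends in the key being searched for.

```r
# Hypothetical sample: the last element deliberately has no "code" key.
key_value <- c("rev=63;code=ATL;qty=1;zip=45987",
               "rev=10.60|34;qty=1|2;zip=12686|12694;code=NY",
               "rev=5;qty=2;zip=11111")

# Hypothetical helper: returns the value of `key`, or NA where the key is absent.
extract_key <- function(x, key) {
  x <- as.character(x)                       # also handles factor columns
  pattern <- paste0("(?<=", key, "=)[^;]*")  # same lookbehind idea as above
  m <- regexpr(pattern, x, perl = TRUE)      # -1 marks elements without a match
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)               # regmatches returns matched elements only
  out
}

extract_key(key_value, "code")
```

With this, rows lacking the key can be filtered with is.na() instead of being silently filled with the full key_value string.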
You can use strcapture, which returns the captured parts of regexes:
cbind(quest,
strcapture(
"code=([^;]*)",
quest$key_value,
data.frame(code=character())))
The regex "code=([^;]*)" looks for the text code= and then captures everything that isn't a semicolon. The data frame argument specifies the name and type of the returned value. Here I use cbind to return a data frame with an extra column.
> cbind(quest, strcapture("code=([^;]*)",quest$key_value,data.frame(code=character())))
city key_value code
1 Atlanta rev=63;code=ATL;qty=1;zip=45987 ATL
2 New York rev=10.60|34;qty=1|2;zip=12686|12694;code=NY NY
3 Atlanta code=ATL;rev=12;qty=1;zip=74268 ATL
4 Tampa rev=3|24|8;qty=1|6|3;code=TPA;zip=33684|36842|30254 TPA
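To illustrate how the data frame argument controls the name and type of the result, here is a small sketch that captures a numeric key instead. The input vector x is hypothetical (single-valued qty entries, and one element without the key); it is not the data from the question.

```r
# Hypothetical input: the third element has no "qty" key.
x <- c("code=ATL;qty=1", "qty=3;code=TPA", "rev=5")

# The prototype data frame names the column "qty" and asks for integers.
res <- strcapture("qty=([^;]*)", x, data.frame(qty = integer()))

str(res)
# Elements that do not match the pattern come back as NA.
```

Because the prototype column is integer(), the captured text is converted for you, so no separate as.integer() step is needed after extraction.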