

String match columns in two data frames, replace another column value only if is.na

test.data <- data.frame(summary = c("Execute commands as root via buffer overflow in Tooltalk database server (rpc.ttdbserverd)."
                                 ,"Information from SSL-encrypted sessions via PKCS #1."
                                 ,"ip_input.c in BSD-derived TCP/IP implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets."),
                        wascname=c(NA, NA, "Improper Input Handling"),stringsAsFactors = FALSE)

wascNames <- data.frame(wascname=c("Abuse of Functionality","Brute Force","Buffer Overflow","Content Spoofing"
                                   ,"Credential/Session Prediction","Cross-Site Scripting","Cross-Site Request Forgery","Denial of Service"
                                   ,"Fingerprinting","Format String","HTTP Response Smuggling","HTTP Response Splitting"
                                   ,"HTTP Request Smuggling","HTTP Request Splitting","Integer Overflows","LDAP Injection"
                                   ,"Mail Command Injection","Null Byte Injection","OS Commanding","Path Traversal"
                                   ,"Predictable Resource Location","Remote File Inclusion (RFI)","Routing Detour","Session Fixation"
                                   ,"SOAP Array Abuse","SSI Injection","SQL Injection","URL Redirector Abuse"
                                   ,"XPath Injection","XML Attribute Blowup","XML External Entities","XML Entity Expansion"
                                   ,"XML Injection","XQuery Injection","Cross-site Scripting","Directory Indexing"
                                   ,"Improper Filesystem Permissions","Improper Input Handling","Improper Output Handling","Information Leakage"
                                   ,"Insecure Indexing","Insufficient Anti-Automation","Insufficient Authentication","Insufficient Authorization"
                                   ,"Insufficient Password Recovery","Insufficient Process Validation","Insufficient Session Expiration","Insufficient Transport Layer Protection"
                                   ,"Remote File Inclusion","URl Redirector Abuse"),stringsAsFactors = FALSE)

Below is the code I have been trying to fix. If test.data$summary contains a string from wascNames$wascname, replace test.data$wascname, but only where it is NA:

test.data$wascname<-sapply(test.data$summary, function(x) 
      ifelse(identical(wascNames$wascname[str_detect(x,regex(wascNames$wascname, ignore_case = T))&
            is.na(test.data$wascname)==TRUE], character(0)),test.data$wascname,
            wascNames$wascname[str_detect(x,regex(wascNames$wascname, ignore_case = T))==TRUE]))

I want the following output:

[image: expected output]

Thank you in advance. I thought of using a for loop, but it would be too slow for 200,000 observations.

I believe this should work:

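library(stringr)  # provides str_detect() and regex()

# For each row: keep an existing wascname; otherwise take the first matching name (NA if none match)
test.data$wascname2 <- sapply(seq_len(nrow(test.data)), function(x)  ifelse(is.na(test.data$wascname[x]), 
                                              wascNames$wascname[str_detect(test.data$summary[x], regex(wascNames$wascname, ignore_case = TRUE))],
                                              test.data$wascname[x]))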

test.data$wascname2
#[1] "Buffer Overflow"         NA                        "Improper Input Handling"

It still loops with sapply, but I think that's unavoidable given your data structure (i.e. for each string, you want to look it up in your wascNames$wascname table).
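If the per-row loop turns out to be slow on 200,000 rows, one possible alternative is to loop over the lookup table (50 names here) instead of over the rows, and let each match run vectorised across all summaries. The sketch below is only an illustration under some assumptions: it uses base R's grepl() with fixed = TRUE and tolower() rather than stringr, so names are matched as literal text and not as regular expressions, and it uses a shortened stand-in for the wascNames table. As with the sapply version, a row that matches several names gets the first one in table order.

```r
# Small stand-in data with the same column names as in the question
test.data <- data.frame(
  summary = c("Execute commands as root via buffer overflow in Tooltalk database server.",
              "Information from SSL-encrypted sessions via PKCS #1.",
              "ip_input.c allows remote attackers to cause a denial of service via crafted packets."),
  wascname = c(NA, NA, "Improper Input Handling"),
  stringsAsFactors = FALSE
)
# Shortened lookup table (assumption: the real one has all 50 names)
wascNames <- data.frame(
  wascname = c("Brute Force", "Buffer Overflow", "Denial of Service"),
  stringsAsFactors = FALSE
)

filled <- test.data$wascname
lower_summary <- tolower(test.data$summary)  # lower-case once for case-insensitive matching

# Loop over the names, not the rows: each grepl() call scans every summary at once
for (name in wascNames$wascname) {
  # Only rows that are still NA and contain the name as a literal substring are filled
  hit <- is.na(filled) & grepl(tolower(name), lower_summary, fixed = TRUE)
  filled[hit] <- name
}

test.data$wascname2 <- filled
print(test.data$wascname2)
```

Because rows that already have a value are never touched, the third row keeps "Improper Input Handling" even though its summary also contains "denial of service".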
