[英]Compare one column in a dataframe to two columns of another dataframe
[英]Compare dataframe column to another dataframe column
我有一個包含頁面路徑的數據框列(讓我們稱之為A):
pagePath
/text/other_text/123-string1-4571/text.html
/text/other_text/string2/15-some_other_txet.html
/text/other_text/25189-string3/45112-text.html
/text/other_text/text/string4/5418874-some_other_txet.html
/text/other_text/string5/text/some_other_txet-4157/text.html
/text/other_text/123-text-4571/text.html
/text/other_text/125-text-471/text.html
我有另一個字符串數據幀列讓我們調用它(B)(兩個數據幀不同,它們沒有相同的行數)。
這是我在數據框B中的列的示例:
names
string1
string11
string4
string3
string2
string10
string5
string100
我想要做的是檢查我的頁面路徑(A)是否包含來自我的其他數據幀(B)的字符串。
我遇到了困難,因為我的兩個數據幀長度不一樣,而且數據沒有組織。
預期輸出
我希望得到這個輸出結果:
pagePath names exist
/text/other_text/123-string1-4571/text.html string1 TRUE
/text/other_text/string2/15-some_other_txet.html string2 TRUE
/text/other_text/25189-string3/45112-text.html string3 TRUE
/text/other_text/text/string4/5418874-some_other_txet.html string4 TRUE
/text/string5/text/some_other_txet-4157/text.html string5 TRUE
/text/other_text/123-text-4571/text.html NA FALSE
/text/other_text/125-text-471/text.html NA FALSE
如果我的問題需要進一步澄清,請提及此問題。
我們可以使用grepl()
生成exist
列
# Collapse B$names into one string with "|"
onestring <- paste(B$names, collapse = "|")
# Generate new column
A$exist <- grepl(onestring, A$pagePath)
不太好,因為包含for循環:
names <- rep(NA, length(A$pagePath))
exist <- rep(FALSE, length(A$pagePath))
for (name in B$names) {
names[grep(name, A$pagePath)] <- name
exist[grep(name, A$pagePath)] <- TRUE
}
我們可以在stringr
包中使用str_extract_all
,但NA
被替換為character(0)
所以我們必須更改它
df$names <- as.character(str_extract_all(df$pagePath, "string[0-9]+"))
df$exist <- df$names %in% df1$names
df[df=="character(0)"] <- NA
df
# pagePath names exist
#1 /text/other_text/123-string1-4571/text.html string1 TRUE
#2 /text/other_text/string2/15-some_other_txet.html string2 TRUE
#3 /text/other_text/25189-string3/45112-text.html string3 TRUE
#4 /text/other_text/text/string4/5418874-some_other_txet.html string4 TRUE
#5 /text/other_text/string5/text/some_other_txet-4157/text.html string5 TRUE
#6 /text/other_text/123-text-4571/text.html <NA> FALSE
#7 /text/other_text/125-text-471/text.html <NA> FALSE
數據
dput(df)
structure(list(pagePath = structure(c(1L, 5L, 4L, 7L, 6L, 2L,
3L), .Label = c("/text/other_text/123-string1-4571/text.html",
"/text/other_text/123-text-4571/text.html", "/text/other_text/125-text-471/text.html",
"/text/other_text/25189-string3/45112-text.html", "/text/other_text/string2/15-some_other_txet.html",
"/text/other_text/string5/text/some_other_txet-4157/text.html",
"/text/other_text/text/string4/5418874-some_other_txet.html"), class = "factor")), .Names = "pagePath", class = "data.frame", row.names = c(NA,
-7L))
dput(df1)
structure(list(names = structure(c(1L, 4L, 7L, 6L, 5L, 2L, 8L,
3L), .Label = c("string1", "string10", "string100", "string11",
"string2", "string3", "string4", "string5"), class = "factor")), .Names = "names", class = "data.frame", row.names = c(NA,
-8L))
以下是使用apply的一種方法:
df$exist <- apply( df,1,function(x){as.logical(grepl(x[2],x[1]))} )
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.