
Find the row in a data frame with the closest match in R

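I have a data frame in R with 5 rows (records) of 3 attributes. Given a new record with the same 3 attributes, what is the best way to find which of these 5 rows is most similar to it in terms of content (values)?

Existing data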

Age Occupation Nationality
23  Builder    German
29  Worker     British
45  Contractor Vietnamese
24  Engineer   German
28  Doctor     Indian

New data

23  Doctor German

Expected output

23  Builder    German

I want to return row 1, shown above, because two of its attributes match.

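# Existing records; keep strings as character vectors, not factors
df <- data.frame(
  Age = c(23, 29, 45, 24, 28),
  Occupation = c("Builder", "Worker", "Contractor", "Engineer", "Doctor"),
  Nationality = c("German", "British", "Vietnamese", "German", "Indian"),
  stringsAsFactors = FALSE
)

# New record; c() coerces every element to character
newdata <- c(23, "Doctor", "German")

# Count the matching fields in each row, then keep the row with the most matches
df[which.max(apply(df, 1, function(vec, dat) sum(vec == dat), newdata)), ]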

  Age Occupation Nationality
1  23    Builder      German

which.max returns only the first maximum, so in case of a tie you can get all of the best matches with:

detmatches<-apply(df,1,function(vec,dat){sum(vec==dat)},newdata)
df[which(detmatches==max(detmatches)),]
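One caveat with the apply approach: apply first converts the data frame to a character matrix, and numeric columns are formatted to a common width in the process, so an Age of 5 stored next to 23 would become " 5" and no longer equal "5". The following is a minimal sketch of a column-wise alternative that avoids this conversion, assuming newdata is supplied as a list in the same column order as df. The names hits and matches are only illustrative.

```r
df <- data.frame(
  Age = c(23, 29, 45, 24, 28),
  Occupation = c("Builder", "Worker", "Contractor", "Engineer", "Doctor"),
  Nationality = c("German", "British", "Vietnamese", "German", "Indian"),
  stringsAsFactors = FALSE
)

# Assumption: the new record is a list, one element per column of df, in order
newdata <- list(23, "Doctor", "German")

# Compare each column with its corresponding element of newdata;
# every comparison yields a logical vector of length nrow(df)
hits <- Map(`==`, df, newdata)

# Add the logical vectors element-wise to get the number of matches per row
matches <- Reduce(`+`, hits)

# Keep every row that reaches the highest match count (ties included)
best <- df[matches == max(matches), ]
best
```

Because each column is compared in its own type, numbers are compared as numbers and strings as strings.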

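You can use stringdist from the stringdist package with method = 'jaccard'. With Map, we compare each column of df with the corresponding element of the list newdata. For example, the Age column of df is compared with 23 using stringdist, Occupation with "Doctor", and so on. After applying stringdist, each list element holds a numeric vector of distances whose length equals nrow(df). Reduce adds the corresponding values together (+), and which.min then finds the position of the smallest total (the output is an integer index). That index is used to subset df.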

library(stringdist)
df[which.min(Reduce(`+`,Map(stringdist,df, newdata,
                                 method='jaccard'))),]

#  Age Occupation Nationality
#1  23    Builder      German
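To make the Map / Reduce / which.min steps visible, here is a rough base-R sketch that does not need the stringdist package. It assumes that Jaccard distance with single-character q-grams means one minus the share of distinct characters two strings have in common; jaccard_chars is a hypothetical helper written for this illustration, not a function from stringdist, and it is not meant to reproduce the package's behaviour in every edge case.

```r
# Hypothetical helper: Jaccard distance between the sets of distinct
# characters in two values (both are converted to strings first)
jaccard_chars <- function(a, b) {
  x <- unique(strsplit(as.character(a), "")[[1]])
  y <- unique(strsplit(as.character(b), "")[[1]])
  1 - length(intersect(x, y)) / length(union(x, y))
}

df <- data.frame(
  Age = c(23, 29, 45, 24, 28),
  Occupation = c("Builder", "Worker", "Contractor", "Engineer", "Doctor"),
  Nationality = c("German", "British", "Vietnamese", "German", "Indian"),
  stringsAsFactors = FALSE
)
newdata <- list(23, "Doctor", "German")

# Step 1: one distance vector per column, each of length nrow(df)
per_column <- Map(function(col, target) {
  vapply(col, jaccard_chars, numeric(1), b = target, USE.NAMES = FALSE)
}, df, newdata)

# Step 2: add the per-column distances row by row
total <- Reduce(`+`, per_column)

# Step 3: which.min() returns the integer position of the smallest total
best <- which.min(total)
df[best, ]
```

Unlike exact matching, this scores partial similarity, so "24" counts as closer to "23" than "45" does.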

Data

df <-  structure(list(Age = c(23, 29, 45, 24, 28), Occupation = c("Builder", 
"Worker", "Contractor", "Engineer", "Doctor"), Nationality = c("German", 
"British", "Vietnamese", "German", "Indian")), .Names = c("Age", 
"Occupation", "Nationality"), row.names = c(NA, -5L), class = "data.frame")

newdata <- list(23,"Doctor","German")
