Calculating Levenshtein / Hamming distance by grouping variable

Question

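I am trying to calculate the accuracy of participants' responses (column MEM_Response) against the correct responses (column MEM_Correct). The grouping variable is the participant ID (column SERIAL here, with 15 cases per participant).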

dput(example)
structure(list(MEM_Correct = c("ZLHK", "RZKX", "DGWL", "BCJSP", 
"WRKTJ", "CHBXS", "HNDCWX", "SWVNDT", "WLDGPB", "DSHRKBV", "HCXLZWB", 
"HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD", "ZLHK", "RZKX", 
"DGWL", "BCJSP", "WRKTJ", "CHBXS", "HNDCWX", "SWVNDT", "WLDGPB", 
"DSHRKBV", "HCXLZWB", "HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD"
), MEM_Response = c("ZLHK", "RZKX", "DGWL", "BCJSP", "WRKLTJ", 
"CHBXS", "HNDCWX", "SWVDTN", "WLDGPB", "DSHRKBV", "HCXLZWB", 
"HDNBVZC", "BCRHKVDM", "RVTBWKFS", "NWHVZFLD", "ZLHK", "RZKX", 
"DGWL", "BCJSB", "WRKTJ", "CHBXA", "HDNDWX", "SWVNDT", "WLGPBD", 
"DSHKRBV", "WLGJHKK", "HDBNVZC", "BCHRKVBM", "RVGBKSNM", "NWHVZWHJ"
), SERIAL = c("4444", "4444", "4444", "4444", "4444", "4444", 
"4444", "4444", "4444", "4444", "4444", "4444", "4444", "4444", 
"4444", "5555", "5555", "5555", "5555", "5555", "5555", "5555", 
"5555", "5555", "5555", "5555", "5555", "5555", "5555", "5555"
)), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 
12L, 13L, 14L, 15L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 
26L, 27L, 28L, 29L, 30L, 31L), class = "data.frame")

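I have tried several methods to calculate accuracy (i.e. the distance between the correct response and the actual response), but so far I have not obtained a satisfactory output.

Using stringdist for the Hamming and Levenshtein distances:

Levenshtein: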

example$MEM_Lev = stringdist(example$MEM_Correct, example$MEM_Response, method = c("lv"))

Hamming:

example$MEM_Ham = stringdist(example$MEM_Correct, example$MEM_Response, method = c("hamming"))

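Problem: I now have the Hamming distance for each case, but how do I calculate accuracy per participant so that it ends up in a range between 0 and 1 (i.e. 0 to 100% accuracy)? Another problem with the Hamming distance is that cases of different length (see row 5: WRKTJ vs. WRKLTJ) produce Inf. So using the Levenshtein distance is probably better, right?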
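The length problem can be reproduced without any package. This is a minimal base-R sketch, not stringdist's implementation: hamming_dist() is a hypothetical helper written only to mirror the Inf behaviour described above, and adist() (base R, utils) stands in for the Levenshtein distance.

```r
# Hypothetical helper: Hamming distance between two single strings.
# Returns Inf when the lengths differ, mirroring the behaviour reported above.
hamming_dist <- function(a, b) {
  if (nchar(a) != nchar(b)) return(Inf)
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

hamming_dist("WRKTJ", "WRKLTJ")   # lengths 5 and 6, so not defined
hamming_dist("BCJSP", "BCJSB")    # same length, one substituted letter

# Levenshtein distance allows insertions and deletions,
# so strings of unequal length are not a problem
adist("WRKTJ", "WRKLTJ")[1, 1]
```

Because a response with one extra or missing letter is a plausible error in this task, a distance that allows insertions and deletions seems the better fit.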

Then I tried the with() function for the Levenshtein distance:

with(example, levenshteinSim(example$MEM_Correct, example$MEM_Response))

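I think this time the result is between 0 and 1, which is a step forward. Going back to row 5: WRKTJ (5 letters) differs from WRKLTJ (6 letters) in that the latter has an extra "L" in the middle. So one single edit (a deletion in this case) is needed to match the correct response. Its Levenshtein value is 0.8333, which corresponds to 5/6 correct (even though the correct value has only 5 letters). Am I using the right distance function?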
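The value 0.8333 is consistent with a similarity defined as 1 minus the edit distance divided by the length of the longer string. A minimal sketch under that assumption, using base R's adist() rather than levenshteinSim(); sim_correct is a hypothetical alternative that divides by the length of the correct answer instead:

```r
# Row 5 of the example data
correct  <- "WRKTJ"
response <- "WRKLTJ"

# adist() (base R, utils) returns the generalized Levenshtein distance
d <- adist(correct, response)[1, 1]   # one edit: the extra "L"

# Normalise by the longer string: gives 5/6, the 0.8333 seen above
sim_max <- 1 - d / max(nchar(correct), nchar(response))

# Hypothetical alternative: normalise by the length of the correct answer.
# Note this can drop below 0 when the response needs more edits
# than the correct answer has letters.
sim_correct <- 1 - d / nchar(correct)

round(c(sim_max = sim_max, sim_correct = sim_correct), 4)
```

Which denominator is appropriate depends on how accuracy is defined for the task; dividing by the longer string keeps the result within 0 and 1.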

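Finally, my last question:

How do I match/calculate the mean accuracy per participant? I have another df containing all of my participants, and I would like to merge each person's mean from example into that data frame, where 1 row = 1 participant.

I hope this makes sense; if not, I can try to provide more information. If you think I am not using the right approach, feel free to suggest another one.

Thanks in advance!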

Answer 1

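How to define "accuracy" is a methodological decision that you have to make yourself. There may be some references in the literature, but here is a suggestion.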

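library(stringdist)

# Levenshtein distance between correct answer (column 1) and response (column 2)
example$lv.dist <- stringdist(example[,1], example[,2], method="lv")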
head(example)
#   MEM_Correct MEM_Response SERIAL lv.dist
# 1        ZLHK         ZLHK   4444       0
# 2        RZKX         RZKX   4444       0
# 3        DGWL         DGWL   4444       0
# 4       BCJSP        BCJSP   4444       0
# 5       WRKTJ       WRKLTJ   4444       1
# 6       CHBXS        CHBXS   4444       0

aggregate(lv.dist ~ SERIAL, example, mean)
#   SERIAL  lv.dist
# 1   4444 0.200000
# 2   5555 1.866667

aggregate(lv.dist ~ SERIAL, example, function(x) round(mean(100/(1+x)), 2))
#   SERIAL lv.dist
# 1   4444   92.22
# 2   5555   54.17

# Using stringsim()
example$lv.sim <- stringsim(example[,1], example[,2], method="lv")

(agg <- aggregate(lv.sim ~ SERIAL, example, function(x) round(mean(x)*100, 2)))
#   SERIAL lv.sim
# 1   4444  96.67
# 2   5555  73.25

# Merging two data.frames is easy as long as they have a
# column in common (SERIAL in this case)
participants <- data.frame(age=7:9, SERIAL=c(5555, 4444, 1234))

merge(participants, agg)
#   SERIAL age lv.sim
# 1   4444   8  96.67
# 2   5555   7  73.25

merge(participants, agg, all=TRUE)
#   SERIAL age lv.sim
# 1   1234   9     NA
# 2   4444   8  96.67
# 3   5555   7  73.25
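
The same pipeline can be sketched end to end in base R. This is an illustration under stated assumptions, not the method above: it uses a four-row subset of the question's data, adist() in place of stringsim(), a made-up participants table, and SERIAL stored as character in both data frames so that the merge key has the same type on each side.

```r
# Toy subset of the question's data (two rows per participant)
example <- data.frame(
  MEM_Correct  = c("ZLHK", "WRKTJ", "BCJSP", "CHBXS"),
  MEM_Response = c("ZLHK", "WRKLTJ", "BCJSB", "CHBXA"),
  SERIAL       = c("4444", "4444", "5555", "5555"),
  stringsAsFactors = FALSE
)

# Row-wise Levenshtein distance via base R's adist()
d <- mapply(function(a, b) adist(a, b)[1, 1],
            example$MEM_Correct, example$MEM_Response,
            USE.NAMES = FALSE)

# Similarity in [0, 1]: 1 - distance / length of the longer string
example$lv.sim <- 1 - d / pmax(nchar(example$MEM_Correct),
                               nchar(example$MEM_Response))

# Mean similarity per participant
agg <- aggregate(lv.sim ~ SERIAL, example, mean)

# Hypothetical participant table (one row per participant)
participants <- data.frame(age = 7:9,
                           SERIAL = c("5555", "4444", "1234"),
                           stringsAsFactors = FALSE)

# all.x = TRUE keeps participants without any responses (lv.sim is NA)
merged <- merge(participants, agg, all.x = TRUE)
merged
```

Using all.x = TRUE rather than all = TRUE keeps every row of the participant table while ignoring any SERIAL that appears only in the response data.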
