How do I identify what is causing thrashing in my R function?
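I wrote a function that anonymizes the names in a data frame, given a key. Once it has anonymized a lot of names it slows to a crawl, and I don't understand why.
The data frame in question is a set of 4733 tweets collected through the Twitter API, where each row is a tweet with 32 columns of data. Names should be anonymized no matter which column they appear in, so I don't want to restrict the function to only a few of those 32 columns.
The key is a data frame of 211121 pairs of real and fake names; both the real and the fake names are unique within it. After anonymizing roughly 100k names, the function slows down dramatically.
The function looks like this: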
pseudonymize <- function(df, key) {
  for (name in key$realNames) {
    df <- as.data.frame(apply(df, 2, function(column) gsub(name, key[key$realNames == name, 2], column)))
  }
  df  # return the modified data frame; without this line the function returns NULL
}
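Is there something obvious here that would cause the slowdown? I have no experience at all with optimizing code for speed.
Edit 1:
Here are a few rows from the data frame to be anonymized.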
"https://twitter.com/__jgil/statuses/825559753447313408","__jgil",0.000576911235261567,756,4,13,17,7,16,23,10,0.28166915052161,0.390123456790124,0.00271311644806025,0.474529795261862,0.00641025649383664,"@jadahung20 GIRL I am tooooooo salty tonight lolll","lolll","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",4057,214,241,"Canada","Nouvelle-Ecosse","Middleton","indefini","Shari"
"https://twitter.com/__paigewhite/statuses/827988259573788673","__paigewhite",0,1917,0,8,8,0,9,9,16,0.143476044852192,0.162056634159209,0.000172947386274259,0,0,"@abbytutty_ i miss emily lololol _Ù÷â_Ù÷É","lololol","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",8366,392,661,"Canada","Nouvelle-Ecosse","indefini","indefini","Shari"
"https://twitter.com/_brookehynes/statuses/821022926287884288","_brookehynes",0,1917,1,6,7,1,7,8,1,1,1,0.000196850793912616,0.00393656926735126,0.200000002980232,"@tdesj3 @belle lol yea doubt it.","lol","adjoint","indefini","anglais","anglais","anglais","non","iPhone, Twitter",1184,87,70,"Canada","Nouvelle-Ecosse","Halifax","indefini","Shari"
Here are a few rows of the key.
"","realNames","fakeNames"
"1","________","Tajid_Pinkley"
"2","____________aho","Monica_Yujiri"
"3","___________ass","Alexander_Garay-Grajeda"
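Edit 2:
I have reduced the DF to just the two columns that need anonymizing. That made things faster, but it still grinds to a halt after about 155k names have been processed.
As requested in the comments, here is the dput() output for the first three rows of the DF to be anonymized.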
structure(list(
utilisateur = c("___Yeliab", "__courtlezz", "__courtlezz"),
texte = c("@EmilyIsPro ik lol", "@NikkiErica21 there was a sighting in sunset ridge too. Keep Winnie and bob safe lol", "@NikkiErica21 lol yes _Ã\231։")
),
row.names = c(NA, 3L),
class = "data.frame")
Here is the dput() of the first three rows of the key.
structure(list(
realNames = c("________", "____________aho", "___________ass"),
fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker")
),
row.names = c(NA, 3L),
class = "data.frame")
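It is more efficient to process the data as a vector rather than as a data.frame. I ran into some encoding issues, so I used iconv to convert the text to UTF-8; if the names contain non-ASCII characters, some further handling will be needed.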
key1 <- data.frame(
  realNames = c("________", "____________aho", "___________ass",
                "___Yeliab", "__courtlezz", "NikkiErica21", "EmilyIsPro", "aho"),
  fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker",
                "A_A", "B_B", "C_C", "D_D", "E_E"),
  stringsAsFactors = FALSE
)
pseudonymize1 <- function(df, key) {
  # Flatten the data frame to a plain character vector, remembering its shape
  mat <- as.matrix(df)
  dims <- attr(mat, which = "dim")
  cnam <- colnames(df)
  vec <- iconv(unclass(mat), from = "latin1", to = "UTF-8")
  # Apply each real -> fake substitution to the whole vector at once
  for (name in split(key, f = seq_len(nrow(key)))) {
    vec <- gsub(
      vec,
      pattern = name$realNames,
      replacement = name$fakeNames,
      fixed = TRUE)
  }
  # Restore the original dimensions and column names
  mat <- vec
  attr(mat, which = "dim") <- dims
  df <- as.data.frame(mat, stringsAsFactors = FALSE)
  colnames(df) <- cnam
  df
}
# df1 is assumed to be the three-row data frame from the question's dput() output
pseudonymize1(df1, key1)
# utilisateur texte
# 1 A_A @D_D ik lol
# 2 B_B @C_C there was a sighting in sunset ridge too. Keep Winnie and bob safe lol
# 3 B_B @C_C lol yes _Ã\u0083\u0099Ã\u0083·Ã\u0083¢
library(microbenchmark)
microbenchmark(
pseudonymize(df1, key1),
pseudonymize1(df1, key1)
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# pseudonymize(df1, key1) 1842.554 1885.6750 2131.089 1994.755 2294.6850 3007.371 100 b
# pseudonymize1(df1, key1) 287.683 306.1905 333.678 314.950 339.8705 497.301 100 a
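To see where the original function actually spends its time, base R's sampling profiler can be run over the loop. The following is only a minimal sketch: `toy_df`, `toy_key` and the size `n` are made-up stand-ins for the real data, and `pseudonymize()` is a copy of the question's function with an explicit return value added.

```r
# Synthetic stand-ins for the real data (hypothetical names and sizes)
n <- 500
toy_df <- data.frame(
  utilisateur = sprintf("user%04d", seq_len(n)),
  texte = sprintf("@user%04d lol", rev(seq_len(n))),
  stringsAsFactors = FALSE
)
toy_key <- data.frame(
  realNames = sprintf("user%04d", seq_len(n)),
  fakeNames = sprintf("anon%04d", seq_len(n)),
  stringsAsFactors = FALSE
)

# Copy of the question's function, with an explicit return value
pseudonymize <- function(df, key) {
  for (name in key$realNames) {
    df <- as.data.frame(apply(df, 2, function(column)
      gsub(name, key[key$realNames == name, 2], column)))
  }
  df
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)   # start the sampling profiler
res <- pseudonymize(toy_df, toy_key)
Rprof(NULL)                         # stop profiling

# summaryRprof() raises an error if the run was too short to collect samples
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))
```

The `by.self` table lists the functions in which the most samples were taken, which should indicate whether the time goes into `gsub`, into the repeated `apply`/`as.data.frame` conversions, or into the key lookup. Increasing `n` shows how each share grows with the size of the key.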
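One concern I have with 155k names is that, when they are searched for as regular expressions, you will find names that are contained within other names. This could be a real name within a real name (e.g. Emily within EmilyIsPro), or a real name within a fake name that was substituted earlier. You will need to test for this, and consider using random hashes rather than name-like pseudonyms.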
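One possible way around the name-within-a-name problem is to replace whole tokens only, in a single pass over the text, instead of running one substring substitution per key row. The sketch below is not part of the original answer. `pseudonymize_tokens()`, `toy_key` and `toy_text` are hypothetical, and it assumes that a name is a maximal run of ASCII letters, digits and underscores and that the text contains no NA values or invalid multibyte strings.

```r
# Hypothetical helper: single-pass, whole-token replacement
pseudonymize_tokens <- function(x, key) {
  # Assumption: a name is a maximal run of letters, digits and underscores
  m <- gregexpr("[A-Za-z0-9_]+", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    idx <- match(tokens, key$realNames)  # NA where the token is not a real name
    hit <- !is.na(idx)
    tokens[hit] <- key$fakeNames[idx[hit]]
    tokens
  })
  x
}

toy_key <- data.frame(
  realNames = c("Emily", "EmilyIsPro", "aho", "____________aho"),
  fakeNames = c("X_X", "D_D", "E_E", "Caleb_Dunn-Sparks"),
  stringsAsFactors = FALSE
)
toy_text <- c("@EmilyIsPro ik lol", "Emily says hi", "____________aho aho")

# A plain substring replacement clobbers the longer name:
clobbered <- gsub("Emily", "X_X", toy_text[1], fixed = TRUE)
# clobbered is "@X_XIsPro ik lol"

# Whole-token replacement leaves each name intact:
result <- pseudonymize_tokens(toy_text, toy_key)
# result is c("@D_D ik lol", "X_X says hi", "Caleb_Dunn-Sparks E_E")
```

Because every token is looked up exactly once, a fake name that has already been inserted is never scanned again, so the order of the key no longer matters. The cost is that a name is only found when it stands as a complete token; whether that matches how names appear in the real tweets would need to be checked.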