R: apply a user-defined function to all rows of a data frame
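I am struggling to loop over the rows of a data frame and use the current row to define the arguments passed to a function. Here is a sample data frame: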
df <-
structure(list(child = c("A268", "A268497", "A268497BOX", "A268497BOX2",
"A268497BOX218", "A277", "A277A79", "A277A79091", "A277A790911",
"A277A79091144", "A492", "A492586", "A492586BOX", "A492586BOX1",
"A492586BOX144", "A492A69", "A492A69027", "A492A690271", "A492A69027144",
"A492A6902715K", "A492A6902719Y", "A492A690271BH", "A492A690271BI",
"A492A690271CQ", "A492A690271CS", "A492A690271CT", "A492A690271CU",
"A492A690271CV", "A492A690271CW", "A492A690271CX", "A492A690271CY",
"A492A690271DA", "A492A69028", "A492A690281", "A492A69028144",
"A492A69402", "A492A694021", "A492A69402144", "A492A70", "A492A70033",
"A492A700331", "A492A70033144", "A492A700332", "A492A70033244",
"A492A70034", "A492A700341", "A492A70034144", "A492A70035", "A492A700351",
"A492A70035144"), clvl = c(2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 2, 3,
4, 5, 6, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4,
5, 6, 4, 5, 6, 3, 4, 5, 6, 5, 6, 4, 5, 6, 4, 5, 6), parent = c("A",
"A268", "A268497", "A268497BOX", "A268497BOX2", "A", "A277",
"A277A79", "A277A79091", "A277A790911", "A", "A492", "A492586",
"A492586BOX", "A492586BOX1", "A492", "A492A69", "A492A69027",
"A492A690271", "A492A690271", "A492A690271", "A492A690271", "A492A690271",
"A492A690271", "A492A690271", "A492A690271", "A492A690271", "A492A690271",
"A492A690271", "A492A690271", "A492A690271", "A492A690271", "A492A69",
"A492A69028", "A492A690281", "A492A69", "A492A69402", "A492A694021",
"A492", "A492A70", "A492A70033", "A492A700331", "A492A70033",
"A492A700332", "A492A70", "A492A70034", "A492A700341", "A492A70",
"A492A70035", "A492A700351"), plvl = c(1, 2, 3, 4, 5, 1, 2, 3,
4, 5, 1, 2, 3, 4, 5, 2, 3, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 3, 4, 5, 3, 4, 5, 2, 3, 4, 5, 4, 5, 3, 4, 5, 3, 4, 5
)), row.names = c(NA, 50L), class = "data.frame")
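My goal is to generate this:
I tried a loop with various versions of the apply functions inside it, but could not get it to work. What I need is for x and y to be the current row's child and pathString on each iteration. Is there a clean, simple way to do this?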
df[] <- apply(df,1,function(x,y) sub(x,y,x))
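As far as I can tell, the attempt above cannot work as written: apply() passes only the current row as the first argument, so y is never supplied. A minimal sketch of pairing two columns row by row with mapply() follows. The toy data frame and the swapped column are made up for illustration, and fixed = TRUE is my own addition so that x is treated as a literal string rather than a regular expression.

```r
# Hypothetical three-row frame holding the two columns of interest
toy <- data.frame(
  child      = c("A268", "A268497", "A268497BOX"),
  pathString = c("A/268", "A268/497", "A268497/BOX"),
  stringsAsFactors = FALSE
)

# mapply() walks both columns in parallel, so on every call
# x is the current row's child and y is its pathString
toy$swapped <- mapply(
  function(x, y) sub(x, y, x, fixed = TRUE),
  toy$child, toy$pathString,
  USE.NAMES = FALSE
)

print(toy)
```

This only shows how to get x and y per row. On its own it does not carry a result from one row to the next, which is what the accumulated path needs.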
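Assuming the number of characters in child (or pathString) keeps increasing within each group, as in the data shared, one way is to use purrr::accumulate, which feeds each output back in as the next input, and apply it by group.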
library(dplyr)
df %>%
group_by(gr = cumsum(c(TRUE, diff(nchar(child)) < 0))) %>%
mutate(ans = purrr::accumulate(pathString, ~sub(".*(/.*)",paste0(.x, "\\1"),.y)))
# child pathString gr ans
# <chr> <chr> <int> <chr>
# 1 A268 A/268 1 A/268
# 2 A268497 A268/497 1 A/268/497
# 3 A268497BOX A268497/BOX 1 A/268/497/BOX
# 4 A268497BOX2 A268497BOX/2 1 A/268/497/BOX/2
# 5 A268497BOX218 A268497BOX2/18 1 A/268/497/BOX/2/18
# 6 A277 A/277 2 A/277
# 7 A277A79 A277/A79 2 A/277/A79
# 8 A277A79091 A277A79/091 2 A/277/A79/091
# 9 A277A790911 A277A79091/1 2 A/277/A79/091/1
#10 A277A79091144 A277A790911/44 2 A/277/A79/091/1/44
The gr column is kept in the final output to make clear how the groups are created.
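To see how the grouping index comes about, here is a small sketch on a made-up child vector (the first three rows of one branch followed by two rows of the next). A new group starts whenever the string length drops compared with the previous row.

```r
# Hypothetical excerpt of the child column
child <- c("A268", "A268497", "A268497BOX", "A277", "A277A79")

len <- nchar(child)                      # 4 7 10 4 7
starts_group <- c(TRUE, diff(len) < 0)   # TRUE where the length drops (or at row 1)
gr <- cumsum(starts_group)               # running count of group starts

print(data.frame(child, len, starts_group, gr))
```

Note that rows of equal length (siblings under the same parent) do not start a new group, so this index only matches the intended branches when the lengths strictly increase inside each branch, as the answer assumes.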
We can also implement the same logic in base R using Reduce:
apply_fun <- function(x, y) sub(".*(/.*)", paste0(x, "\\1"), y)
df$ans <- with(df, ave(pathString, cumsum(c(TRUE, diff(nchar(child)) < 0)),
FUN = function(x) Reduce(apply_fun, x, accumulate = TRUE)))
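To make the accumulation step easier to follow, here is the same helper run on a single made-up group of three pathString values. On each step the regex keeps only the last "/segment" of the current value and glues it onto the path accumulated so far.

```r
# Same helper as above: x is the accumulated path, y the current pathString
apply_fun <- function(x, y) sub(".*(/.*)", paste0(x, "\\1"), y)

# Hypothetical single group, in row order
steps <- c("A/268", "A268/497", "A268497/BOX")

# accumulate = TRUE keeps every intermediate result, not just the last one
paths <- Reduce(apply_fun, steps, accumulate = TRUE)
print(paths)
```

Because every intermediate result has length one, Reduce returns a plain character vector here, which is why it can be used directly inside ave().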
I managed to get it done with the following code block, but the loop takes 75-80 seconds, and I suspect there is a faster way:
for (row in seq_len(nrow(df5))) {
  x <- df5[row, 2]        # child
  y <- df5[row, 3]        # pathString
  g <- df5[row, "gr"]
  # replace the raw id with its path in every row of the same group
  df5$pathString[df5$gr == g] <- sub(x, y, df5$pathString[df5$gr == g])
  df5$child[df5$gr == g] <- sub(x, y, df5$child[df5$gr == g])
}
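One possible alternative to rewriting the whole group on every iteration, not taken from any of the answers, is to build each full path from its parent's path using a named lookup vector. This is only a sketch on a made-up five-row frame. It assumes every child string starts with its parent string and that parents are always shorter than their children. The names seg, roots and full are hypothetical, and I have not timed it against the loop above.

```r
# Hypothetical excerpt with the child and parent columns
toy <- data.frame(
  child  = c("A268", "A268497", "A268497BOX", "A277", "A277A79"),
  parent = c("A", "A268", "A268497", "A", "A277"),
  stringsAsFactors = FALSE
)

# The piece each child adds to its parent (assumes child starts with parent)
seg <- substring(toy$child, nchar(toy$parent) + 1)

# Roots are parents that never appear as a child; their path is themselves
roots <- setdiff(toy$parent, toy$child)
full <- setNames(roots, roots)           # lookup: node id -> full path

# Visit shorter ids first so a parent's path exists before its children need it
for (i in order(nchar(toy$child))) {
  full[toy$child[i]] <- paste0(full[toy$parent[i]], "/", seg[i])
}

toy$ans <- unname(full[toy$child])
print(toy)
```

Since each row only looks up its own parent, this does not depend on the string lengths increasing from row to row, and sibling rows are handled the same way as any other.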
Note that gr is populated from the rows where clvl = 2:
library(zoo)
df4$gr <- ifelse(df4$clvl==2,df4$child,NA)
df4$gr <- na.locf(df4$gr)
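For readers without zoo, the same forward fill can be sketched in base R. This is an assumption-laden illustration on made-up vectors: it relies on the first row having clvl equal to 2, otherwise the leading rows would have no group to inherit.

```r
# Hypothetical excerpt of the clvl and child columns
clvl  <- c(2, 3, 4, 2, 3)
child <- c("A268", "A268497", "A268497BOX", "A277", "A277A79")

is_top <- clvl == 2                      # rows that start a new group
# cumsum(is_top) counts how many level-2 rows have been seen so far,
# which is the index of the level-2 child each row belongs to
gr <- child[is_top][cumsum(is_top)]

print(data.frame(child, clvl, gr))
```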
This is how df4 was created:
df4 <- sqldf("select *, parent || replace(child,parent,'/') AS pathString FROM df ORDER BY child")
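A rough base R counterpart of that query, shown on a made-up three-row frame, might look like this. It is not an exact translation: the SQL replace() substitutes every occurrence of parent inside child, whereas this sketch only strips parent as a leading prefix, so it assumes child always starts with parent. method = "radix" is used so that the sort follows C-locale ordering rather than the session locale.

```r
# Hypothetical excerpt with the child and parent columns, deliberately unsorted
toy <- data.frame(
  child  = c("A268497", "A268", "A277"),
  parent = c("A268", "A", "A"),
  stringsAsFactors = FALSE
)

# parent, a slash, then whatever the child adds after the parent prefix
toy$pathString <- paste0(
  toy$parent, "/", substring(toy$child, nchar(toy$parent) + 1)
)

# Equivalent of ORDER BY child
toy <- toy[order(toy$child, method = "radix"), ]
print(toy)
```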