Applying a function to each row of a data.table

I am looking for a way to efficiently apply a function to each row of a data.table. Consider the following data table:

library(data.table)
library(stringr)

x <- data.table(a = c(1:3, 1), b = c('12 13', '14 15', '16 17', '18 19'))
> x
   a     b
1: 1 12 13
2: 2 14 15
3: 3 16 17
4: 1 18 19

Say I want to split each element of column b on spaces (yielding two rows for each row of the original data) and stack the resulting data tables. For the example above, I need the following result:

   a V1
1: 1 12
2: 1 13
3: 2 14
4: 2 15
5: 3 16
6: 3 17
7: 1 18
8: 1 19

The following would work if column a had only unique values:

x[, list(str_split(b, ' ')[[1]]), by = a]
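To make the failure mode concrete, here is a minimal sketch of what happens with the example data, where a = 1 appears twice. It uses base strsplit in place of str_split purely to avoid the stringr dependency; that substitution is an assumption of this sketch, not part of the question.

```r
library(data.table)

x <- data.table(a = c(1:3, 1), b = c("12 13", "14 15", "16 17", "18 19"))

# Grouping by `a` puts rows 1 and 4 in one group, so inside that group
# `b` is c("12 13", "18 19") and `[[1]]` keeps only the first split.
res <- x[, list(V1 = strsplit(b, " ", fixed = TRUE)[[1]]), by = a]

print(res)
nrow(res)            # 6 rows rather than the 8 wanted
"18" %in% res$V1     # FALSE: the "18 19" row was dropped
```

The second row with a = 1 is silently lost, which is why grouping by a is only safe when a is unique.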

The following almost works (unless the original data table contains identical rows), but it is ugly when x has many columns, and it copies column b into the result, which I would like to avoid:

> x[, list(str_split(b, ' ')[[1]]), by = list(a,b)]
   a     b V1
1: 1 12 13 12
2: 1 12 13 13
3: 2 14 15 14
4: 2 14 15 15
5: 3 16 17 16
6: 3 16 17 17
7: 1 18 19 18
8: 1 18 19 19

What would be the most efficient and idiomatic way to solve this problem?

How about:

x
   a     b
1: 1 12 13
2: 2 14 15
3: 3 16 17
4: 1 18 19

x[,list(a=rep(a,each=2), V1=unlist(strsplit(b," ")))]
   a V1
1: 1 12
2: 1 13
3: 2 14
4: 2 15
5: 3 16
6: 3 17
7: 1 18
8: 1 19

Generalized solution, following a comment:

x[,{s=strsplit(b," ");list(a=rep(a,sapply(s,length)), V1=unlist(s))}]

The following looks more aesthetic (by = .I needs a recent data.table release):

x[, .(a, V1 = strsplit(b, ' ')[[1]]), by = .I]
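The generalized split-and-repeat expression above can be checked on rows with differing numbers of values. The sketch below is one way to write it, using lengths() as a shorthand for sapply(s, length); the table x2 is made-up test data for illustration.

```r
library(data.table)

# Hypothetical input with 2, 1, 3 and 2 values per row.
x2 <- data.table(a = c(1:3, 1), b = c("12 13", "14", "15 16 17", "18 19"))

long <- x2[, {
  s <- strsplit(b, " ", fixed = TRUE)   # list with one character vector per row
  list(a = rep(a, lengths(s)),          # repeat each `a` once per piece
       V1 = unlist(s))                  # flatten the pieces in row order
}]
print(long)
```

Because the whole column is split in a single call, no grouping is needed and duplicated values of a cause no trouble.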

One option would be to add a row number

x[, r := 1:nrow(x)]

and then group by r:

x[, list(a, str_split(b, ' ')[[1]]), by = r]
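If adding a permanent column to x is undesirable, the same row-number grouping can be done with an index computed inside by. This is only a sketch: the helper name r follows the text above, and base strsplit stands in for str_split as an assumption to keep the example dependency-free.

```r
library(data.table)

x <- data.table(a = c(1:3, 1), b = c("12 13", "14 15", "16 17", "18 19"))

# Group by a temporary row index `r` built on the fly; `x` itself is not modified.
res <- x[, .(a, V1 = strsplit(b, " ", fixed = TRUE)[[1]]),
         by = .(r = seq_len(nrow(x)))]
res[, r := NULL]     # drop the helper index from the result only
print(res)
```

Each group holds exactly one row, so `[[1]]` always picks that row's pieces and the single value of a is recycled to match.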

I'm wondering if there are better solutions?

The most effective and idiomatic approach is to use a vectorized function.

In this case, a regular expression will do the job (here keeping only the first value of b):

 x[, V1 := gsub(" [[:alnum:]]*", "", b)]

   a     b V1
1: 1 12 13 12
2: 2 14 15 14
3: 3 16 17 16
4: 1 18 19 18

If you want to return each split component, and you know there are two in each element, you can use Map to coerce the result of strsplit into the correct form:

x[, c('b1','b2')  := do.call(Map, c(f = c, strsplit(b, ' ')))]



x
   a     b b1 b2
1: 1 12 13 12 13
2: 2 14 15 14 15
3: 3 16 17 16 17
4: 1 18 19 18 19

x[, .(a, V1 = strsplit(b, ' ')[[1]]), by = 1:nrow(x)]

by = 1:nrow(x) is a simple way to force one row per group.
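For the two-column split done earlier with do.call(Map, ...), data.table also provides tstrsplit(), which performs the split and the transpose in one call. A minimal sketch, assuming every element of b has the same number of pieces:

```r
library(data.table)

x <- data.table(a = c(1:3, 1), b = c("12 13", "14 15", "16 17", "18 19"))

# tstrsplit() splits each string and transposes the result, returning one
# vector per piece, which := can assign straight to new columns.
x[, c("b1", "b2") := tstrsplit(b, " ", fixed = TRUE)]
print(x)
```

The number of names on the left of := has to match the number of pieces, so this form fits the fixed-width case rather than the long format asked for in the question.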

The dplyr / tidyr approach also works with data tables.

library(dplyr)
library(tidyr)
x %>% 
  separate(b, into = c("b1", "b2")) %>% 
  gather(b, "V1", b1:b2) %>%
  arrange(V1) %>%
  select(a, V1)

Or, using the standard evaluation forms (these underscore-suffixed verbs have since been deprecated in dplyr and tidyr):

x %>% 
  separate_("b", into = c("b1", "b2")) %>% 
  gather_("b", "V1", c("b1", "b2")) %>%
  arrange_(~ V1) %>%
  select_(~ a, ~ V1)

The case of different numbers of values in the b column is only slightly more complicated.

library(stringr)

x2 <- data.table(
  a = c(1:3, 1), 
  b = c('12 13', '14', '15 16 17', '18 19')
)

n <- max(str_count(x2$b, " ")) + 1
b_cols <- paste0("b", seq_len(n))
x2 %>% 
  separate_("b", into = b_cols, extra = "drop") %>% 
  gather_("b", "V1", b_cols) %>%
  arrange_(~ V1) %>%
  select_(~ a, ~ V1)
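As a possible shortcut for the uneven case, tidyr also has separate_rows(), which goes straight to the long format without building intermediate b1, b2, ... columns. The sketch below assumes a tidyr version that still exports it (recent releases mark it as superseded by separate_longer_delim()), and it uses a plain data frame to keep the example independent of data.table.

```r
library(tidyr)

# Same uneven values as x2 above, here as a plain data frame.
x2 <- data.frame(a = c(1:3, 1),
                 b = c("12 13", "14", "15 16 17", "18 19"),
                 stringsAsFactors = FALSE)

# One output row per space-separated piece of b; a is repeated as needed.
long <- separate_rows(x2, b, sep = " ")
print(long)
```

The split values stay in a column named b, and rows keep their original order, so no arrange() step is needed.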

Looking at the input and the desired output, this should work:

x <- data.frame(a=c(1,2,3,1),b=c("12 13","14 15","16 17","18 19"))
data.frame(a=rep(x$a,each=2), new_b=unlist(strsplit(as.character(x$b)," ")))
