Return one variable based on match of another across groups in data.table

I'm new to data.table and don't fully understand it. Suppose I have the following table of ngrams:

require(data.table)
DT<-data.table(
  ngram=c("how","how are","how are you","how are you doing"),
  Freq=c(15000,1500,150,15),
  n=c(1,2,3,4),
  w1=c(37,37,37,37),
  w2=c(NA,13,13,13),
  w3=c(NA,NA,7,7),
  w4=c(NA,NA,NA,95)
)

> DT
               ngram  Freq n w1 w2 w3 w4
1:               how 15000 1 37 NA NA NA
2:           how are  1500 2 37 13 NA NA
3:       how are you   150 3 37 13  7 NA
4: how are you doing    15 4 37 13  7 95

Here n denotes the type of ngram (e.g. 1 = unigram, 2 = bigram, etc.), w1 through w4 are integer indexes of the words in each ngram, and Freq is the number of times the ngram occurs in the data.

How would I get the Freq of one ngram based on a match of one word in that ngram with one word in another ngram? For example, for the bigram (n=2) 'how are', how would I get the Freq of the unigram 'how' by matching w1 of 'how are' with w1 of 'how'? Or, for the trigram 'how are you', how would I get the Freq of the bigram 'how are' by matching w1+w2 of 'how are you' with w1+w2 of 'how are'?

I've tried, for example:

DT[n==2,B:=Freq[match(w1[n==1],w1[n==2])]]

and

DT[n==2,B:=Freq[which(w1[n==1]==w1[n==2])]]

But I get only NAs:

               ngram  Freq n w1 w2 w3 w4  B
1:               how 15000 1 37 NA NA NA NA
2:           how are  1500 2 37 13 NA NA NA
3:       how are you   150 3 37 13  7 NA NA
4: how are you doing    15 4 37 13  7 95 NA

I would like to get:

               ngram  Freq n w1 w2 w3 w4     B
1:               how 15000 1 37 NA NA NA    NA
2:           how are  1500 2 37 13 NA NA 15000
3:       how are you   150 3 37 13  7 NA  1500
4: how are you doing    15 4 37 13  7 95   150

Any help greatly appreciated!

You can go through row by row, find the 'w' columns to use as join keys, and then join on those w columns against the rows whose n is smaller than that of the current row:

DT[, B := 
    {
        k <- as.integer(.BY) - 1L
        if (k > 0) {
            nm <- head(grep("^w", names(.SD)[!is.na(.SD)], value=TRUE), k)
            DT[n < .BY][.SD, x.Freq, on=nm]
        } else NA_real_
    },
    by=.(n)]

Output:

               ngram  Freq n w1 w2 w3 w4     B
1:               how 15000 1 37 NA NA NA    NA
2:           how are  1500 2 37 13 NA NA 15000
3:       how are you   150 3 37 13  7 NA  1500
4: how are you doing    15 4 37 13  7 95   150
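
To see what the join does for a single group, here is a minimal sketch of the trigram case on its own. The names `trigrams`, `shorter`, `keys` and `B_tri` are illustrative and not part of the answer above; the sketch assumes the toy table from the question.

```r
library(data.table)

# Same toy table as in the question.
DT <- data.table(
  ngram = c("how", "how are", "how are you", "how are you doing"),
  Freq  = c(15000, 1500, 150, 15),
  n     = c(1, 2, 3, 4),
  w1    = c(37, 37, 37, 37),
  w2    = c(NA, 13, 13, 13),
  w3    = c(NA, NA, 7, 7),
  w4    = c(NA, NA, NA, 95)
)

trigrams <- DT[n == 3]      # rows whose B we want to fill
shorter  <- DT[n < 3]       # candidate rows: unigrams and bigrams
keys     <- c("w1", "w2")   # the first n - 1 word columns of a trigram

# X[Y, j, on = keys]: for each row of Y (trigrams), find the rows of X (shorter)
# with equal key columns. The x. prefix selects Freq from X, not from Y.
B_tri <- shorter[trigrams, x.Freq, on = keys]
print(B_tri)
```

Only the bigram 'how are' has both w1 = 37 and w2 = 13 among the shorter ngrams, so the lookup returns its Freq. The grouped code above repeats this lookup for each value of n, choosing the key columns per group.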

Trimmed code after Frank's comments:

DT[, B := 
    {
        if (n > 1L) {
            nm <- head(grep("^w", names(.SD)[!is.na(.SD)], value=TRUE), n-1L)
            DT[n==.BY$n-1L][.SD, x.Freq, on=nm]
        }
    },
    by=.(n)]

A variation on chinsoon's answer, overwriting the nth word with NA before joining:

wcols = paste0("w", 1:4)    
DT[, v := 
  DT[n == .BY$n - 1L][replace(.SD, .BY$n, NA_real_), on=wcols, x.Freq]
, by=n, .SDcols=wcols]

Note that this approach, while more concise, is probably less efficient, since I am joining on all columns instead of just n-1.
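
The idea can be traced by hand for one group. The following is a sketch only, assuming the toy table from the question; `probe` and `B_tri` are illustrative names, and the explicit `w3 := NA_real_` stands in for the `replace()` call used in the grouped version. It relies on data.table matching NA to NA in join columns.

```r
library(data.table)

# Same toy table as in the question.
DT <- data.table(
  ngram = c("how", "how are", "how are you", "how are you doing"),
  Freq  = c(15000, 1500, 150, 15),
  n     = c(1, 2, 3, 4),
  w1    = c(37, 37, 37, 37),
  w2    = c(NA, 13, 13, 13),
  w3    = c(NA, NA, 7, 7),
  w4    = c(NA, NA, NA, 95)
)

wcols <- paste0("w", 1:4)

# Take the word columns of the trigram row and blank out its 3rd word,
# so the row looks like its own bigram prefix: (37, 13, NA, NA).
probe <- DT[n == 3, wcols, with = FALSE]
probe[, w3 := NA_real_]

# Join on all four word columns against the bigram rows only.
B_tri <- DT[n == 2][probe, x.Freq, on = wcols]
print(B_tri)
```

Because every word column takes part in the join, the key is wider than strictly needed, which is the efficiency trade-off mentioned above.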

I keyed n, made B a subset of DT, and reversed the order of the match:

setkey(DT,n)
DT[.(2),B:=DT[,Freq[match(w1[n==2L],w1[n==1L],nomatch=NA)]]]

> DT
               ngram  Freq n w1 w2 w3 w4     B
1:               how 15000 1 37 NA NA NA    NA
2:           how are  1500 2 37 13 NA NA 15000
3:       how are you   150 3 37 13  7 NA  1500
4: how are you doing    15 4 37 13  7 95   150

This works quickly on a large data set.
