
Count features for different ids in columns in R in a faster way

I am trying to process a 20 GB data file in R. I have 16 GB of RAM and an i7 processor. I am reading the data using:

y<-read.table(file="sample.csv", header = TRUE, sep = ",", skip =0, nrows =50000000)

The dataset 'y' is as follows:

id    feature

21    234
21    290
21    234
21    7802
21    3467
21    234
22    235
22    235
22    1234
22    236
22    134
23    9133
23    223
23    245
23    223  
23    122
23    223 

So above is a sample dataset, which shows the different features for a particular id. I want to count how many times a particular feature listed in another dataset x has occurred for an id in y.

The dataset x is as follows:

id    feature

   21      234
   22      235
   23      223

And the final output that I want is as follows:

 id    feature_count

   21      3
   22      2
   23      3

As we see, 234 occurred thrice for 21, 235 occurred twice for 22 and 223 occurred thrice for 23.

For this I have tried getting the positions where each new id starts (e.g. the 1st, 7th and 12th positions for the above sample) and then counting a feature using a for loop as follows:

Getting positions of different ids:

positions=0
positions[1]=1
j=2
for(i in 1:50000000){
    if(y$id[i]!=y$id[i+1]){
    positions[j]=i+1
    j=j+1
  }
}

Since the data is huge, the looping takes a lot of time (for 50 million rows it takes 321 seconds on the PC described above, and I have 300 million rows).

Counting the features that match the given feature in 'x' (x is the data frame specified above, whose features are to be matched against those of y; on a match, feature_count is incremented):

for(i in 1 :length(positions)){
  for(j in positions[i]:positions[i+1]){
    if(y$feature[j]==x$feature[i]){         
       feature_count[i]=feature_count[i]+1
    }
  }
}

Are there any R functions that can collectively do this job for me faster? Also, iterating the for loop over "positions[i]:positions[i+1]" throws an error about NA arguments in the for loop. Please suggest the right way to do that too.
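A note on the NA error: `positions` holds one entry per id, so on the last iteration `positions[i+1]` does not exist and evaluates to NA; likewise `y$id[i+1]` is NA on the last row of the first loop. Below is a minimal vectorized sketch of the same two steps in base R, using the sample data from the post. The helper vector `ends` is introduced here for illustration, and the sketch keeps the loop's assumption that the rows of x are in the same order as the id blocks in y.

```r
# Sample data copied from the post
y <- data.frame(
  id      = c(21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23),
  feature = c(234, 290, 234, 7802, 3467, 234, 235, 235, 1234, 236, 134,
              9133, 223, 245, 223, 122, 223)
)
x <- data.frame(id = 21:23, feature = c(234, 235, 223))

# Row where each id block starts: row 1, plus every row whose id
# differs from the id on the previous row (no explicit loop needed)
positions <- c(1L, which(y$id[-1] != y$id[-nrow(y)]) + 1L)
# positions is 1 7 12

# Last row of each block; the final block ends at the last row of y,
# which avoids indexing positions[i + 1] past the end
ends <- c(positions[-1] - 1L, nrow(y))

# Count matches of x$feature[i] inside the i-th block
feature_count <- vapply(seq_along(positions), function(i) {
  rows <- positions[i]:ends[i]
  sum(y$feature[rows] == x$feature[i])
}, integer(1))
# feature_count is 3 2 3
```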

I would recommend the data.table package for this (fread is very fast!), then set up a loop that reads the file in chunks and stores the feature count sums. Here are some adapted lines of a function I have for looping over a file; it probably won't work as is, but you can get an idea of what to do:

require(data.table)
# your.file: path to the CSV; chunk: number of data rows read per pass
chunk <- 50000000
# Total number of lines in the file, header included (uses the Unix `wc` tool)
LineNu <- as.numeric(gsub(" .+","",system2("wc",paste("-l",your.file,sep=" "),stdout=TRUE, stderr=TRUE)))
DT <- fread(your.file,nrows=chunk,sep=",",header=TRUE)
KEEP.DT <- DT[,list("feature"=sum(length(feature))),by=id]
rm(DT) ; gc()
# Lines to skip before each later chunk: the header plus the rows already read
Starts <- seq(1,LineNu-1,by=chunk)[-1]
for (i in seq_along(Starts)) {
  cat(paste0("Filtering next chunk    ", i, " of ",length(Starts), " \n"))
  # Later chunks have no header line, so name the columns explicitly
  DT <- fread(your.file,skip=Starts[i],nrows=chunk,sep=",",header=FALSE,col.names=c("id","feature"))
  # Append the per-chunk counts, not the raw rows
  KEEP.DT <- rbind(KEEP.DT,DT[,list("feature"=sum(length(feature))),by=id])
  rm(DT) ; gc()
}

You may need to redo the DT[sum(length)] part, since some ids might be read in different chunks.
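To illustrate that last point, here is a small sketch that simulates chunked reading in memory. Blocks of 4 rows stand in for the 50-million-row chunks, and the names `row_chunks`, `partial` and `total` are made up for this example. An id that straddles a chunk boundary shows up once per chunk, so the partial counts have to be summed again by id.

```r
library(data.table)

# Sample data copied from the question
y <- data.table(
  id      = c(21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23),
  feature = c(234, 290, 234, 7802, 3467, 234, 235, 235, 1234, 236, 134,
              9133, 223, 245, 223, 122, 223)
)

# Pretend every block of 4 rows is one chunk read from disk
row_chunks <- split(seq_len(nrow(y)), (seq_len(nrow(y)) - 1) %/% 4)

# Per-chunk row counts by id (the column is called "feature" to mirror
# the answer's KEEP.DT, but it holds a count)
partial <- rbindlist(lapply(row_chunks, function(rows) {
  y[rows, list(feature = .N), by = id]
}))

# ids 21, 22 and 23 each appear in more than one chunk, so sum again
total <- partial[, list(feature = sum(feature)), by = id]
# total holds 6, 5 and 6 rows for ids 21, 22 and 23
```

Note that this counts all rows per id, as the answer's code does; restricting each chunk to the (id, feature) pairs listed in x before counting would give the counts the question asks for.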

I admit that I don't really understand the question the way it is written, but it sounds like "data.table" would be the way to go, and you should look into the .N symbol. As already mentioned, fread is going to be much better than read.csv, so I'll assume that you've read the data into a data.table named "DT".

Here's a small one:

DT <- data.table(id = c(rep(21, 6), rep(22, 5), 23, 23),
                 feature = c(234, 290, 234, 7802, 3467, 234, 235,
                             235, 1234, 236, 134, 9133, 223))
DT
#     id feature
#  1: 21     234
#  2: 21     290
#  3: 21     234
#  4: 21    7802
#  5: 21    3467
#  6: 21     234
#  7: 22     235
#  8: 22     235
#  9: 22    1234
# 10: 22     236
# 11: 22     134
# 12: 23    9133
# 13: 23     223

If you just wanted to count the number of each unique feature, you could do:

DT[, .N, by = "id,feature"]
#     id feature N
#  1: 21     234 3
#  2: 21     290 1
#  3: 21    7802 1
#  4: 21    3467 1
#  5: 22     235 2
#  6: 22    1234 1
#  7: 22     236 1
#  8: 22     134 1
#  9: 23    9133 1
# 10: 23     223 1

If you wanted the count of the first "feature", by "id", you could use:

DT[, .N, by = "id,feature"][, .SD[1], by = "id"]
#    id feature N
# 1: 21     234 3
# 2: 22     235 2
# 3: 23    9133 1

If you wanted to get the most frequently occurring "feature" by "id" (which is the same result as above, in this case), you can try the following:

DT[, .N, by = "id,feature"][, lapply(.SD, function(x) x[which.max(N)]), by = "id"]
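As a hedged variation on the line above, the same result can be sketched by subsetting `.SD` to the row with the largest count. The names `counts` and `top` are introduced here for illustration. `which.max` returns the first maximum, so ties (as for id 23 in this small example) go to the feature that appears first.

```r
library(data.table)

# The small example table from this answer
DT <- data.table(id = c(rep(21, 6), rep(22, 5), 23, 23),
                 feature = c(234, 290, 234, 7802, 3467, 234, 235,
                             235, 1234, 236, 134, 9133, 223))

# Count each (id, feature) pair, then keep the top row per id
counts <- DT[, .N, by = .(id, feature)]
top <- counts[, .SD[which.max(N)], by = id]
# top has one row per id: (21, 234, 3), (22, 235, 2), (23, 9133, 1)
```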

Update

Based on your new description, this seems much easier.

Just merge your datasets and aggregate the counts. Again, fast to do in "data.table":

DTY <- data.table(y, key = "id,feature")
DTX <- data.table(x, key = "id,feature")
DTY[DTX][, .N, by = id]
#    id N
# 1: 21 3
# 2: 22 2
# 3: 23 3

Or:

DTY[, .N, by = key(DTY)][DTX]
#    id feature N
# 1: 21     234 3
# 2: 22     235 2
# 3: 23     223 3

This assumes that "x" and "y" are defined as follows to begin with:

x <- structure(list(id = 21:23, feature = c(234L, 235L, 223L),
  counts = c(3L, 2L, 3L)), .Names = c("id", "feature", "counts"),
  row.names = c(NA, -3L), class = "data.frame")
y <- structure(list(id = c(21L, 21L, 21L, 21L, 21L, 21L, 22L, 22L, 
  22L, 22L, 22L, 23L, 23L, 23L, 23L, 23L, 23L), feature = c(234L,
  290L, 234L, 7802L, 3467L, 234L, 235L, 235L, 1234L, 236L, 134L,
  9133L, 223L, 245L, 223L, 122L, 223L)), .Names = c("id", "feature"),
  class = "data.frame", row.names = c(NA, -17L))
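For comparison, the same merge-then-aggregate idea can be sketched in base R with `merge()` and `aggregate()`. This is only an illustration on the sample data; it will be much slower than data.table on 300 million rows. The helper column `n` and the result name `feature_count` are made up for this example.

```r
# Sample data from the question (x without the precomputed counts column)
x <- data.frame(id = 21:23, feature = c(234L, 235L, 223L))
y <- data.frame(
  id      = c(21L, 21L, 21L, 21L, 21L, 21L, 22L, 22L, 22L, 22L, 22L,
              23L, 23L, 23L, 23L, 23L, 23L),
  feature = c(234L, 290L, 234L, 7802L, 3467L, 234L, 235L, 235L, 1234L,
              236L, 134L, 9133L, 223L, 245L, 223L, 122L, 223L)
)

# Inner join: keeps only the rows of y whose (id, feature) pair is listed in x
matched <- merge(y, x, by = c("id", "feature"))

# Count the surviving rows per id
matched$n <- 1L
feature_count <- aggregate(n ~ id, data = matched, FUN = sum)
names(feature_count)[2] <- "feature_count"
# feature_count has ids 21, 22, 23 with counts 3, 2, 3
```

An id in x with no matching rows in y is dropped by the inner join rather than reported with a count of 0.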

For your example:

apply(sign(table(y)), 1, sum)
21 22 23 
 4  4  2 

How about table()?

> set.seed(5)
> ids <- sample(1:3, 12, TRUE)
> features <- sample(1:4, 12, TRUE)
> cbind(ids, features)
      ids features
 [1,]   1        2
 [2,]   3        3
 [3,]   3        2
 [4,]   1        1
 [5,]   1        2
 [6,]   3        4
 [7,]   2        3
 [8,]   3        4
 [9,]   3        4
[10,]   1        3
[11,]   1        1
[12,]   2        1

> table(ids, features)
   features
ids 1 2 3 4
  1 2 2 1 0
  2 1 0 1 0
  3 0 1 1 3

So, for example, feature 4 appears 3 times in id 3.

EDIT: You can use as.data.frame() to "flatten" the table and get:

> as.data.frame(table(ids, features))
   ids features Freq
1    1        1    2
2    2        1    1
3    3        1    0
4    1        2    2
5    2        2    0
6    3        2    1
7    1        3    1
8    2        3    1
9    3        3    1
10   1        4    0
11   2        4    0
12   3        4    3
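Applied to the question's sample data, the counts for the specific (id, feature) pairs in x can be read straight out of the table with a character index matrix. This is a sketch only: `tab` and `counts` are names introduced here, and on the full data set a dense id-by-feature table may be far too large for memory, so treat it as an illustration of the idea rather than a recommendation for 300 million rows.

```r
# Sample data from the question
x <- data.frame(id = 21:23, feature = c(234, 235, 223))
y <- data.frame(
  id      = c(21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 23, 23, 23, 23, 23, 23),
  feature = c(234, 290, 234, 7802, 3467, 234, 235, 235, 1234, 236, 134,
              9133, 223, 245, 223, 122, 223)
)

# Cross-tabulate ids against features
tab <- table(y$id, y$feature)

# Look up one cell per row of x; a two-column character matrix is
# matched against the table's row and column names
counts <- tab[cbind(as.character(x$id), as.character(x$feature))]
# counts is 3 2 3
```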
