
Efficient way to compare time values over a huge dataset in R

I am using R to analyze Wikidata dumps. I previously extracted the variables I need from the XML dumps and wrote them to my own dataset, split into smaller CSV files. Here is what my files look like:

Q939818;35199259;2013-05-04T20:28:48Z;KLBot2;/* wbcreateclaim-create:2| */ [[Property:P373]], Tour de Pologne 2010
Q939818;72643278;2013-09-26T03:46:26Z;Coyau;/* wbcreateclaim-create:1| */[[Property:P107]]: [[Q1656682]]
Q939818;72643283;2013-09-26T03:46:28Z;Coyau;/* wbcreateclaim-create:1| */[[Property:P31]]: [[Q2215841]]
Q939818;90117273;2013-11-28T14:14:04Z;DanmicholoBot;/* wbsetlabel-add:1|nb */from the [no] label
Q939818;90117281;2013-11-28T14:14:07Z;DanmicholoBot;/* wbsetlabel-remove:1|no */
Q939818;92928394;2013-11-28T14:14:07Z;DanmicholoBot;/* wbsetlabel-remove:1|no */
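For reference, a minimal sketch of loading lines like these into a data frame. The column names "rev", "user" and "comment" are hypothetical labels chosen for illustration ("ID" and "time" match the names used in the answer below), and the sketch assumes the file has no header row and that no field contains a semicolon. Quoting and comment handling are switched off so that characters such as ' or # inside the edit summaries are read literally.

```r
# Two sample lines in the format shown above (semicolon-separated, no header).
lines <- c(
  "Q939818;72643278;2013-09-26T03:46:26Z;Coyau;/* wbcreateclaim-create:1| */[[Property:P107]]: [[Q1656682]]",
  "Q939818;72643283;2013-09-26T03:46:28Z;Coyau;/* wbcreateclaim-create:1| */[[Property:P31]]: [[Q2215841]]"
)

# For a real file, replace `text = lines` with the file path.
edits <- read.table(text = lines,
                    sep = ";",
                    header = FALSE,
                    quote = "",          # treat quote characters literally
                    comment.char = "",   # do not treat '#' as a comment
                    col.names = c("ID", "rev", "time", "user", "comment"),
                    colClasses = "character",
                    stringsAsFactors = FALSE)

# Parse the ISO 8601 timestamps; the trailing "Z" marks UTC.
edits$time <- as.POSIXct(edits$time, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
```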

Unfortunately, the extraction script sometimes skips tags, so in some lines the item ID (the first value) is missing and is replaced by "wikimedia page".

I would like to infer the missing item IDs by checking the time in the third column: if the time in the line with the missing value is earlier than the time in the following line, I can assume that the item ID is the same as in that following line (they are two revisions of the same item). Otherwise, the item ID is the same as in the previous line.

To do that, I wrote some code that first finds all the lines with "wikimedia page" in the first column and then applies the rule I have just described:

wikimedia_lines <- grep("wikimedia page", wikiedits_clean$V1)

for (i in wikimedia_lines){
  if (wikiedits_clean$time[i] < wikiedits_clean$time[i + 1]) {
     wikiedits_clean$V1[i] <- wikiedits_clean$V1[i + 1] 
  }
  else {wikiedits_clean$V1[i] <- wikiedits_clean$V1[i - 1] }
}

However, since my files are quite big (~6.5M lines), the loop takes a long time to execute. Is there a more idiomatic R solution (for example using apply or sapply) that could do this more efficiently?

Thank you.

I suggest the following:

data <- read.table(filename,
                   sep=";",
                   header=TRUE,
                   colClasses=c("character","character","character","character","character"))

data$time <- as.POSIXct(data$time,format="%Y-%m-%dT%H:%M:%S")

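# Rows where the ID is missing
m <- which( data$ID == "wikimedia page" )
# Of those, rows whose time is not earlier than the time in the next row
n <- m[which( data$time[m]-data$time[m+1] >= 0 )]

cleanData <- data

# Not earlier than the next row: copy the ID from the previous row
cleanData$ID[n]             <- data$ID[n-1]
# Earlier than the next row: copy the ID from the next row
cleanData$ID[setdiff(m,n)]  <- data$ID[setdiff(m,n)+1]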

"m" is the vector of row numbers where the ID is missing. "n" is the subset of the row numbers in "m" where the time is not earlier than the time in the next row.
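To make the roles of "m" and "n" concrete, here is a small worked sketch on made-up data (the IDs and times are hypothetical, chosen only to exercise both branches). It compares the times directly with >= rather than subtracting them, which should give the same selection.

```r
# Toy data: rows 2 and 5 have lost their item ID.
data <- data.frame(
  ID   = c("Q1", "wikimedia page", "Q1", "Q2", "wikimedia page", "Q3"),
  time = as.POSIXct(c("2013-05-04T20:28:48", "2013-05-04T20:30:00",
                      "2013-05-04T20:31:00", "2013-09-26T03:46:26",
                      "2013-09-26T03:46:28", "2013-01-01T00:00:00"),
                    format = "%Y-%m-%dT%H:%M:%S", tz = "UTC"),
  stringsAsFactors = FALSE
)

m <- which(data$ID == "wikimedia page")          # rows 2 and 5
n <- m[which(data$time[m] >= data$time[m + 1])]  # row 5: not earlier than row 6

cleanData <- data
cleanData$ID[n]             <- data$ID[n - 1]             # row 5 takes the ID of row 4
cleanData$ID[setdiff(m, n)] <- data$ID[setdiff(m, n) + 1] # row 2 takes the ID of row 3

print(cleanData$ID)
```

Row 2 is earlier than row 3, so it takes the next row's ID ("Q1"); row 5 is later than row 6, so it takes the previous row's ID ("Q2").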

If IDs are missing in consecutive rows, my previous solution cannot fill all the gaps. The following solution is more complicated, but it handles this case:

data <- read.table(filename,
                   sep=";",
                   header=TRUE,
                   colClasses=c("character","character","character","character","character"))

data$time <- as.POSIXct(data$time,format="%Y-%m-%dT%H:%M:%S")

m <- sort( which( data$ID == "wikimedia page" ) )
d <- diff(c(-1,m))
e <- diff(c(0,diff(m)==1,0))

b1 <- c(-Inf, m[which( e>0 | (d>1 & e==0) )], Inf)
b2 <- c(-Inf, m[which( e<0 | (d>1 & e==0) )], Inf)

k1 <- b1[unlist(lapply( m, function(x){ which.max(x<b1)-1 }))]
k2 <- b2[unlist(lapply( m, function(x){ which.max(x<=b2)  }))]

n1 <- which(((data$time[k2+1]-data$time[m]<0) & k1>1) | k2==nrow(data) )
n2 <- setdiff(1:length(m),n1)

cleanData <- data

cleanData$ID[m[n1]] <- data$ID[k1[n1]-1]
cleanData$ID[m[n2]] <- data$ID[k2[n2]+1]

As before, "m" is the vector of row numbers where the ID is missing. The vectors "b1" and "b2" contain the row numbers in "m" where a block of consecutive missing IDs starts and ends, respectively, i.e. the lower and upper bounds of these blocks (each padded with -Inf and Inf as sentinels). So "m" is the union of the intervals "b1[i]:b2[i]", taken over the finite entries of "b1" and "b2". The vectors "k1" and "k2" contain the same bounds, but they have the same length as "m", and "m[j]" lies in the block "k1[j]:k2[j]" for each index "j". The ID in row "m[j]" is set to the ID in either row "k1[j]-1" or row "k2[j]+1". Comparing the time in row "m[j]" with the time in row "k2[j]+1", which yields the index vectors "n1" and "n2", decides which of the two is chosen.
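As a cross-check, the same block-wise idea can be sketched with cumulative maxima and minima instead of explicit block bounds. This is an alternative formulation, not the code above: the function name fill_missing_ids is hypothetical, it assumes there are no NA values in the IDs or times, and it applies the rule from the question (a time strictly earlier than the next known row takes the next ID, otherwise the previous one), so ties may be resolved differently from the code above.

```r
fill_missing_ids <- function(id, time, missing = "wikimedia page") {
  n <- length(id)
  idx <- seq_len(n)
  is_missing <- id == missing

  # Nearest row with a known ID at or before each row (0 if there is none).
  prev_ok <- cummax(ifelse(is_missing, 0L, idx))
  # Nearest row with a known ID at or after each row (n + 1 if there is none).
  next_ok <- rev(cummin(rev(ifelse(is_missing, n + 1L, idx))))

  # Missing rows that have a known row somewhere after them.
  has_next <- is_missing & next_ok <= n
  earlier <- rep(FALSE, n)
  earlier[has_next] <- time[has_next] < time[next_ok[has_next]]

  # Take the next ID if the row is earlier than the next known row,
  # or if there is no known row before it; otherwise take the previous ID.
  take_next <- has_next & (earlier | prev_ok == 0L)
  take_prev <- is_missing & !take_next & prev_ok > 0L

  out <- id
  out[take_next] <- id[next_ok[take_next]]
  out[take_prev] <- id[prev_ok[take_prev]]
  out
}

# Rows 3 and 4 form a block of consecutive missing IDs; times are plain numbers here.
id   <- c("wikimedia page", "Q1", "wikimedia page", "wikimedia page", "Q2", "wikimedia page")
time <- c(10, 20, 40, 41, 35, 50)
print(fill_missing_ids(id, time))
```

In this toy input, row 1 has no known row before it and takes "Q1" from row 2; rows 3 and 4 are both later than row 5, so they fall back to "Q1" from row 2; row 6 has no row after it and takes "Q2" from row 5.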
