How to delete columns from a data.table based on values in column

Background

I have some financial data (1.5 years of S&P 500 stocks) that I have reshaped into a wide format using the data.table package. After following the whole data.table course on Datacamp, I'm starting to get the hang of the basics, but after searching for hours I'm at a loss on how to do this.

The Problem

The data contains one column of financial data for each stock. I need to delete the columns that contain two consecutive NAs.

My guess is that I have to use rle() and lapply() to find consecutive values, and DT[, x := NULL] to delete the columns.

I read that rle() doesn't work on NAs, so I changed them to Inf instead. I just don't know how to combine the functions so that I can efficiently remove a few columns among the 460 that I have.
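On the rle() point: rle() puts every NA in its own run, because comparing NA with NA does not give TRUE. Run-length encoding the logical vector is.na(x) instead avoids the problem, so recoding to Inf may not be needed. A minimal base R sketch, where x and has_consecutive_na are made-up names used only for illustration:

```r
# Hypothetical return series with two NAs in a row
x <- c(0.003, 0.006, NA, NA, -0.003)

# rle() on the raw vector puts every NA in its own run of length 1
raw_lengths <- rle(x)$lengths

# Encode the missingness pattern instead: runs of TRUE are runs of NA
r <- rle(is.na(x))

# TRUE if any run of NAs has length 2 or more
has_consecutive_na <- any(r$values & r$lengths >= 2)
has_consecutive_na
```

The same check works on Inf-coded data if is.na() is swapped for is.infinite().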

An answer using data.table would be great, but anything that works well is very much appreciated.

Alternatively, I would love to know how to remove columns containing at least one NA.

Example data

> test[1:5,1:5,with=FALSE]
         date     10104     10107     10138     10145
1: 2012-07-02  0.003199       Inf  0.001112 -0.012178
2: 2012-07-03  0.005873  0.006545  0.001428       Inf
3: 2012-07-05       Inf -0.001951 -0.011090       Inf
4: 2012-07-06       Inf -0.016775 -0.009612       Inf
5: 2012-07-09 -0.002742 -0.006129 -0.001294  0.005830
> dim(test)
[1] 377 461

Desired outcome

         date     10107     10138
1: 2012-07-02       Inf  0.001112
2: 2012-07-03  0.006545  0.001428
3: 2012-07-05 -0.001951 -0.011090
4: 2012-07-06 -0.016775 -0.009612
5: 2012-07-09 -0.006129 -0.001294

PS. This is my first question and I have tried to adhere to the rules. If I need to change anything, please let me know.

Here's an rle version:

# Keep a column only if it has no run of two or more NAs
dt[, sapply(dt, function(x)
       setDT(rle(is.na(x)))[, sum(lengths > 1 & values) == 0]), with = FALSE]

Or replace the is.na with is.infinite if you like.
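For reference, here is a sketch of the same run-length idea applied to the sample data from the question, deleting the columns by reference with := NULL as the question guessed. It is one possible variant rather than the answer's exact code: the helper has_run and the variables value_cols and drop_cols are hypothetical names, and !is.finite() is used so that NA, NaN and Inf all count as missing.

```r
library(data.table)

# Sample data from the question, with NAs already recoded as Inf
test <- data.table(
  date    = c("2012-07-02", "2012-07-03", "2012-07-05", "2012-07-06", "2012-07-09"),
  `10104` = c(0.003199, 0.005873, Inf, Inf, -0.002742),
  `10107` = c(Inf, 0.006545, -0.001951, -0.016775, -0.006129),
  `10138` = c(0.001112, 0.001428, -0.011090, -0.009612, -0.001294),
  `10145` = c(-0.012178, Inf, Inf, Inf, 0.005830)
)

# Hypothetical helper: TRUE if x has a run of two or more non-finite values
has_run <- function(x) {
  r <- rle(!is.finite(x))
  any(r$values & r$lengths >= 2)
}

# Check every column except date, then collect the offending column names
value_cols <- setdiff(names(test), "date")
drop_cols  <- value_cols[sapply(test[, value_cols, with = FALSE], has_run)]

# Delete those columns by reference (no copy of the table is made)
if (length(drop_cols) > 0) test[, (drop_cols) := NULL]

names(test)
```

On this sample, the columns left are date, 10107 and 10138, matching the desired outcome in the question.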

To detect and delete columns containing at least one NA, you can try the following:

data = data.frame(A=c(1,2,3,4,5), B=c(2,3,4,NA,6), C=c(3,4,5,6,7), D=c(4,5,NA,NA,8))

colsToDelete = lapply(data, FUN = function(x){ sum(is.na(x)) >= 1 })

data.formatted = data[,c(!unlist(colsToDelete))]
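Since the question asked for data.table, a rough equivalent might look like the sketch below. The names DT, keep and DT_clean are made up for this example; anyNA() is base R and with = FALSE is the data.table argument already used elsewhere in this thread.

```r
library(data.table)

# Same toy data as above, built as a data.table
DT <- data.table(A = c(1, 2, 3, 4, 5), B = c(2, 3, 4, NA, 6),
                 C = c(3, 4, 5, 6, 7), D = c(4, 5, NA, NA, 8))

# Names of the columns that contain no NA at all
keep <- names(DT)[!sapply(DT, anyNA)]

# Select those columns into a new table; DT itself is left unchanged
DT_clean <- DT[, keep, with = FALSE]
names(DT_clean)
```

Here only columns A and C survive, because B and D each contain at least one NA.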

The real issue is finding consecutive missing values. First, create a TRUE/FALSE matrix that marks where the values are NA. Then multiply each row of that matrix by the row below it: the product is 1 only where a missing value is directly followed by another missing value. Finally, keep the columns of the original data whose column sum of these products is 0.

Try this:

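# TRUE where a value is missing
Missing.Mat <- apply(test, 2, is.na)
# 1 where a missing value is directly followed by another missing value
Consecutive.Mat <- Missing.Mat[-nrow(Missing.Mat),] * Missing.Mat[-1,]
Keep.Cols <- colSums(Consecutive.Mat) == 0

# test is a data.table, so a logical column selector needs with = FALSE
test[, Keep.Cols, with = FALSE]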
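To see why the row-shift product works, here is a tiny worked example in base R. The matrix miss is invented for illustration and stands in for the missingness matrix above.

```r
# Hypothetical 4 x 3 missingness matrix (TRUE = missing)
miss <- cbind(a = c(FALSE, TRUE,  TRUE,  FALSE),
              b = c(TRUE,  FALSE, TRUE,  FALSE),
              c = c(FALSE, FALSE, FALSE, FALSE))

# Rows 1..3 times rows 2..4: 1 only where a missing value follows a missing value
consecutive <- miss[-nrow(miss), ] * miss[-1, ]

# Columns with no consecutive missing values
keep <- colSums(consecutive) == 0
keep
```

Column a has missing values in rows 2 and 3, so it is flagged for removal. Column b has two missing values that are not adjacent, so it is kept, as is column c.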

This is what I came up with. It calls rle on a vector y that is 1:length(column), except that wherever the corresponding element of the column is Inf, the value in y is set to zero. Since all other values of y are distinct, a run longer than 1 can only be a run of zeros, so the code checks whether any run is longer than 1.

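# Keep date, plus every column without two consecutive non-finite values
keep <- c(date = TRUE, apply(dat[, -1], 2,
              function(x) {
                y <- seq_along(x)          # distinct values, so every run has length 1
                y[!is.finite(x)] <- 0      # non-finite entries collapse into runs of 0
                return(!any(rle(y)$lengths > 1))
              }))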

dat2 <- dat[, keep]
dat2
#         date    X10107    X10138
# 1 2012-07-02       Inf  0.001112
# 2 2012-07-03  0.006545  0.001428
# 3 2012-07-05 -0.001951 -0.011090
# 4 2012-07-06 -0.016775 -0.009612
# 5 2012-07-09 -0.006129 -0.001294

Note that the column names are prefixed with an "X" by read.table.

Now, the dput of the data:

dat <- structure(list(date = c("2012-07-02", "2012-07-03", "2012-07-05", 
"2012-07-06", "2012-07-09"), X10104 = c(0.003199, 0.005873, Inf, 
Inf, -0.002742), X10107 = c(Inf, 0.006545, -0.001951, -0.016775, 
-0.006129), X10138 = c(0.001112, 0.001428, -0.01109, -0.009612, 
-0.001294), X10145 = c(-0.012178, Inf, Inf, Inf, 0.00583)), .Names = c("date", 
"X10104", "X10107", "X10138", "X10145"), class = "data.frame", row.names = c(NA, 
-5L))
