Remove columns from data frame if any row contains a specific string

I would like to remove every column that contains the string -- in any row.

Number  138 139 140 141 143 144 147 148 149 150 151 152 14  15  N…  
nm4804  A   B   --  A   B   A   A   --  A   A   A   A   A   --  A  
nm7574  B   A   A   A   A   A   A   A   A   A   A   A   A   --  A
nm8723  B   --  B   B   B   --  A   --  B   B   B   B   --  --  A
N…      B   A   A   A   A   B   A   --  A   A   B   --  --  --  A

I would also like to count how often -- occurs in each column: any column in which more than 50% of the values are -- should be removed.

Desired result:

Number  138 140 141 143 147 149 150 151 N…
nm4804  A   A   --  B   A   A   A   A   A
nm7574  B   A   A   A   A   A   A   A   A
nm8723  B   B   A   B   --  B   B   B   A
N…      B   A   A   A   A   A   A   B   A

Data (thanks bgoldst)

df <- data.frame(
  Number = c('nm4804','nm7574','nm8723','N…'),
  `138` = c('A','B','B','B'),     `139` = c('B','A','--','A'),
  `140` = c('--','A','B','A'),    `141` = c('A','A','B','A'),
  `143` = c('B','A','B','A'),     `144` = c('A','A','--','B'),
  `147` = c('A','A','A','A'),     `148` = c('--','A','--','--'),
  `149` = c('A','A','B','A'),     `150` = c('A','A','B','A'),
  `151` = c('A','A','B','B'),     `152` = c('A','A','B','--'),
  `14`  = c('A','A','--','--'),   `15`  = c('--','--','--','--'),
  `N…`  = c('A','A','A','A'),
  check.names = FALSE, stringsAsFactors = FALSE
)

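Use colSums(): df == '--' gives a logical matrix, and colSums() counts the matches in each column, so keeping the columns whose count is 0 drops every column that contains '--'.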

df[,colSums(df=='--')==0]
##   Number 138 141 143 147 149 150 151 N…
## 1 nm4804   A   A   B   A   A   A   A  A
## 2 nm7574   B   A   A   A   A   A   A  A
## 3 nm8723   B   B   B   A   B   B   B  A
## 4     N…   B   A   A   A   A   A   B  A
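The 50% rule from the question can be handled in the same style by swapping colSums() for colMeans(), since the mean of a logical column is the fraction of TRUE values. A minimal sketch, assuming a small made-up data frame (toy and its column names are illustrative, not the OP's data):

```r
# Toy data (hypothetical); every column is character
toy <- data.frame(
  id = c("r1", "r2", "r3", "r4"),
  a  = c("A", "B", "A", "B"),      # 0% "--"
  b  = c("--", "A", "--", "A"),    # 50% "--"
  c  = c("--", "--", "--", "A"),   # 75% "--"
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE where a cell equals "--"
is_dash <- toy == "--"

# Fraction of "--" in each column
frac <- colMeans(is_dash)

# Keep columns where at most half of the values are "--";
# drop = FALSE keeps a data frame even if only one column survives
kept <- toy[, frac <= 0.5, drop = FALSE]
print(names(kept))
```

Column b sits at exactly 50% and is kept, because the question asks to remove only columns with more than 50%.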

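We can also use Filter (df1 here is the question's data, presumably created with the default check.names = TRUE, which is why the numeric column names carry an X prefix in the output):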

Filter(function(x) !any(x=="--"), df1)
#    Number X138 X141 X143 X147 X149 X150 X151 N…
#1 nm4804    A    A    B    A    A    A    A  A
#2 nm7574    B    A    A    A    A    A    A  A
#3 nm8723    B    B    B    A    B    B    B  A
#4     N…    B    A    A    A    A    A    B  A

If we need to remove the columns in which more than 50% of the values are --:

Filter(function(x) mean(x == '--') <= 0.5, df1)

NOTE: With the example data as posted above, this removes columns 148 (3 of 4 values are --) and 15 (4 of 4); column 14 sits at exactly 50% and is kept.
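One caveat with a predicate built on ==: if a column contains NA, x == '--' yields NA for that cell, mean() then returns NA, and Filter() drops the column because it keeps only elements whose predicate is TRUE. A hedged sketch of a variant using %in%, which returns FALSE rather than NA for missing values (toy is made-up data, not the OP's):

```r
# Toy data (hypothetical); column b contains a missing value
toy <- data.frame(
  a = c("A", "--", "A", "A"),     # 25% "--"
  b = c("A", NA, "A", "A"),       # no "--", one NA
  c = c("--", "--", "--", "A"),   # 75% "--"
  stringsAsFactors = FALSE
)

# Predicate based on ==: the NA in column b makes mean() return NA,
# and Filter() treats an NA predicate as "drop"
naive <- Filter(function(x) mean(x == "--") <= 0.5, toy)

# %in% returns FALSE for NA, so the fraction is always defined
safe <- Filter(function(x) mean(x %in% "--") <= 0.5, toy)

print(names(naive))
print(names(safe))
```

Whether an NA should count toward the denominator is a design choice; here it does, so a column of mostly NA values is kept.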

Since it is unclear in the question, I'm assuming that nm4804 and such are row names, and 138..152 are column names, not actual data. With that, I'm guessing that this is a character matrix. Your data:

dat <- structure(c("A", "B", "B", "B", "B", "A", "--", "A", "--", "A", 
"B", "A", "A", "A", "B", "A", "B", "A", "B", "A", "A", "A", "--", 
"B", "A", "A", "A", "A", "--", "A", "--", "--", "A", "A", "B", 
"A", "A", "A", "B", "A", "A", "A", "B", "B", "A", "A", "B", "--", 
"A", "A", "--", "--", "--", "--", "--", "--", "A", "A", "A", 
"A"), .Dim = c(4L, 15L), .Dimnames = list(c("nm4804", "nm7574", 
"nm8723", "N..."), c("138", "139", "140", "141", "142", "143", 
"144", "145", "146", "147", "148", "149", "150", "151", "152"
)))

Try this:

dat[,! apply(dat, 2, `%in%`, x = "--")]
#        138 141 142 144 146 147 148 152
# nm4804 "A" "A" "B" "A" "A" "A" "A" "A"
# nm7574 "B" "A" "A" "A" "A" "A" "A" "A"
# nm8723 "B" "B" "B" "A" "B" "B" "B" "A"
# N...   "B" "A" "A" "A" "A" "A" "B" "A"
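The same selection can also be written without apply(), because == on a character matrix is already vectorized. A minimal sketch on a made-up matrix (m and its dimnames are hypothetical), including the 50% threshold variant:

```r
# Toy character matrix (hypothetical names), filled column by column
m <- matrix(
  c("A", "B",      # column x: no "--"
    "--", "A",     # column y: 50% "--"
    "--", "--"),   # column z: 100% "--"
  nrow = 2,
  dimnames = list(c("r1", "r2"), c("x", "y", "z"))
)

# Columns that contain "--" at least once
has_dash <- colSums(m == "--") > 0

# Drop them; drop = FALSE keeps the result a matrix
clean <- m[, !has_dash, drop = FALSE]

# Threshold variant: drop only columns that are more than 50% "--"
mostly_ok <- m[, colMeans(m == "--") <= 0.5, drop = FALSE]
```

Without drop = FALSE, a selection that leaves a single column would collapse to a plain character vector and lose its column name.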

Here is a dplyr approach, using the 'dat' matrix proposed by @r2evans:

This removes all columns in which more than 50% of the values are '--':

library(dplyr)

dat %>%
  as.data.frame() %>%
  # keep a column unless more than half of its values are "--"
  select_if(~ !(sum(. == "--") / length(.) > 0.5))
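In dplyr 1.0.0 and later, the scoped verbs such as select_if() are superseded by select() combined with where(). A sketch of the same filter in that style, assuming a recent dplyr is installed; toy is made-up data used only for illustration:

```r
library(dplyr)

# Toy data (hypothetical), all character columns
toy <- data.frame(
  a = c("A", "B", "A", "B"),      # 0% "--"
  b = c("--", "A", "--", "A"),    # 50% "--"
  c = c("--", "--", "--", "A"),   # 75% "--"
  stringsAsFactors = FALSE
)

# Keep columns where at most half of the values are "--"
kept <- toy %>%
  select(where(~ mean(.x == "--") <= 0.5))

print(names(kept))
```

Using mean() on the logical vector is equivalent to sum(.) / length(.) and reads a little more directly as "fraction of matches".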
