Column counting in R. Just started using it for GWAS and I am lost

Can anyone help me work out how to count the number of instances of a character in a cell, row by row? I have a file with 10 million SNPs that I want to sort.

Direction
?????+-+-
?+-+-????
?-+-+??-+

Above is an example of one of the many columns that I have. What I want to do is count the number of "?" characters in each row individually and add a new column with that count as a numerical value.

I'm a total beginner thrown in at the deep end with this, so any help would be appreciated.

Thanks.

Two answers for you:

a <- data.frame(direction = c("?????+-+-", "?+-+-????", "?-+-+??-+"),
                stringsAsFactors = FALSE)
a$return <- lengths(regmatches(a$direction, gregexpr("\\?", a$direction)))

or, as per the comments:

a$return <- nchar(gsub("[^?]", "", a$direction))

Both give the same result, shown here with str(a):

'data.frame':   3 obs. of  2 variables:
 $ direction: chr  "?????+-+-" "?+-+-????" "?-+-+??-+"
 $ return   : int  5 5 3

There are many ways to do this, depending on what you're looking for.
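As one more variation on the two answers above, the counting can be wrapped in a small helper that matches the character literally (fixed = TRUE), so "?" and "+" need no regex escaping. This is only a sketch: the name count_char and the column name n_question are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: count literal occurrences of `ch` in each element of `x`.
# fixed = TRUE makes gregexpr treat `ch` as plain text, not as a regex.
count_char <- function(x, ch) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

a <- data.frame(direction = c("?????+-+-", "?+-+-????", "?-+-+??-+"),
                stringsAsFactors = FALSE)

# New numeric column holding the per-row count of "?"
a$n_question <- count_char(a$direction, "?")

# Cross-check against the gsub/nchar approach from the answer
stopifnot(all(a$n_question == nchar(gsub("[^?]", "", a$direction))))
```

The same helper can be reused for the other symbols, for example count_char(a$direction, "+").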

Update

While they are not base R, the packages in the tidyverse are useful for data wrangling and make it easy to string a few calls together.

install.packages("dplyr")
library(dplyr)
df <- data.frame(Direction = c("???????????-?", "???????????+?", "???????????+?", "???????????-?"), stringsAsFactors = FALSE)
df %>% 
  mutate(qmark = nchar(gsub("[^?]", "", Direction)),
         pos = nchar(gsub("[^+]", "", Direction)),
         neg = nchar(gsub("[^-]", "", Direction)),
         qminus = qmark-(pos+neg),
         total = nchar(Direction))  


      Direction qmark pos neg qminus total
1 ???????????-?    12   0   1     11    13
2 ???????????+?    12   1   0     11    13
3 ???????????+?    12   1   0     11    13
4 ???????????-?    12   0   1     11    13

If your dataset is 10 million lines long, however, you might want to use stringi, based on some benchmark testing.

install.packages("stringi")
library(stringi)
df %>% 
  mutate(qmark = stri_count(Direction, fixed = "?"),
         pos = stri_count(Direction, fixed = "+"),
         neg = stri_count(Direction, fixed = "-"), 
         qminus = qmark-(pos+neg))
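Since the question starts from a file rather than a hand-built data frame, here is a minimal base R sketch of the full round trip: read the table, add the count column, write it back out. The file layout (tab-delimited, with a header), the SNP column, and the rs1/rs2/rs3 identifiers are all assumptions made up for illustration; a temporary file stands in for the real data.

```r
# Build a tiny stand-in file (hypothetical layout: tab-delimited with a header).
infile <- tempfile(fileext = ".txt")
writeLines(c("SNP\tDirection",
             "rs1\t?????+-+-",
             "rs2\t?+-+-????",
             "rs3\t?-+-+??-+"), infile)

# Read every column as text so the Direction strings are left untouched.
snps <- read.table(infile, header = TRUE, sep = "\t",
                   colClasses = "character", quote = "", comment.char = "")

# Add the per-row count of "?" as a new integer column.
snps$qmark <- nchar(gsub("[^?]", "", snps$Direction))

# Write the result, including the new column, to another temporary file.
outfile <- tempfile(fileext = ".txt")
write.table(snps, outfile, sep = "\t", quote = FALSE, row.names = FALSE)
```

Once the count is a numeric column, the table can be sorted on it, for example with snps[order(snps$qmark), ].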

