
Split long string list into comma delimited vector and then convert to a df row

I have gathered that this question is somewhat commonly asked, but I've hit a few snags that I can't seem to find an answer to.

I have a long string:

line1 = "GGCTTATTTAACGGGCAGATATACGCTGGGCAAATC..."

I want it to look like:

line1 = c("G", "G", "C", ...)

(As an aside, is it possible to represent letters like those above as integers? When I tried the function as.integer, it converted them all to NA.)

I have tried: strsplit(line1, "")

Which produces a list of: 'G''G''C'...

To solve this, I've tried: paste(line1, collapse = ", ")

Which sort of works: c(\"G\", \"G\", \"C"...)

When I tried to remove the ' \ ' with gsub, it didn't let me do it, as it suddenly treated everything in the script as being inside quotes.

Further, once this is done, I'd like to shape the result into either a row or a column of a data frame, like so:

   [1] [2] [3] ...
[1] G   G   C

Or:

   [1]
[1] G
[2] G
[3] C

After splitting, unlist the result, convert it to a factor and then to numeric:

fac <- factor(unlist(strsplit(line1, "")))
as.numeric(fac)
## [1] 5 5 4 6 6 3 6 6 6 3 3 4 5 5 5 4 3 5 3 6 3 6 3 4 5 4 6 5 5 5 4 3 3 3 6 4 1 2 2 2

# this gives the correspondence between numbers and characters
# i.e. space is 1, dot is 2, A is 3, C is 4, G is 5 and T is 6
levels(fac)
## [1] " " "." "A" "C" "G" "T"
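As a small sketch of how the integer codes relate to the letters: as.integer on the letters themselves gives NA because "G" is not a number, whereas a factor stores one integer code per element plus a lookup table of levels. The variable names dna, chars, codes and decoded are hypothetical, and the input is assumed to be the question's string without its trailing " ...".

```r
# Assumed input: the question's string without the trailing " ..."
dna <- "GGCTTATTTAACGGGCAGATATACGCTGGGCAAATC"

chars <- unlist(strsplit(dna, ""))   # one element per character
fac   <- factor(chars)               # levels are the sorted unique letters
codes <- as.integer(fac)             # integer code of each character

# Look each code up in the level table to recover the original letters
decoded <- levels(fac)[codes]
identical(decoded, chars)
```

Indexing levels(fac) by the codes is the inverse of the factor conversion, so nothing is lost by working with the integers.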

The levels can also be specified explicitly using the levels= argument, in which case any other characters become NA and can optionally be eliminated using na.omit(...).

fac <- factor(unlist(strsplit(line1, "")), levels = c("A", "C", "G", "T"))
as.numeric(fac)
## [1]  3  3  2  4  4  1  4  4  4  1  1  2  3  3  3  2  1  3  1  4  1  4  1  2  3  2  4  3  3  3  2  1  1  1  4  2 NA NA NA NA
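The na.omit(...) step mentioned above is not shown, so here is a minimal sketch of it. The short input s is a hypothetical stand-in for a string that ends in characters that are not part of the data.

```r
# Hypothetical short input ending in non-data characters
s <- "GGCT ..."

# Only A, C, G and T are valid levels; everything else becomes NA
fac <- factor(unlist(strsplit(s, "")), levels = c("A", "C", "G", "T"))

# Drop the NA entries, then take the integer codes
codes <- as.integer(na.omit(fac))
codes
```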

Note

The input in the question is the following. Possibly the last 4 characters were not intended to be part of the data, but if so it ought to have been written without them so that others don't have to edit it. In any case the code above should work.

line1 = "GGCTTATTTAACGGGCAGATATACGCTGGGCAAATC ..." 

To convert that list to a character vector you can just take its first element:

x <- strsplit(line1, "")
x <- x[[1]]
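For context, a quick sketch of why the [[1]] is needed: strsplit returns a list with one element per input string, so a single string yields a list of length 1 whose only element is the character vector. The names s, parts and chars are hypothetical.

```r
s <- "GGC"                # hypothetical short input

parts <- strsplit(s, "")  # a list, one element per input string
class(parts)
length(parts)

chars <- parts[[1]]       # the character vector inside the list
class(chars)
length(chars)
```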

To make it a column of a df you can either convert the vector:

x <- as.data.frame(x)

Or just do it directly from the first line:

x <- as.data.frame(strsplit(line1, ""))

That'll give it an ugly column header, which you can fix with:

names(x)[1] <- 'whatever'

Or again directly in the one call:

x <- as.data.frame(strsplit(line1, ""), col.names = 'whatever')
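The question also asks for a single row rather than a column. One possible way, sketched here with hypothetical names s, chars and row_df, is to transpose the character vector into a 1-row matrix before converting it.

```r
s <- "GGCTTA"                     # hypothetical short input

chars <- strsplit(s, "")[[1]]     # character vector, one letter each

# t() turns the vector into a 1 x n matrix, which becomes a 1-row data frame
row_df <- as.data.frame(t(chars))
dim(row_df)
```

The columns get default names (V1, V2, ...), which can be changed with names(row_df) in the same way as for the column version.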

The question seems to ask for the output of dput but this is seldom needed.

x <- strsplit(line1, "")[[1]]
dput(x)
#c("G", "G", "C", "T", "T", "A", "T", "T", "T", "A", "A", "C", 
#"G", "G", "G", "C", "A", "G", "A", "T", "A", "T", "A", "C", "G", 
#"C", "T", "G", "G", "G", "C", "A", "A", "A", "T", "C")

As for the question of how to get integers from the string, here is one way. The output is the ASCII codes of the letters in the original line1 string, printed in hexadecimal because charToRaw returns a raw vector.

charToRaw(line1)
# [1] 47 47 43 54 54 41 54 54 54 41 41 43 47 47 47 43 41 47 41 54 41 54 41
#[24] 43 47 43 54 47 47 47 43 41 41 41 54 43
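If decimal integers are wanted rather than raw bytes, a minimal sketch is to wrap the result in as.integer, or to use utf8ToInt directly. The short input s is hypothetical.

```r
s <- "GGCT"                         # hypothetical short input

dec1 <- as.integer(charToRaw(s))    # raw bytes converted to decimal integers
dec2 <- utf8ToInt(s)                # code points, identical for ASCII input

dec1
identical(dec1, dec2)
```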

Data

line1 <- "GGCTTATTTAACGGGCAGATATACGCTGGGCAAATC"
