简体   繁体   中英

Copy number file format Issue

I have a problem with a .csv file from Copy number data. The original looks like this:

genes               Log2
PIK3CA,TET2          -0.35
MLH2,NRAS            0.54

And, what I need is:

genes                Log2

PIK3CA              -0.35
TET2                -0.35
MLH2                0.54
NRAS                0.54

I have tried many things by now, and they have not been successful. The file was created with CNVkit from gastric cancer samples. The file is much bigger, and the list of genes is longer, but this is essentially what I need to do in order to analyze our cnv data.

I have tried this:

awk -F , -v OFS='\t' 'NR == 1 || $0 > 0 {print $4}' copynumber.csv | less

Which is the closest i've got.

I use Linux, Ubuntu 16.04. I would appreciate if you could help me with an R or Python script, but, by now, any solution would be good.

We can use separate_rows from the tidyr package if you are using R.

library(tidyr)

dat2 <- dat %>% separate_rows(genes)
dat2
#    genes  Log2
# 1 PIK3CA -0.35
# 2   TET2 -0.35
# 3   MLH2  0.54
# 4   NRAS  0.54

DATA

dat <- read.table(text = "genes               Log2
PIK3CA,TET2          -0.35
                  MLH2,NRAS            0.54",
                  header = TRUE, stringsAsFactors = FALSE)

It can be easily achieved with python.
You can split a line by a space first and then iterate over multiple comma-separated fields.

filename = 'copynumber.csv'
with open(filename, 'r') as fp:
    header = fp.readline()
    print(header)
    for line in fp:
        keys, value = line.split()
        for key in keys.split(','):
            print(key + " " + value)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM