How to join multiple tab-delimited files using Python

I have multiple tab-delimited files with the same name in different folders, like this:

F:/RNASEQ2019/ballgown/abundance_est/RBRN02.sorted.bam\t_data.ctab
F:/RNASEQ2019/ballgown/abundance_est/RBRN151.sorted.bam\t_data.ctab

Each file has 5-6 columns in common, and I want to pick up two of them: Gene and FPKM. The Gene column is the same in every file; only the FPKM values differ. I want to take the Gene and FPKM columns from each file and build a master file like this:

Gene RBRN02 RBRN03 RBRN151
gene1   67  699     88
gene2   66  77      89

I tried this:

import os
import pandas as pd

path ="F:/RNASEQ2019/ballgown/abundance_est/"

files =[]

# r = root, d = directories, f = files

for r, d, f in os.walk(path):
    for file in f:
        if 't_data.ctab' in file:
            files.append(os.path.join(r, file))

df=[]

for f in files:
    df.append(pd.read_csv(f, sep="\t"))

But this does not merge the files side by side. How do I get the format shown above? Please help.

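IIUC, you can read all the files with a simple list comprehension and combine them with pd.concat (note that it stacks the frames row-wise by default; pass axis=1 to place them side by side):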

dfs = [pd.read_csv(f,sep='\t') for f in files]
df = pd.concat(dfs)
print(df)

Or as a one-liner:

df = pd.concat([pd.read_csv(f,sep='\t') for f in files])
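To get the side-by-side layout from the question (one Gene column, one FPKM column per sample), something along these lines might work. This is only a sketch under a few assumptions: the column names really are Gene and FPKM as described in the question, each Gene value appears once per file, and the sample name can be taken from the folder name in front of ".sorted.bam". The helpers sample_name and build_fpkm_table are hypothetical names, not part of pandas.

```python
import re

import pandas as pd


def sample_name(path):
    """Extract e.g. 'RBRN02' from '.../RBRN02.sorted.bam/t_data.ctab'."""
    match = re.search(r"(\w+)\.sorted\.bam", str(path))
    if match is None:
        raise ValueError(f"cannot find a sample name in {path!r}")
    return match.group(1)


def build_fpkm_table(files, gene_col="Gene", value_col="FPKM"):
    """Return one frame with a gene column plus one FPKM column per file.

    Assumes the column names from the question and unique gene labels
    within each file.
    """
    columns = []
    for path in files:
        # Read only the two columns of interest from the tab-delimited file.
        frame = pd.read_csv(path, sep="\t", usecols=[gene_col, value_col])
        # Index by gene so that concat aligns rows on the gene label,
        # and name the series after the sample it came from.
        series = frame.set_index(gene_col)[value_col].rename(sample_name(path))
        columns.append(series)
    # axis=1 places the samples next to each other instead of stacking them.
    table = pd.concat(columns, axis=1)
    return table.rename_axis(gene_col).reset_index()
```

With the files list from the question, build_fpkm_table(files).to_csv("master.tsv", sep="\t", index=False) would write the master file ("master.tsv" is just a placeholder name). Because the rows are aligned on the gene label rather than on row position, a gene missing from one file ends up as NaN in that sample's column.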

Using datatable, you can read multiple files at once by specifying a pattern:

import datatable as dt
dfs = dt.fread("F:/RNASEQ2019/ballgown/abundance_est/**/t_data.ctab",
               columns={"Gene", "FPKM"})

If the pattern matches multiple files, this will produce a dictionary where each key is the name of a file and the corresponding value is the content of that file, parsed into a frame. The optional columns parameter limits which columns are read.

In your case, it seems you want to rename the FPKM column of each frame after the file it came from, so you could do something like this:

import re

frames = []
for filename, frame in dfs.items():
    mm = re.search(r"(\w+)\.sorted\.bam", filename)
    frame.names = {"FPKM": mm.group(1)}
    frames.append(frame)

In the end, you can cbind the list of frames:

df = dt.cbind(frames)

If you need to work with a pandas dataframe, you can convert easily: df.to_pandas().
