
How to use dictionaries with a list to synthesize and print inputs from multiple files into a single line for each key in output?

Let's say I have a file that looks like this:

Gene.name Experiment.1 Experiment.2
A1BG 0.031474 0.05776
ZNF621 0.091025 0.33516
ARHGAP12 0.97852 0.14098

and so on…

And another file that looks like this:

Gene Name Gene description Chromosome number Chromosome location
A1BG alpha-1B-glycoprotein 19 19q13.43
A2M alpha-2-macroglobulin 12 12p13.31
A3M alpha-3-macroglobulin 12 12p13.33

and so on…

I have made 2 dictionaries, one that matches the gene name (key) with the gene annotation/description (value), and another that matches the gene name (key) with gene chromosome number (value).

My goal is to make an output file where I take the first table (the one with Experiment.1 and Experiment.2 as the columns) and append the gene chromosome and gene annotation information to the table for each appropriate gene, using the dictionaries I have created. So essentially, this would lead to an output file in the following format for every gene present in both files. If a gene is missing from one of the files, the last 2 fields should be N/A (like the second data row in the example below).

Gene.name Experiment.1 Experiment.2 Gene description Chromosome number
A1BG 0.031474 0.05776 alpha-1B-glycoprotein 19
ZNF621 0.091025 0.33516 N/A N/A

I have set my dictionaries up in the following manner:

infile = open("human_gene_annotations.txt", "rt")

#separate header
gene_header = infile.readline()

#gene annotation dict
gene_annotations = {}
#use for loop to fill
for line in infile:
    line = line.rstrip()
    information = line.split("\t") 
    gene_annotations[information[0]] = {"Gene Description": information[1]}
#close infile 
infile.close()


#open infile again for second dictionary 
infile = open("human_gene_annotations.txt", "rt")
#separate header
gene_header = infile.readline()

#gene chroms dict
gene_chroms = {}
#use for loop to fill
for line in infile:
    line = line.rstrip()
    info_chrom = line.split("\t") 
    gene_chroms[info_chrom[0]] = {"Chromosome Number": info_chrom[2]}
#close infile 
infile.close()

I have parsed the data from the first table (the one from the experiments) into lists like so:

genes = [] 
exp1values = []
exp2values = []

for line in infile:
    line = line.rstrip()
    fields = line.split("\t") # this will split the line we read by tab, thus by "column"
    genes.append(fields[0])
    exp1values.append(fields[1])
    exp2values.append(fields[2])

Why not create a dictionary for the first table as well?

I am reusing the code block you wrote for the second table, with one exception: as the value in the gene description dictionary and in the chromosome number dictionary, I store just the plain value (the description or the chromosome number) rather than a nested dictionary with a text label.

infile = open("human_gene_annotations.txt", "rt")

#separate header
gene_header = infile.readline()

#gene annotation dict
gene_annotations = {}
#use for loop to fill
for line in infile:
    line = line.rstrip()
    information = line.split("\t")
    gene_annotations[information[0]] = information[1] 
#close infile 
infile.close()


#open infile again for second dictionary 
infile = open("human_gene_annotations.txt", "rt")
#separate header
gene_header = infile.readline()

#gene chroms dict
gene_chroms = {}
#use for loop to fill
for line in infile:
    line = line.rstrip()
    info_chrom = line.split("\t") 
    gene_chroms[info_chrom[0]] = info_chrom[2]
#close infile 
infile.close()
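
Since both dictionaries come from the same file, they could also be filled in a single pass. The following is only a sketch of that idea: the helper name `load_annotations` is made up for illustration, and an in-memory `io.StringIO` with rows from the question stands in for `human_gene_annotations.txt`, which is assumed to be tab-separated.

```python
import io

# Hypothetical stand-in for human_gene_annotations.txt (rows taken from the question)
sample = io.StringIO(
    "Gene Name\tGene description\tChromosome number\tChromosome location\n"
    "A1BG\talpha-1B-glycoprotein\t19\t19q13.43\n"
    "A2M\talpha-2-macroglobulin\t12\t12p13.31\n"
)

def load_annotations(handle):
    """Return (gene -> description, gene -> chromosome number) from one pass."""
    descriptions = {}
    chroms = {}
    handle.readline()  # skip the header row
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        descriptions[fields[0]] = fields[1]
        chroms[fields[0]] = fields[2]
    return descriptions, chroms

gene_annotations, gene_chroms = load_annotations(sample)
print(gene_annotations["A1BG"], gene_chroms["A1BG"])
```

With a real file, the same helper could be called inside `with open("human_gene_annotations.txt", "rt") as infile:`, which also closes the file automatically.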

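Now for the first table I will make two more dictionaries, one per experiment:

# "experiment_values.txt" is a placeholder name; use the path of your first table
infile = open("experiment_values.txt", "rt")
#separate header
exp_header = infile.readline()

exp1map = {}
exp2map = {}

for line in infile:
    line = line.rstrip()
    fields = line.split("\t") # this will split the line we read by tab, thus by "column"
    exp1map[fields[0]] = fields[1]
    exp2map[fields[0]] = fields[2]
#close infile
infile.close()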
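
As a possible variation, a single dictionary could hold both measurements per gene instead of two parallel dictionaries. This is just a sketch under assumptions: `load_experiments` is a hypothetical helper, and the sample rows from the question are read from an in-memory `io.StringIO` rather than a real file.

```python
import io

# Hypothetical stand-in for the experiments table (rows taken from the question)
sample = io.StringIO(
    "Gene.name\tExperiment.1\tExperiment.2\n"
    "A1BG\t0.031474\t0.05776\n"
    "ZNF621\t0.091025\t0.33516\n"
)

def load_experiments(handle):
    """Return gene -> (Experiment.1 value, Experiment.2 value), values kept as strings."""
    handle.readline()  # skip the header row
    values = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        values[fields[0]] = (fields[1], fields[2])
    return values

exp_values = load_experiments(sample)
exp1, exp2 = exp_values["ZNF621"]
print(exp1, exp2)
```

Because dictionaries keep insertion order in Python 3.7 and later, iterating over `exp_values` visits the genes in the same order as the input file.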

I do not know exactly how you want the output table formatted, but I am assuming you want to write the data to a tab-separated file:

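#Create a unique set of all genes from both table 1 and table 2
all_genes = set().union(*[exp1map.keys(),gene_annotations.keys()])

with open('output', 'w') as f:
   f.write('\t'.join(['Gene.name','Experiment.1','Experiment.2','Gene description','Chromosome number']) + '\n')

   # a set has no defined order, so sort the genes for a reproducible output
   for gene in sorted(all_genes):
      # fall back to 'N/A' when a gene is missing from a table;
      # a None default would make '\t'.join raise a TypeError
      exp1=exp1map.get(gene,'N/A')
      exp2=exp2map.get(gene,'N/A')
      desc=gene_annotations.get(gene,'N/A')
      chrom=gene_chroms.get(gene,'N/A')

      # the gene name itself is the first column of each row
      f.write('\t'.join([gene,exp1,exp2,desc,chrom]) + '\n')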

I could not test my code as I did not have the dataset, but I think it solves your problem. Let me know if you need more help or if I have made a mistake.
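
For reference, here is a small end-to-end sketch that can be run without the real dataset. It makes several assumptions: both inputs are tab-separated, the sample rows from the question are supplied through `io.StringIO`, and `annotate` is a hypothetical helper name. Unlike the union of keys used above, it walks the experiments table in its original order, so genes that appear only in the annotation file (such as A2M) are not written.

```python
import io

# Hypothetical in-memory stand-ins for the two input files (rows from the question)
experiments = io.StringIO(
    "Gene.name\tExperiment.1\tExperiment.2\n"
    "A1BG\t0.031474\t0.05776\n"
    "ZNF621\t0.091025\t0.33516\n"
)
annotations = io.StringIO(
    "Gene Name\tGene description\tChromosome number\tChromosome location\n"
    "A1BG\talpha-1B-glycoprotein\t19\t19q13.43\n"
    "A2M\talpha-2-macroglobulin\t12\t12p13.31\n"
)

def annotate(exp_handle, ann_handle, out_handle, missing="N/A"):
    """Append description and chromosome number to every experiments row."""
    # Build gene -> (description, chromosome number) from the annotation table
    ann_handle.readline()  # skip the annotation header
    lookup = {}
    for line in ann_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            lookup[fields[0]] = (fields[1], fields[2])

    # Reuse the experiments header and add the two new column names
    header = exp_handle.readline().rstrip("\n").split("\t")
    out_handle.write("\t".join(header + ["Gene description", "Chromosome number"]) + "\n")

    for line in exp_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # Genes without an annotation get the placeholder in both extra columns
        desc, chrom = lookup.get(fields[0], (missing, missing))
        out_handle.write("\t".join(fields[:3] + [desc, chrom]) + "\n")

output = io.StringIO()
annotate(experiments, annotations, output)
result = output.getvalue()
print(result, end="")
```

To use real files, the three `io.StringIO` objects could be replaced by handles from `open(...)`, with the output handle opened in `"w"` mode.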
